Older blog entries for lmb (starting at number 106)

My colleague Tim has drawn awesome cartoons to illustrate my last cluster zombie story on why you need STONITH (node fencing). Clusters and the undead, I spot an upcoming theme for my stories ...

30 Mar 2010 (updated 30 Mar 2010 at 11:04 UTC) »

Why you need STONITH

A very common fallacy when setting up High-Availability clusters - be it with Pacemaker + corosync, Linux-HA, the Red Hat Cluster Suite, or anything else - is thinking that your setup, despite all the warnings in the documentation or in the logfiles, does not require node fencing.

What is node fencing?

Fencing is a mechanism by which the "surviving" nodes in the cluster make sure that the node(s) that have been evicted from the cluster are truly gone. This is also referred to as node isolation, or, in a very descriptive metaphor, STONITH ("Shoot the other node in the head"). This mechanism is not just "fire and forget", but the cluster software will wait for a positive confirmation from it before proceeding with resource recovery.

But the node has already failed, otherwise it would not have been evicted, so why is this necessary, you ask?

The key here is the distinction between appearances and reality: a complete loss of communication with a node looks, to all other nodes, as if that node has disappeared. Since you, like the obedient administrator that you are, have configured redundant network links, the chance of this happening is really slim, right? But a network split is not the only possible cause. In fact, the node might still be around, just waiting to come out of a kernel hang, or hiding behind firewall rules, ready to spew a bunch of corrupted data onto your shared state.

In short, node fencing/isolation/STONITH ensures the integrity of your shared state by turning a mere, if justified, suspicion into confirmed reality.
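
To make this slightly less abstract for the Pacemaker crowd, here is a minimal sketch, in crm shell syntax, of what enabling fencing can look like; the node name, address and IPMI credentials are invented, and the right stonith plugin depends entirely on your hardware:

# fencing is only honoured when this property is on (it is the default)
property stonith-enabled="true"
# one fencing device per node, here via the node's IPMI management board
primitive fence-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.1.10" \
               userid="admin" passwd="secret" interface="lan"
# do not let a node run the very device that is meant to shoot it
location l-fence-node1 fence-node1 -inf: node1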

(Pacemaker clusters also use this mechanism for escalated error recovery: if Pacemaker has instructed a node to release a service (by stopping it), but that operation fails, the service is essentially "stuck" on that node. The semantics of the "stop" operation mandate that it must not fail, so a failed stop indicates a more fundamental problem on that node. Hence, the default is to stop all other resources on that node, move them elsewhere, and fence the node - rebooting it tends to be rather effective at stopping anything that might have been stuck. This can be disabled per resource if you do not want some low-priority failure to shift high-priority resources around, though.)
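
For illustration (the Dummy resource below is purely a placeholder), that per-resource opt-out hangs off the stop operation's on-fail setting, which defaults to fencing when a fencing device is available:

# a failed "stop" escalates to fencing by default; for a low-priority
# resource you can instead leave it blocked in place on that node
primitive lowprio-demo ocf:heartbeat:Dummy \
        op stop interval="0" timeout="60s" on-fail="block"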

This is all very technical. So let me tell you a story with several possible endings to illustrate.

Story time!

Once upon a time, three friends were sitting huddled around a fire, peacefully eating their cookies. It was a tough time: the world was out to get them, a zombie infection was spreading, they couldn't trust anyone outside their trusted cluster of friends. They were always watchful and paid attention to each other.

Suddenly, one of the three stops responding to the conversation they were having. How do you proceed?

  1. My cluster of friends does not require such a crude mechanism! He'll have been careful not to get infected! If he stops responding, he will simply be dead! So you ignore the problem - but then your former friend revives, spreads his infection to your cookie stack, starts clobbering you with a club to eat your brains, and his howl gives away your location to all his new friends, who come down on you with the intent of eating your brains.
  2. You use an unloaded gun to shoot your friend - the trigger responds reassuringly. Your former friend revives, and it is all about eating your brains again.
  3. You kindly tap your friend on the shoulder, and suggest that he please commit suicide. Your former friend revives, snaps at your tapping hand, and starts eating your brains.

  4. You speak a pre-agreed upon code word, a tiny bomb goes off in the head of your friend, blows his brains out, and he drops on the spot. The grue does not eat you. (In fact, the mechanism monitoring his brain probably has already blown him up, but you speak the code word anyway to make sure.)

  5. You take that crude, trusty shotgun and blow his brains out, aiming away from the stack of cookies. The grue does not eat you.

So what?

In order, we have gone through "I do not need STONITH, or have disabled it", "I used the null mechanism intended only for testing", "I used an ssh-based mechanism", the recommended "poison-pill mechanism with hardware watchdog support" (such as external/sbd in Pacemaker environments), and the time-tested "talk to a network power switch, management board, etc., to cut the power" methods.
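
For reference, the poison-pill variant is typically declared to Pacemaker along these lines; the shared partition here is made up, and the sbd daemon itself must additionally run on every node, watching that partition:

# a tiny shared partition that all nodes can read and write;
# sbd exchanges its poison-pill messages through it
primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/sdc1"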

Pacemaker's escalated error recovery could be likened to your friend telling you that despite his best attempts, his wound has become infected (and he can't bring himself to cut off his hand); he bravely gives away his equipment to you, kneels down, says goodbye, and you blow his brains out.

Does that drive the point home? How would you like to survive armageddon? Of course, it is always possible that you have a secret liking for becoming a zombie, and crumbling (instead of eating) all your cookies.

In this case, talk to your two friends about appropriate therapy.

29 Oct 2009 (updated 29 Oct 2009 at 11:19 UTC) »

Here is another tip on how to write your OpenAIS/Pacemaker configuration in a simpler fashion; this applies to the SUSE Linux Enterprise 11 High-Availability Extension too, of course.

For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2 mount on top, you need to configure clones for DLM, O2CB, and cLVM2, a resource to activate the LVM2 volume group, and a Filesystem resource to mount the file system. Add in all the dependencies needed, and you end up with a configuration pretty much like this (shown in CRM shell syntax, which is already much more concise than the raw XML):


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
clone c-ocfs2-2 ocfs2-2 \
        meta target-role="Started" interleave="true"
clone clvm-clone clvm \
        meta target-role="Started" interleave="true" ordered="true"
clone dlm-clone dlm \
        meta interleave="true" ordered="true" target-role="Stopped"
clone o2cb-clone o2cb \
        meta target-role="Started" interleave="true" ordered="true"
clone vg1-clone vg1 \
        meta target-role="Started" interleave="true" ordered="true"
colocation colo-clvm inf: clvm-clone dlm-clone
colocation colo-o2cb inf: o2cb-clone dlm-clone
colocation colo-ocfs2-2 inf: c-ocfs2-2 o2cb-clone
colocation colo-ocfs2-2-vg1 inf: c-ocfs2-2 vg1-clone
colocation colo-vg1 inf: vg1-clone clvm-clone
order order-clvm inf: dlm-clone clvm-clone
order order-o2cb inf: dlm-clone o2cb-clone
order order-ocfs2-2 inf: o2cb-clone c-ocfs2-2
order order-ocfs2-2-vg1 inf: vg1-clone c-ocfs2-2
order order-vg1 inf: clvm-clone vg1-clone

That's quite a mouthful, and it becomes more cumbersome with every filesystem you add.

However, there is a little-known feature - you can actually clone a resource group:


primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
group base-group dlm o2cb clvm vg1 ocfs2-2
clone base-clone base-group \
        meta interleave="true"

I think this speaks for itself: roughly 20 lines of configuration saved. You will also find that the crm_mon output is much simpler and shorter, allowing you to see more of the cluster status in one go.
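
To illustrate (the extra logical volume and mount point below are made up), a second OCFS2 filesystem now costs one additional primitive and one extra word in the group definition, instead of yet another clone plus two more constraints:

primitive ocfs2-3 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2-data" directory="/ocfs2-3" fstype="ocfs2"
group base-group dlm o2cb clvm vg1 ocfs2-2 ocfs2-3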

Today I'd like to briefly introduce a new safety feature in Pacemaker.

Many times, we have seen customers and users complain that they thought they had set up their cluster correctly, but then resources were not started elsewhere when they killed one of the nodes. With OCFS2 or clvmd, they would even see access to the filesystem on the surviving nodes block, with processes, including kernel threads, ending up in the dreaded "D" state! Surely this must be a bug in the cluster software.

Usually, these scenarios escalate fairly quickly, because customers tend to test recovery only shortly before they want to deploy, or discover the problem after they have already gone into production. Neither is a good time for clear thinking.

However, most of these scenarios have a common misconfiguration: no fencing defined. Now, fencing is essential to data integrity, in particular with OCFS2, so the cluster refuses to proceed until fencing has completed; the blocking behaviour is actually correct. The system would warn about this at "ERROR" priority in several places.

Yet it became clear that something needed to be done; people do not like to read their logfiles, it seems. Inspired by a report by Jo de Baer, I thought it would be more convenient if the resources did not even start in the first place when such a gross misconfiguration is detected, and Andrew agreed.

The resulting patch is very short, but effective. Such misconfigurations now fail early, without causing the impression that the cluster might actually be working.

This certainly does not prevent all errors; it cannot directly detect whether fencing is configured properly and actually works, which is too much for a poor policy engine to decide. But we can try to protect some administrators from themselves.
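
If you would like to check your own cluster before it checks you, one way - assuming a reasonably current Pacemaker and the crm shell - is to validate the live configuration and to look for fencing resources explicitly:

# validate the live CIB; -V shows the warnings and errors in detail
crm_verify -L -V
# and confirm that at least one stonith-class resource is actually defined
crm configure show | grep stonith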

(As time progresses, we will perhaps add more such low-hanging fruit to make the cluster "more obvious" to configure. Still, I would hope that going forward, more administrators would at least try to read and understand the logs - as you can see from the patch, the message was already very clear before, and "ERROR:" messages definitely should catch any administrator's attention.)

It is with the greatest pleasure that I am able to announce that Novell has just posted the documentation for setting up OpenAIS, Pacemaker, OCFS2, cLVM2, DRBD, based on SUSE Linux Enterprise High-Availability 11 - but equally applicable to other users of this software stack.

It is a work in progress; the up-to-date DocBook sources will also be made available under the LGPL in a Mercurial repository in the very near future, and we hope to turn this into a community project as well, one day providing the most complete documentation coverage for clustering on Linux!

  • So our new test cluster environment is a 16 node HP blade center, which pleases me quite a bit. The blades all have a hardware watchdog card, which of course makes perfect sense for a cluster to use.
  • However, the attempt to set the timeout to 5s was thwarted by the kernel message
    hpwdt: New value passed in is invalid: 5 seconds.
  • So I dived into hpwdt.c, to find:
    static int hpwdt_change_timer(int new_margin)
    {
            /* Arbitrary, can't find the card's limits */
            if (new_margin < 30 || new_margin > 600) {
                    printk(KERN_WARNING "hpwdt: New value passed in is invalid: %d seconds.\n",
                           new_margin);
                    return -EINVAL;
            }
  • Okay, that can happen. Sometimes driver writers have to make guesses when the vendor is not cooperative or unavailable. So who wrote the driver?
    * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
  • ...

I prefer to ignore Christmas and the madness they call holidays, but I would like to close the year with a series of three questions, starting today:

  1. What can Open Source (and/or Linux) contribute to making the world a better place? Think of developing nations and the real large issues, as well as the slightly smaller ones.

Please feel free to e-mail me your answers to lmb at suse dot de, but this is not required to follow this experiment.

15 Oct 2008 (updated 15 Oct 2008 at 13:28 UTC) »
  • An article by heise open covers the Linux Kongress, including my presentation on the convergence of cluster stacks, even though it represents my message as slightly more tentative than I intended it to be. But maybe I am too optimistic. For what it is worth, here is a picture of the slide where I outlined the components in the joint stack, which heise open calls a "good mix from all sources."
  • It is possibly quite important to note that this is my understanding of the results and goals; even though I believe we had good buy-in from the development community, it should not be understood as a promise or commitment (or lack thereof) by Red Hat or Novell or anyone else to deliver this in the Enterprise distributions in particular, nor as saying that there will be any loss of support for current configurations. If I could speak for both Red Hat and Novell, I would be earning a hell of a lot more money. (Some initial feedback to my blog entry here made me add this paragraph; I did discuss this in the presentation, but it is not captured on the slide shown.)
14 Oct 2008 (updated 14 Oct 2008 at 12:48 UTC) »
  • Lukas Chaplin of Linux-Lancers.com, a Linux recruiting and placement agency, has interviewed me about working from a home office. This is not yet as pervasive elsewhere as in the Open Source environment, which is really a shame.
  • Of course, before going to Lukas you should first check whether Novell & SuSE can offer you a new challenge!
  • It's been a while since I blogged, so I have two conference reports as well, starting with the Cluster Developer Summit in Prague, 2008-09-28 - 2008-10-02. (See the link for Fabio's report.)

    This Summit was organized by Fabio from Red Hat and hosted by Novell, with attendees from Oracle, Atix, NTT Japan and others, whom Lon captured in this picture. It is my honest belief that within a year or two, we shall have one single cluster stack on Linux; totally awesome! It is amazing how much progress one can make when one is not stuck on one's own old code, but willing to select the best of breed.

    I think we have come a long way in the last ten years; having explored several different paths through concurrent evolution, we are now seeing more and more convergence as there is less and less justification for the redundant effort expended. Dogs, cats, and mice eating together ... It also reinforced my opinion that small, focused developer events can be exceptionally productive.

  • At Linux Kongress 2008 in beautiful Hamburg, there were many tutorials and sessions where Pacemaker + heartbeat were used to build high-availability clusters. In my own session, I presented the last year or so of development on Pacemaker and heartbeat, and of course summarized the results from the Cluster Developer Summit.

    I also learned about a neat trick Samba's CTDB plays with TCP to make fail-over faster; of course, thanks to this being Open Source, they were able to contribute it to the community instead of reinventing their own cluster stack. (Haha, just kidding, of course they rolled their own - this is Open Source after all.) However, it should be possible to copy it and integrate it as a generic function for IP address fail-over. Cool stuff.

    I also very much enjoyed dinner with James, Jonathan, Andreas, Lars (Ellenberg), and Kay - who lives in Hamburg, but whom I only see at conferences ... See the interview above about working from a home office!

  • Miguel: you can use getsockopt(sockfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) (where cred is a struct ucred) to find out the pid and uid of the peer from within the server.
