My colleague Tim has drawn awesome cartoons to illustrate my last cluster zombie story on why you need STONITH (node fencing). Clusters and the undead, I spot an upcoming theme for my stories ...
Why you need STONITH
A very common fallacy when setting up High-Availability clusters - be it with Pacemaker + corosync, Linux-HA, the Red Hat Cluster Suite, or another stack - is to think that your setup, despite all the warnings in the documentation and the logfiles, does not require node fencing.
What is node fencing?
Fencing is a mechanism by which the "surviving" nodes in the cluster make sure that the node(s) evicted from the cluster are truly gone. This is also referred to as node isolation or, in a very descriptive metaphor, STONITH ("Shoot The Other Node In The Head"). It is not just "fire and forget": the cluster software waits for positive confirmation that fencing succeeded before proceeding with resource recovery.
But it has already failed, otherwise it would not have been evicted, so why would this be necessary, you ask?
The key here is the distinction between appearance and reality: a complete loss of communication with a node looks, to all other nodes, as if the node has disappeared. Since you, like the obedient administrator that you are, have configured redundant network links, the chance of this happening is really slim, right? But that is not the only possible cause. In fact, the node might still be around, just waiting to come out of a kernel hang, or hiding behind firewall rules, ready to spew a bunch of corrupted data onto your shared state.
In short, node fencing/isolation/STONITH ensures the integrity of your shared state by turning a mere, if justified, suspicion into confirmed reality.
(Pacemaker clusters also use this mechanism for escalated error recovery: if Pacemaker has instructed a node to release a service (by stopping it), but that operation fails, the service is essentially "stuck" on that node. The semantics of the "stop" operation mandate that it must not fail, so a failure here indicates a more fundamental problem on that node. The default response, then, is to stop all other resources on that node, move them elsewhere, and fence the node - rebooting it tends to be rather effective at stopping anything that might have been stuck. This can be disabled per resource if you don't want some low-priority failure to shift high-priority resources around.)
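The per-resource escape hatch mentioned above is the stop operation's on-fail attribute. A minimal sketch in CRM shell syntax (the resource name and parameters here are hypothetical, not from an actual cluster):

```
primitive web ocf:heartbeat:apache \
        params configfile="/etc/apache2/httpd.conf" \
        op stop interval="0" timeout="60s" on-fail="block"
```

With on-fail="block", a failed stop no longer escalates to fencing the node; the resource is simply left where it is and flagged for the administrator, so a low-priority failure cannot shift your high-priority resources around.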
This is all very technical. So let me tell you a story with several possible endings to illustrate.
Story time!
Once upon a time, three friends were sitting huddled around a fire, peacefully eating their cookies. It was a tough time: the world was out to get them, a zombie infection was spreading, they couldn't trust anyone outside their trusted cluster of friends. They were always watchful and paid attention to each other.
Suddenly, one of the three stops responding to the conversation they were having. How do you proceed?
So what?
In order, we have gone through "I do not need STONITH, or have disabled it", "I used the null mechanism intended only for testing", "I used an ssh-based mechanism", the recommended "poison-pill mechanism with hardware watchdog support" (such as external/sbd in Pacemaker environments), and the time-tested "talk to a network power switch, management board etc to cut the power" methods.
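In CRM shell terms, the recommended poison-pill setup boils down to a single STONITH resource backed by the sbd daemon on each node. A hedged sketch (the device path is made up, and sbd itself must first be initialized on that shared disk and running on every node):

```
primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/my-shared-disk-part1"
property stonith-enabled="true"
```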
Pacemaker's escalated error recovery could be likened to your friend telling you that despite his best attempts, his wound has become infected (and he can't bring himself to cut off his hand); he bravely gives away his equipment to you, kneels down, says goodbye, and you blow his brains out.
Does that drive the point home? How would you like to survive Armageddon? Of course, it is always possible that you have a secret liking for becoming a zombie, and crumbling (instead of eating) all your cookies.
In this case, talk to your two friends about appropriate therapy.
Here is another tip on how to write your OpenAIS/Pacemaker configuration more simply; this applies to the SUSE Linux Enterprise 11 High-Availability Extension too, of course.
For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2 mount on top, you need to configure DLM, O2CB, and cLVM2 clones, a resource to start the LVM2 volume group, and Filesystem resources to mount the file system. Add in all the dependencies needed, and you end up with a configuration pretty much like this (shown in CRM shell syntax, which is already much more concise than the raw XML):
primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
clone c-ocfs2-2 ocfs2-2 \
        meta target-role="Started" interleave="true"
clone clvm-clone clvm \
        meta target-role="Started" interleave="true" ordered="true"
clone dlm-clone dlm \
        meta interleave="true" ordered="true" target-role="Stopped"
clone o2cb-clone o2cb \
        meta target-role="Started" interleave="true" ordered="true"
clone vg1-clone vg1 \
        meta target-role="Started" interleave="true" ordered="true"
colocation colo-clvm inf: clvm-clone dlm-clone
colocation colo-o2cb inf: o2cb-clone dlm-clone
colocation colo-ocfs2-2 inf: c-ocfs2-2 o2cb-clone
colocation colo-ocfs2-2-vg1 inf: c-ocfs2-2 vg1-clone
colocation colo-vg1 inf: vg1-clone clvm-clone
order order-clvm inf: dlm-clone clvm-clone
order order-o2cb inf: dlm-clone o2cb-clone
order order-ocfs2-2 inf: o2cb-clone c-ocfs2-2
order order-ocfs2-2-vg1 inf: vg1-clone c-ocfs2-2
order order-vg1 inf: clvm-clone vg1-clone

That's quite a bite, and it becomes cumbersome for every file system you add.
However, there is a little-known feature: you can actually clone a resource group:
primitive clvm ocf:lvm2:clvmd
primitive dlm ocf:pacemaker:controld
primitive o2cb ocf:ocfs2:o2cb
primitive ocfs2-2 ocf:heartbeat:Filesystem \
        params device="/dev/cluster-vg/ocfs2" directory="/ocfs2-2" fstype="ocfs2"
primitive vg1 ocf:heartbeat:LVM \
        params volgrpname="cluster-vg"
group base-group dlm o2cb clvm vg1 ocfs2-2
clone base-clone base-group \
        meta interleave="true"
I think this speaks for itself; the configuration shrinks by some 20 lines. You will also find that the crm_mon output is much simpler and shorter, allowing you to see more of the cluster status in one go.
Today I'd like to briefly introduce a new safety feature in Pacemaker.
Many times, we have seen customers and users complain that they thought they had correctly set up their cluster, but then resources were not started elsewhere when they killed one of the nodes. With OCFS2 or clvmd, they would even see access to the filesystem on the surviving nodes block, with processes, including kernel threads, ending up in the dreaded "D" state! Surely this must be a bug in the cluster software.
Usually, it turns out that these scenarios escalate fairly quickly, because customers tend to test recovery scenarios only shortly before they want to deploy - or find out after they have already deployed to production. Not a good time for clear thinking.
However, most of these scenarios have a common misconfiguration: no fencing defined. Now, fencing is essential to data integrity, in particular with OCFS2, so the cluster refuses to proceed until fencing has completed; the blocking behaviour is actually correct. The system would warn about this at "ERROR" priority in several places.
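For the power-switch/management-board style of fencing, the missing configuration can be as small as this CRM shell sketch (external/ipmi is just one example plugin; the host name, address, and credentials are made up, and the location constraint keeps a node from running its own fencing device):

```
primitive stonith-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.1.10" userid="admin" passwd="secret"
location l-stonith-node1 stonith-node1 -inf: node1
property stonith-enabled="true"
```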
Yet it became clear that something needed to be done; people do not like to read their logfiles, it seems. Inspired by a report by Jo de Baer, I thought it would be more convenient if the resources did not even start in the first place if such a gross misconfiguration was detected, and Andrew agreed.
The resulting patch is very short, but effective. Such misconfigurations now fail early, without causing the impression that the cluster might actually be working.
This certainly does not prevent all errors; the policy engine cannot directly detect whether fencing is configured properly and actually works - that is too much for it to decide. But we can try to protect some administrators from themselves.
(As time progresses, we will perhaps pick more such low-hanging fruit to make the cluster "more obvious" to configure. Still, I would hope that, going forward, more administrators would at least try to read and understand the logs - as you can see from the patch, the message was already very clear before, and "ERROR:" messages definitely should catch any administrator's attention.)
It is with the greatest pleasure that I am able to announce that Novell has just posted the documentation for setting up OpenAIS, Pacemaker, OCFS2, cLVM2, DRBD, based on SUSE Linux Enterprise High-Availability 11 - but equally applicable to other users of this software stack.
We understand it is a work in progress; the up-to-date DocBook sources will be made available under the LGPL in the very near future in a Mercurial repository, and we hope to turn this into a community project as well, one day providing the most complete documentation coverage for clustering on Linux!
hpwdt: New value passed in is invalid: 5 seconds.

static int hpwdt_change_timer(int new_margin)
{
	/* Arbitrary, can't find the card's limits */
	if (new_margin < 30 || new_margin > 600) {
		printk(KERN_WARNING
			"hpwdt: New value passed in is invalid: %d seconds.\n",
			new_margin);
		return -EINVAL;
	}
 * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
I prefer to ignore Christmas and the madness they call holidays, but I would like to close the year with a series of three questions, starting today:
Please feel free to e-mail me your answers to lmb at suse dot de, but this is not required to follow this experiment.
It's been a while since I blogged, so I have two conference reports as well, starting with the Cluster Developer Summit in Prague, 2008-09-28 - 2008-10-02. (See the link for Fabio's report.)
This Summit was organized by Fabio from Red Hat and hosted by Novell, with attendees from Oracle, Atix, NTT Japan, and others, whom Lon captured in this picture. It is my honest belief that within a year or two, we shall have a single cluster stack on Linux; totally awesome! It is amazing how much progress one can make if one is not stuck on one's own old code, but willing to select the best of breed.
I think we have come a long way in the last ten years; having explored several different paths through concurrent evolution, we are now seeing more and more convergence as there is less and less justification for the redundant effort expended. Dogs, cats, and mice eating together ... It also reinforced my opinion that small, focused developer events can be exceptionally productive.
At Linux Kongress 2008 in beautiful Hamburg, there were many tutorials and sessions where Pacemaker + heartbeat were used to build high-availability clusters. In my own session, I presented the last year or so of development on Pacemaker and heartbeat, and of course summarized the results from the Cluster Developer Summit.
I also learned about a neat trick Samba's CTDB plays with TCP to make fail-over faster; of course, thanks to this being Open Source, they were able to contribute it to the community instead of reinventing their own cluster stack. (Haha, just kidding, of course they rolled their own - this is Open Source after all.) However, it should be possible to copy it and integrate it as a generic function for IP address fail-over. Cool stuff.
I also very much enjoyed dinner with James, Jonathan, Andreas, Lars (Ellenberg), and Kay - who lives in Hamburg, but whom I only see at conferences ... See the interview about working from home offices!
Use getsockopt(sockfd, SOL_SOCKET, SO_PEERCRED, cred, &n) to find out the far-side pid and uid from within the server.
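Fleshed out, a minimal sketch in C (Linux-specific; the function and variable names are mine, not from the snippet above):

```c
#define _GNU_SOURCE           /* for struct ucred in <sys/socket.h> */
#include <stdio.h>
#include <sys/socket.h>

/* Query the credentials of the process on the far side of a
 * connected AF_UNIX socket and print them. */
int print_peer_creds(int sockfd)
{
    struct ucred cred;
    socklen_t n = sizeof(cred);

    if (getsockopt(sockfd, SOL_SOCKET, SO_PEERCRED, &cred, &n) < 0) {
        perror("getsockopt(SO_PEERCRED)");
        return -1;
    }
    printf("peer pid=%ld uid=%ld gid=%ld\n",
           (long)cred.pid, (long)cred.uid, (long)cred.gid);
    return 0;
}
```

Note that the kernel fills in the credentials as of the time the socket was connected, so the server does not have to trust anything the client sends.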