30 Mar 2010 lmb   » (Master)

Why you need STONITH

A very common fallacy when setting up High-Availability clusters - be it on Pacemaker + corosync, Linux-HA, Red Hat Cluster Suite, or something else - is thinking that your setup, despite all the warnings in the documentation or in the logfiles, does not require node fencing.

What is node fencing?

Fencing is a mechanism by which the "surviving" nodes in the cluster make sure that the node(s) that have been evicted from the cluster are truly gone. This is also referred to as node isolation, or, in a very descriptive metaphor, STONITH ("Shoot the other node in the head"). This mechanism is not "fire and forget": the cluster software waits for positive confirmation from it before proceeding with resource recovery.

But it has already failed, otherwise it would not have been evicted, so why would this be necessary, you ask?

The key here is the distinction between appearances and reality: a complete loss of communication with a node looks to all other nodes as if the node has disappeared. Since you, like the obedient administrator that you are, have configured redundant network links, the chance of this happening is really slim, right? But a network failure is not the only possible cause of lost communication. In fact, the node might still be around, just waiting to come out of a kernel hang, or hiding behind firewall rules, to spew a bunch of corrupted data to your shared state.

In short, node fencing/isolation/STONITH ensures the integrity of your shared state by turning a mere, if justified, suspicion into confirmed reality.
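For those who prefer code to prose, here is a toy sketch of the "wait for positive confirmation" rule described above. It is not how any real cluster stack is implemented; `Node`, `recover`, `active_twice` and the three fencing functions are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    powered_on: bool = True  # ground truth, which the peers cannot observe
    resources: list = field(default_factory=list)


def power_fence(node):
    """Hypothetical agent: cuts the power and confirms that it did."""
    node.powered_on = False
    return True


def null_fence(node):
    """Hypothetical test-only agent: claims success without doing anything."""
    return True


def unreachable_fence(node):
    """Hypothetical agent whose fencing device cannot be reached."""
    return False


def recover(lost, survivor, fence):
    """Start the lost node's resources on the survivor, but only after
    the fencing agent has positively confirmed that the node is gone."""
    if not fence(lost):
        # No confirmation: the node may still touch shared state,
        # so recovery is blocked instead of risking corruption.
        return []
    moved = list(lost.resources)
    survivor.resources.extend(moved)
    if not lost.powered_on:
        lost.resources.clear()  # a powered-off node runs nothing
    return moved


def active_twice(a, b):
    """Resources running on both nodes at once, i.e. corruption candidates."""
    return sorted(set(a.resources) & set(b.resources))
```

With `power_fence`, the resource ends up on exactly one node. With `unreachable_fence`, nothing is recovered, which is annoying but safe. With `null_fence`, the confirmation is a lie, and the same resource ends up active on both nodes.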

(Pacemaker clusters also use this mechanism for escalated error recovery; if Pacemaker has instructed a node to release a service (by stopping it), but that operation fails, the service is essentially "stuck" on that node. The semantics of the "stop" operation mandate that it must not fail, so this indicates a more fundamental problem on that node. Hence, the default process then would be to stop all other resources on that node, move them elsewhere, and fence the node - rebooting it tends to be rather effective at stopping anything that might have been stuck. This can be disabled per-resource if you don't want some low-priority failure to shift high-priority resources around, though.)
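The escalation just described can be sketched as a small planning function. This is only an illustration of the order of events, not Pacemaker's actual code or configuration syntax; `handle_stop_failure`, the `escalate` flag and the action names are hypothetical, and restarting the stuck resource elsewhere after the fence is my assumption about what happens next.

```python
def handle_stop_failure(node, resources, failed, escalate=True):
    """Plan the reaction to a failed "stop" of `failed` on `node`.

    `resources` lists everything currently running on that node.
    Returns a list of (action, target) tuples in execution order.
    """
    if not escalate:
        # Escalation disabled for this resource: it stays stuck where
        # it is, and nothing else on the node gets shifted around.
        return [("leave-in-place", failed)]

    others = [r for r in resources if r != failed]
    plan = [("stop", r) for r in others]
    plan += [("start-elsewhere", r) for r in others]
    # Rebooting the node gets rid of whatever refused to stop.
    plan.append(("fence", node))
    # Assumption: once the fence is confirmed, the stuck resource
    # can safely be started on another node.
    plan.append(("start-elsewhere", failed))
    return plan
```

For example, `handle_stop_failure("node1", ["ip", "db", "web"], "db")` first moves `ip` and `web` away, then fences `node1`, and only then brings `db` up elsewhere.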

This is all very technical. So let me tell you a story with several possible endings to illustrate.

Story time!

Once upon a time, three friends were sitting huddled around a fire, peacefully eating their cookies. It was a tough time: the world was out to get them, a zombie infection was spreading, they couldn't trust anyone outside their trusted cluster of friends. They were always watchful and paid attention to each other.

Suddenly, one of the three stops responding to the conversation they were having. How do you proceed?

  1. My cluster of friends does not require such a crude mechanism! He'll be careful not to have been infected! If he stops responding, he will simply be dead! You ignore the problem, but then your former friend revives, spreads his infection to your cookie stack, starts clobbering you with a club to eat your brains, and his howl gives away your location to all his new friends, who come down on you with the intent of eating your brains.
  2. You use an unloaded gun to shoot your friend - the trigger responds reassuringly. Your former friend revives, and it is all about eating your brains again.
  3. You kindly tap your friend on the shoulder, and suggest that he please commit suicide. Your former friend revives, snaps at your tapping hand, and starts eating your brains.
  4. You speak a pre-agreed code word, a tiny bomb goes off in the head of your friend, blows his brains out, and he drops on the spot. The grue does not eat you. (In fact, the mechanism monitoring his brain probably has already blown him up, but you speak the code word anyway to make sure.)
  5. You take that crude, trusty shotgun and blow his brains out, aiming away from the stack of cookies. The grue does not eat you.

So what?

In order, we have gone through "I do not need STONITH or have disabled it", "I used the null mechanism intended only for testing", and "I used an ssh-based mechanism", followed by the recommended "poison-pill mechanism with hardware watchdog support" (such as external/sbd in Pacemaker environments) and the time-tested "talk to a network power switch, management board, etc. to cut the power" methods.

Pacemaker's escalated error recovery could be likened to your friend telling you that despite his best attempts, his wound has become infected (and he can't bring himself to cut off his hand); he bravely gives away his equipment to you, kneels down, says goodbye, and you blow his brains out.

Does that drive the point home? How would you like to survive armageddon? Of course, it is always possible that you have a secret liking for becoming a zombie, and crumbling (instead of eating) all your cookies.

In this case, talk to your two friends about appropriate therapy.
