Advogato: Blog for lmb

Today I'd like to briefly introduce a new safety feature in Pacemaker.

Many times, we have seen customers and users complain that they thought they had set up their cluster correctly, but then resources were not started elsewhere when they killed one of the nodes. With OCFS2 or clvmd, they would even see access to the filesystem on the surviving nodes block, and processes, including kernel threads, end up in the dreaded "D" state! Surely this must be a bug in the cluster software.

Usually, these situations escalate fairly quickly, because customers tend to test recovery scenarios only shortly before they want to deploy, or discover the problem after they have already deployed to production. Not a good time for clear thinking.

However, most of these scenarios share a common misconfiguration: no fencing is defined. Fencing is essential to data integrity, in particular with OCFS2, so the cluster refuses to proceed until fencing has completed; the blocking behaviour is actually correct. The system already warns about this at "ERROR" priority in several places.

Yet it became clear that something needed to be done; people do not like to read their logfiles, it seems. Inspired by a report from Jo de Baer, I thought it would be more convenient if the resources did not even start in the first place when such a gross misconfiguration is detected, and Andrew agreed.

The resulting patch is very short, but effective. Such misconfigurations now fail early, without causing the impression that the cluster might actually be working.
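To illustrate the idea (this is not the actual patch, just a minimal sketch in Python), the check boils down to something like the following. All names here (`Resource`, `ClusterConfig`, `resources_may_start`) are hypothetical, and I am assuming a single "fencing enabled" flag plus a per-resource marker for fencing devices; the real policy engine works on the cluster configuration and is written in C.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    # Hypothetical stand-in for a configured cluster resource.
    name: str
    is_fencing_device: bool = False


@dataclass
class ClusterConfig:
    # Assumption: one global switch saying whether fencing is expected.
    fencing_enabled: bool = True
    resources: List[Resource] = field(default_factory=list)


def resources_may_start(config: ClusterConfig) -> bool:
    """Return False if fencing is expected but no fencing device is defined.

    This only checks that *something* is defined; it cannot tell whether
    the fencing device is configured properly or actually works.
    """
    if not config.fencing_enabled:
        # The administrator explicitly opted out of fencing.
        return True
    return any(r.is_fencing_device for r in config.resources)


# A cluster with a shared filesystem but no fencing device: refuse to start.
broken = ClusterConfig(resources=[Resource("ocfs2-fs")])

# The same cluster with a fencing device defined: allowed to start.
fixed = ClusterConfig(
    resources=[Resource("ocfs2-fs"), Resource("fence-dev", is_fencing_device=True)]
)
```

With this sketch, `resources_may_start(broken)` is false and `resources_may_start(fixed)` is true, so the misconfigured cluster never gives the impression of working in the first place.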

This certainly does not prevent all errors; it cannot directly detect whether fencing is configured properly and actually works, which is too much for a poor policy engine to decide. But we can try to protect some administrators from themselves.

(As time progresses, we will perhaps pick more such low-hanging fruit to make the cluster "more obvious" to configure. But still, I would hope that going forward, more administrators would at least try to read and understand the logs - as you can see from the patch, the message was already very clear before, and "ERROR:" messages definitely should catch any administrator's attention.)

20 Aug 2009 lmb » (Master)