Older blog entries for lmb (starting at number 102)

It is with the greatest pleasure that I am able to announce that Novell has just posted the documentation for setting up OpenAIS, Pacemaker, OCFS2, cLVM2, and DRBD, based on SUSE Linux Enterprise High-Availability 11 - but equally applicable to other users of this software stack.

We understand it is a work in progress; the up-to-date DocBook sources will also be made available under the LGPL in the very near future, in a Mercurial repository. We hope to turn this into a community project as well, one day providing the most complete documentation coverage for clustering on Linux!

  • So our new test cluster environment is a 16 node HP blade center, which pleases me quite a bit. The blades all have a hardware watchdog card, which of course makes perfect sense for a cluster to use.
  • However, the attempt to set the timeout to 5s was thwarted by the kernel message
    hpwdt: New value passed in is invalid: 5 seconds.
  • So I dived into hpwdt.c, to find:
    static int hpwdt_change_timer(int new_margin)
    {
            /* Arbitrary, can't find the card's limits */
            if (new_margin < 30 || new_margin > 600) {
                    printk(KERN_WARNING
                           "hpwdt: New value passed in is invalid: %d seconds.\n",
                           new_margin);
                    return -EINVAL;
            }
  • Okay, that can happen. Sometimes driver writers have to make guesses when the vendor is not cooperative or unavailable. So who wrote the driver?
    * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
  • ...

I prefer to ignore Christmas and the madness they call holidays, but would like to close the year with a series of three questions, starting today:

  1. What can Open Source (and/or Linux) contribute to making the world a better place? Think of developing nations and the real large issues, as well as the slightly smaller ones.

Please feel free to e-mail me your answers to lmb at suse dot de, but this is not required to follow this experiment.

15 Oct 2008 (updated 15 Oct 2008 at 13:28 UTC) »
  • An article by heise open covers the Linux Kongress, and also my presentation on the convergence of cluster stacks, even though they represent my message as slightly more tentative than I intended it to be. But maybe I am too optimistic. For what it is worth, here is a picture of the slide where I outlined the components in the joint stack, which heise open calls a "good mix from all sources."
  • It is quite important to note that this is my understanding of the results and goals. Even though I believe we had good buy-in in the development community, this should not be understood as a promise or commitment (or lack thereof) by Red Hat or Novell or anyone else to deliver this in the Enterprise distributions in particular, nor that there will be any loss of support for current configurations. If I could speak for both Red Hat and Novell, I would be earning a hell of a lot more money. (Some initial feedback to my blog entry here made me add this paragraph; I did discuss this in the presentation, but it is not captured on the slide shown.)
14 Oct 2008 (updated 14 Oct 2008 at 12:48 UTC) »
  • Lukas Chaplin of Linux-Lancers.com, a Linux recruiting and placement agency, has interviewed me about working from a home office. This is not yet as pervasive elsewhere as in the Open Source environment, which is really a shame.
  • Of course, before going to Lukas you should first check whether Novell & SuSE can offer you a new challenge!
  • It's been a while since I blogged, so I have two conference reports as well, starting with the Cluster Developer Summit in Prague, 2008-09-28 - 2008-10-02. (See the link for Fabio's report.)

    This Summit was organized by Fabio from Red Hat and hosted by Novell, with attendees from Oracle, Atix, NTT Japan and others, whom Lon captured in this picture. It is my honest belief that within a year or two, we shall have one single cluster stack on Linux; totally awesome! Amazing how much progress one can make if one is not stuck to one's own old code, but willing to select the best of breed.

    I think we have come a long way in the last ten years; having explored several different paths through concurrent evolution, we are now seeing more and more convergence as there is less and less justification for the redundant effort expended. Dogs, cats, and mice eating together ... It also reinforced my opinion that small, focused developer events can be exceptionally productive.

  • At Linux Kongress 2008 in beautiful Hamburg, there were many tutorials and sessions where Pacemaker + heartbeat were used to build high-availability clusters. In my own session, I presented the last year or so of development on Pacemaker and heartbeat, and of course summarized the results from the Cluster Developer Summit.

    I also learned about a neat trick Samba's CTDB plays with TCP to make fail-over faster; of course, thanks to this being Open Source, they were able to contribute it to the community instead of reinventing their own cluster stack. (Haha, just kidding, of course they rolled their own - this is Open Source after all.) However, it should be possible to copy it and integrate it as a generic function for IP address fail-over. Cool stuff.

    I also very much enjoyed dinner with James, Jonathan, Andreas, Lars (Ellenberg), and Kay - who lives in Hamburg, but whom I only see at conferences ... Refer to the working-from-home-office interview above!

  • Miguel: you can use getsockopt(sockfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) to find out the far-side pid and uid from within the server.
23 Aug 2008 (updated 23 Aug 2008 at 22:20 UTC) »
  • Hi all, long time no blog. But with the recent announcement of the Linux Kongress 2008 program, which will happen in my chosen home city Hamburg from 7th to 10th October, I have to share the joy:

    Not one, but three tutorials - in both English and German - explaining how to use Linux-HA with the CRM/Pacemaker as a high-availability cluster environment.

    Congratulations and thanks to Ralph Dehner, Lars Ellenberg, Joerg Jungermann, Maximilian Wilhelm!

    Also, a brief talk by myself on the future of HA on Linux, fresh from the Cluster Developer Summit in Prague.

    All in all, Linux Kongress has a very, very strong program this year, and I look forward to meeting you all in Hamburg - bring your umbrella!

  • On Monday, Hack Week 2008 begins. I will be working on shared storage-based fencing for heartbeat, and possibly some other projects relating to clustering.

    I also look forward in particular to the First Penguin Award candidates: the prize for the most daring failure. Failure is crucial to success; learning where the boundaries of our models and theories are is the foundation of science, and of successful design. Only by anticipating and overcoming failure is success possible. If you doubt this for a single moment, read Petroski: Success through Failure.

    As a member of the panel and obsessed with things going wrong, I hope your project contributes to our knowledge; the most valuable lesson of the whole week could well be learned from showing what does not work. And there will be a prize too! How good is that?

Jozef has posted a very cool solutions article describing how to build a highly-available load-balancing solution for any TCP-based network service (including mail, web, ftp, etcetera) using entirely Open Source components and of course all included with SUSE Linux Enterprise Server 10 SP2 - Linux-HA, Linux Virtual Server, and ldirectord. Rock on!

Of course, you could buy an expensive appliance instead ...
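To give a flavour of what such a setup looks like, here is an illustrative ldirectord configuration fragment for a round-robin-balanced HTTP service; all addresses, file names, and check strings below are made up for the example and are not taken from Jozef's article:

```
# /etc/ha.d/ldirectord.cf -- illustrative fragment, addresses made up
checktimeout = 10
checkinterval = 5
quiescent = yes

# One virtual HTTP service, balanced round-robin over two real servers
virtual = 192.168.1.100:80
        real = 192.168.1.10:80 gate
        real = 192.168.1.11:80 gate
        service = http
        scheduler = rr
        protocol = tcp
        checktype = negotiate
        request = "index.html"
        receive = "It works"
```

ldirectord periodically fetches the request page from each real server and removes a server from the LVS table when the expected response string is not returned.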

Bad Syabas. They manufacture the Popcorn Network Media Tank, and despite the device clearly running a Linux variant, they provide neither source code nor a written offer to supply it. I have kindly e-mailed their support, asking them to rectify the situation ASAP.

In this post, Alan Robertson discusses cluster stacks. This is interesting, but has some misleading points:

  • Linux-HA (with or without OpenAIS) supports the AIS membership APIs. This is not quite correct, insofar as the version of the APIs provided is close to ancient, and - worse - membership by itself is rather pointless: Linux-HA as-is does not provide the messaging or any of the other AIS APIs, so the membership alone does not mean that any AIS application could run.

  • Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols. Let's just keep this one in mind, we're going to need it below!

  • The Linux-HA CRM function is largely divided between the PE and TE - which are described below. The CRM has been split out from the Linux-HA heartbeat project by its developers; I'm not sure how Alan failed to mention this, as he has been objecting to it for the last few weeks ;-)

    Technically, the description is not quite right either. The CRM itself is a fairly important component, electing the transition coordinator, dealing with failed nodes, and implementing the state transitions at the cluster level. Its components include not only the Policy Engine and the Transitioner; the CIB itself is also part of the CRM modules.

  • It's interesting how the PE receives the largest share of criticism, while no comments are made about the scalability and performance of the messaging layer itself. Oh well. The PE actually is modularized and completes its task in several stages - the original design called for placement first and ordering later, as distinct steps - but the modules have a high inter-dependency, and in practice, it turned out not to be so easy; clear and robust interfaces are very hard to define. For a similar problem, look at how gcc "modularizes" its optimization steps.

    While the PE does perform round-robin load-balancing, full resource cost and load balancing attempts turn the problem into an exceptionally hard one; we considered this, and then postponed it until later. For now, our main goal is to keep services alive, and leave the load balancing to some external component which modifies our node weights; seems fairly modular to me, in fact.

    It's true that we might step towards modularization (again!) as we understand the problem more and more, but I object to the underlying assumption that we hadn't thought of all that before.

  • The LRM proxy communicates between the CRM and the LRMs on all the various machines. This function is currently built into the CRM. This architectural decision was based on expedience more than anything else. I wonder how else the CRM's TE is supposed to communicate with the LRMs, as needed to carry out the commands and retrieve status, if not by having some form of proxy/interface to them?

  • To support larger clusters this needs to be separated out, made more scalable, and more flexible. This would allow a large number of LRMs to be supported by a small number of LRM proxies. The CRM and its components (TE, CIB) clearly requires an interface to the LRMs, so I'm not quite sure how this could be separated out.

    My guess would be that he is referring to the idea of having the CRM manage nodes (virtual or physical) which are not full cluster members as containers for resources - and, supposedly, no longer suggesting to treat them as virtual cluster members at the membership level. Nice to see he's dropped that idea. Yet, as Alan likes to be given credit when he came up with something, maybe he should give credit for this as well ...? Just thinking.

  • In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies. I have absolutely no idea what this is supposed to suggest.

  • The description of the quorum daemon might imply the suggestion that Linux-HA supported general split-site clusters right now. As much as I wish it did, this is not true.

    And while quorum in two-node clusters is indeed problematic (because with one node down, the remaining node is always tied), the quorum server most certainly is not needed for two-node clusters, as fencing resolves this problem nicely, and has done so for years.

  • For a variety of reasons, kernel space doesn't have access to user-space cluster communications or membership.
    As a result, both the DLM and most cluster filesystems implement their own membership and communications.
    This is technically incorrect; OCFS2 has been instrumented to inherit the membership from user space, as has GFS2. (Or, in fact, their DLMs inherit this.)

    The discussion of case 1 neglects the detail that the "other" membership must also be told not to talk to the other node, same as case 2; in fact, each membership must be reduced to the common subset. The method described for case 2 indeed is not pretty, and - contrary to the claim - would not work right now, as the mechanisms do not exist:

  • Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case.

    This is quite certainly the most confusing message in this lecture. First, it is wrong today, even for Linux-HA: OCFS2 avoids this by inheriting the Linux-HA membership through the Filesystem resource agent.

    Second, by porting the CRM modules - now called Pacemaker - to run natively on top of OpenAIS, just as C-LVM2, GFS2, and OCFS2 will, we are finally on track to solving this properly and having everyone use the same membership.

    However, it should be noted that there has been exactly one person unhappy about this, who is now trying to sell it as if it was his idea, and not that he opposes it still - I wonder, who might that be?

I will further admit that it irks and offends me that Alan talks of the CRM as our work (as if he had been involved much in it), and explicitly mentions how he started the OCF in 2001, mentions IBM and Red Hat, yet completely fails to mention the contributions made by many Novell and SUSE engineers, most notably by Andrew Beekhof. Oh well.
