Older blog entries for alanr (starting at number 3)

2001/01/27 ===========================================================

0530 Woke up. Decided that this is 0730 NYC time, so it must be time to get up ;-) Checked mail. Went back to restructuring/enhancing the test code. Still having occasional problems with Python naming and modules, but now think I have a good strategy worked out for using it. Looks like the latest iteration of restructuring is now working. Guess I'd better go off and figure out what to do next. 0800 Enough for now. I made some pretty good progress on a couple of fronts.

1055 Joined Joe Barr on his recording bridge. He had 10 questions to ask, and I answered them as best I could. He's now interested in HA things and may write an article on it. He'll likely give me another call if he decides to do so. Interview got over at about 1130. I think he got the sound bites he was looking for.

1240 Looks like the test scripts show some failures. Looking at the logs, the heartbeat code is working right, but the test code doesn't think it is. In the "restart" test, when only one machine is up, the test looks for the "remote machine has joined" pattern when it should be looking for the "local machine has joined" pattern instead. Don't know why yet. 1315 Gotta go to Castle Rock and then to a going-away party for my cousin :-( Bye.

2001/01/26 =================================================================

0640 Really tired this morning. Guess I'm getting too old to get less than 6 hours sleep too often. Laura is feeling a little better this morning, so she went to school today.

Surprisingly, no reply from Lars on the CTS code. Ahhh... It just came in, in 2 parts. I responded to one of them. Decided to clean up my Trash folder, as it has over 10K unread messages in it. I'll get rid of all Trash from last year. Good to take out the trash once a year whether it needs it or not ;-)

0800 Need to get dressed, etc. 0825 Back to work. I need to look over the current copy of "Enterprise Linux"; it has a pretty cool cover article about the Weather Channel. Didn't actually read the article yet. Responded to Lars' comments. 0940 Finished responding to Lars' comments about CTS. Started implementing some of them. Splitting into multiple files, separating out the audit class. 1030 I'm exhausted and have a headache. Time to take a decongestant, a break and maybe a nap. Maybe I need to eat something? I see that we've done 602 iterations of the heartbeat code without any errors. This time I'm including the Stonith test in my set of tests to run (it slows things down a lot).

1055 Back to the salt mines :-) I feel a bit better. I see we're up to 655 iterations. I wrote the Audit class. I guess I'll stop the ongoing tests on 'servidor' and actually try the restructured code and see if it works, as opposed to "just compiles"

1145 Headache is back. Time for something stronger... Time for lunch... Went to lunch. Received a few boxes full of hardware for installing the network. Spent about 45 minutes checking the stuff out, making sure it was all there, etc. Laura came home sick and exhausted; took her to lunch, since she hadn't eaten. Still don't feel right. Took a half-hour nap. Spent a half-hour or so helping Amy get xawtv working on her PC, without much success. Having trouble importing some Python classes. Learning curve, I guess... (could it be a Python bug?)

Got email from Paddy about possible FailSafe meeting times. Replied, told him to avoid the CLIQ, 'cause I'm running a BOF (and representing SuSE?) there. Got email from Mia with corrected arrival date for hotel.

1740 Time to call it quits for a while and get Laura (and me) dinner. 2000 Called Joe Barr, and set up the appointment for the interview tomorrow at 11 AM. He seems like a really nice guy. I'm now writing the code for the CtsLab class. 2120 Tired. Going to bed. But, I feel better than I did earlier today.

2001/01/25 =================================================================

0525 Started work. This will be an odd day. Thursdays always are. Today a little more so than normal.

I see the overnight run I made crapped out after about 15 minutes because I had too many open files. Hmmm... Never saw that before... Not surprisingly, it was in the new AuditResources code... It was doing a popen for determining if the other node is up. I'm not waiting for the child process to finish before going on. I'll see if waiting for it to finish helps... I see it's gone 130 iterations this time. Before it only went 60. That's a good sign. Looks like that fixed it. It's been > 300 iterations.
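A minimal sketch of that kind of fix, written in present-day Python rather than the actual AuditResources code: run the probe command, wait for the child to exit, and only then move on, so each iteration releases its pipes and process slot. The function names and the ping-based probe are assumptions made for illustration, not what CTS really runs.

```python
import subprocess
import sys


def command_succeeds(argv, timeout=10.0):
    """Run argv, wait for it to exit, and report whether it exited with 0.

    subprocess.run() waits for the child and releases its resources before
    returning, so calling this in a long test loop does not pile up open
    file descriptors the way an un-waited popen can.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or took too long.
        return False
    return result.returncode == 0


def node_is_up(node):
    """Hypothetical probe: a single ping. The real audit may use another check."""
    return command_succeeds(["ping", "-c", "1", node])


if __name__ == "__main__":
    # Harmless self-check that does not need a network.
    print(command_succeeds([sys.executable, "-c", "pass"]))
```

The point is only the wait: whichever probe is used, the child has to be reaped before the next iteration starts.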

I got email from Lars about the CTS. I've been responding to it. He has some good comments.

0615 Need to get dressed to take Kathy to school so I can have a car today. My wife is sick, my mother-in-law has an infection from her surgery and my father-in-law and I both have doctors' appointments today...

0700 Back to work...

I'm continuing to respond to Lars' email. He made a couple of good points, and some I don't care about. Completing my reply took exactly an hour. We're now up to 580 successful test iterations.

3-4 people subscribed to the linux-ha-dev list today. Replying to them took until 0920 or so. More email, more travel planning...

1025 Time to go to Doctor's appt.

Went to Doctor's, did about 15 mins of coding, went to lunch with a good friend who needed some time to talk. Got done about 1400. Picked up Kathy from school at about 1440

1500 Started back to work. Lots of email arrived while I was gone. They changed my hotel reservation, so I have to print off new stuff to carry with me and tell Wombat new hotel name.

Included in the email was a VIRUS ALERT, TELL ALL YOUR FRIENDS! ;-)

Apparently disconnecting my laptop stopped the tests running. I had about 1100 iterations at that point.

Got a subscription email from a commercial HA firm. I sent them the same "welcome to the list, what brings you here?" note I send everyone. It'll be interesting to hear what they say.

1645 Need to go make preparations for dinner, etc. Laura stayed in bed all day. No word from my in-laws yet on how they did.

2330 Decided to check mail and read about the worm. Took about an hour. I see the tests I had started finished just fine. G'night.

2001/01/24 =================================================================

I started work this morning about 7:15.

I spent the first two hours this morning dealing with email and talking to Lars on IRC. He now knows my situation and a bit more about the priorities in SuSE, Inc. I agreed to write up a few paragraphs on the Cluster Test System (CTS) for him.

I made a doctor's appointment for tomorrow morning so I can get some prescriptions refilled before taking off to NYC. (about 15 mins)

I spent about an hour or so writing up the CTS for Lars.

I spent about 15 minutes explaining to MilesTek about the troubles I had ordering equipment from their web site. I scanned in some pages and emailed them out. Sigh...

Responded to some email from MC Linux about Stonith. They're considering adopting it, and had a few questions about the expect() function in it. My reply seemed to satisfy them. Guess that's good.

Took off for lunch at 12:17 PM, returned at 14:10. Had to make a trip by the house and pick up Laura from work.

Set up an appointment with horms for Wednesday dinner.

I wrote the code to tell if some, all or none of the resources in a group are held by the current node. Probably even works ;-)
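A rough sketch of what such a check could look like, assuming resources are identified by plain strings and that the set of resources the current node holds is already known. The function name and the sample resource names are made up for illustration; this is not the CTS code itself.

```python
def group_status(group, held):
    """Return "all", "some", or "none" for a resource group.

    group: iterable of resource names that make up the group.
    held:  iterable of resource names the current node holds.
    An empty group is reported as "none" (an arbitrary choice for this sketch).
    """
    group = set(group)
    owned = group & set(held)
    if not owned:
        return "none"
    if owned == group:
        return "all"
    return "some"


if __name__ == "__main__":
    # Hypothetical resource names.
    print(group_status(["10.0.0.5", "httpd"], ["10.0.0.5"]))
```

A "some" answer is the interesting one for auditing: it means the group is either mid-transition or split between nodes.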

Doing conference paperwork: Scheduling things, getting the current schedule for the conference room, etc. This will probably take me an hour to do. Meeting with Horms (VA Linux), Ben Rafanello (IBM), Wombat (Peter Badovinatz @ IBM), Thomas Schaffner (Enterprise Linux), Mike McQuaid (Winchester Systems). I also talked to Jon Doyle for a half-hour or so somewhere in here.

Sent some email about the heartbeat API to Ericsson in Montreal. Took about 15 minutes to write.

1645 quitting work for a while (Dogs are going nuts, and wife is sick). 2112 back for a bit. Gonna work on the resource stability thing... Finally backed up the laptop ;-) 2200 Going to bed. Got the new cts.py code working including polling for resources to become acquired.

2001/01/23 =================================================================

I started work this morning about 7:30. I took about a half-hour off for lunch. I stopped around 4:45 or so and put in a half hour or so later in the evening to catch up on email, etc.

More updates to the test suite. Basic Resource Auditing works! It's now in CVS too.

Need to get the CTS harness to not audit resources too soon. It looks like the IP addresses aren't getting set up as fast as the auditing is taking place.

Further examination seems to bear this out, but the heartbeat code doesn't give any particular message when the transition takeover scripts have completed. I put in a little code to loop for a while re-auditing things until they get better. They always seem to get better at least ;-)
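The retry loop is roughly this shape. A minimal sketch, assuming the audit is a callable that returns True when everything checks out; the function name, the timeout and the interval are illustrative values, not the ones in cts.py.

```python
import time


def wait_for_clean_audit(audit, timeout=60.0, interval=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Re-run audit() until it passes or the timeout expires.

    audit:    callable returning True when the resources look right.
    timeout:  how long to keep retrying, in seconds.
    interval: pause between attempts, in seconds.
    sleep/clock are parameters only so the loop can be tested without waiting.
    Returns True if an audit passed, False if time ran out first.
    """
    deadline = clock() + timeout
    while True:
        if audit():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    attempts = []

    def flaky_audit():
        # Pretend the IP addresses show up on the third look.
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_for_clean_audit(flaky_audit, interval=0.0))
```

Polling like this hides the real problem (no completion message), but it keeps the harness from reporting failures that are only a matter of timing.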

There are at least four possible cases:

  A machine went down:
    It held resources - we will take them over
    It didn't hold resources - we won't take them over
  A machine came up:
    It will request resources (only machine, not nicefailback)
    It won't request resources: it has none, or nicefailback

Or maybe it's simpler than that?

  A machine came up - resource acquisition prints completion msg in all cases
  A machine went down - takeover code prints msg when done in all cases

What this really is is looking for the completion of a transition. Right now the code doesn't really know when the resources have been fully acquired locally. This is not a good thing.

I suppose what I need is a message whenever it completes acquisition of a set of resources, or when it decides it's not going to.

I put in some new messages that indicate when acquisition of resources completes when done by heartbeat, but not for system failover takeovers. Those will have to go in the mach_down script or something like that. I'll try and get that later tonight. My goal for tonight is to fix this resource auditing problem.

It appears that this will require a new script which synchronously waits for resources to become served. It would be called by mach_down. Or, I suppose that mach_down could just do this itself, but this all sounds really hard, because of the messaging model used by the scripts. Maybe I could use a directory in "/var/lib/heartbeat" to keep track of what resources have been acquired. Or, I suppose I could poll to wait for them to be taken over... Yuck... Could be worse, I guess... Either way I think I get to poll...

I guess I'll just change mach_down to poll for the resources that we are still waiting to acquire rather than add new scripts. This is best done by enhancing ResourceManager to have a groupstat command or something like that, then mach_down can use that without duplicating a lot of code.

This item (the test harness) took by far the majority of my time. I suppose about 60-70%

Dropped Lars an email telling him about the updates to the test suite. Emailed some guy in France about publishing a Stonith paper for an IEEE journal.

Updated the HA web site with several minor things including stuff for Kimberlite, and the Open Cluster group (OSCAR).

Talked for a half-hour or so to Winchester Systems about getting an eval unit of their multi-interface RAID box. Made an appointment to talk at NYC.

Minor updates to the HA thoughts doc about various concerns.

I'm worried about Samba failover, and I'm worried about NFS failover. Jeremy Allison thinks Samba failover is hard, but it may be mainly an app thing. MC Linux has done the NFS failover and thinks it's hard. This may be partly smoke screen. Maybe we can get by without lock failover?

Started this Journal.

Emailed Ibrahim the suggested new paragraph for the Linux Journal.

2001/01/22 =================================================================

Spent several hours struggling with fetchmail problems. Finally got it working again with help from Chris Mahmood. Oakland had changed a bunch of things and they didn't take effect until a reboot happened over the weekend.

Wrote a bunch of code associated with resource auditing for the test suite. This includes the modification of the ClusterManager class and the creation of the new Resource class. Committed the changes to CVS.

Wrote the "HA Thoughts" document for where we're going with HA in SuSE.

Spent a bunch of time trying to figure out what the Baytech is doing. It seems to pause for a second every 3-4 seconds but respond OK otherwise. More ominously, it seems to give connection refused for a second or two every so often, at seemingly random times.

Over lunch tried to call WebGear. They seem to be out of business!

Updated the HA thoughts doc.

I've decided to try keeping my journal up-to-date as a way of tracking (and hopefully improving) my personal productivity. Unfortunately I had forgotten my Advogato password. But now I have it, so away I go...

This entry is really for yesterday (2000/01/16).

I spent a while trading in a plane ticket for COMDEX (which I didn't go to), for a ticket to LWCE at the end of the month.

I continued a dialogue with lmb about changing the Stonith API. We both agree it needs to change, and I think we're converging on how to change it.

I integrated multicast support into heartbeat CVS.

I integrated APC UPS support code into the Stonith subsystem, and put it under CVS.

Since other folks that I (mostly) don't know wrote these pieces of code, the only conclusion that I can draw is that this open source stuff must be working ;-).

I wrote up some release procedures for heartbeat and posted them on the web.

I got the CVS version to build correctly again after all these changes and put it on my test machines.

(I'm trying to follow my own release procedures ;-))

One thing I didn't expect to do was deal with a failure of the black printhead on my HP 2000C printer (it failed about 30% into its expected life).

I dealt with some folks from Avaya, and bought someone lunch who took me to CompUSA to get the print head. I fixed the stupid printer, and helped the fellow who took me to get the print head a little as he repaired our vacuum cleaner.

Somehow my ssh setup for my labs was broken, so I needed to repair that for my test tools.

All in all, a reasonably productive day.

I've been working on the heartbeat API. It actually works pretty well. Marcelo found a couple of bugs in it, and suggested restructuring a small piece of it. So, I fixed one of the bugs, will fix another, and let him do the restructuring (he wanted to). Marcelo's a good guy. I work closely with their people, and really anyone who's interested. I've issued specific invitations to everyone who's active in this area, including those guys in NC. Right now, we have folks from many companies using and contributing to heartbeat. It's a blast!

Linux Fail Safe is nearing its open source release. We're getting pretty excited about it. It's by far the most powerful of the High-Availability products, open or closed source.

I hope to position heartbeat to be able to do membership and low-level communication for lots of different projects. We'll write a new simple cluster manager, and use the heartbeat API. There is a place for an HA batch queueing system. Of course, it could use heartbeat ;-)

I hope to change FailSafe to use it. Perhaps even the folks at Mission-Critical Linux could use it. SGI is eyeing it for things I don't think I'm free to talk about.

It's basic, but it works pretty darn well, and gets better all the time ;-)

I got some nice feedback from Eric Ayers about my talk at the ALE (Atlanta Linux Enthusiasts) meeting last month. If you want me to speak to your LUG or conference about Linux-HA, let me know. I like giving talks.

I guess I ought to write at least one journal entry.

Lately, I've been spending most of my time doing at least six different things:

Promoting Linux-HA. A week ago last Thursday (whenever that was), I spoke to the Atlanta Linux Enthusiasts. Going to Atlanta in July wasn't my idea of good timing (it's hot and humid then), but the audience was very interested, and quite well-informed. The talk was very well received, and I even got an idea for a useful feature in heartbeat, which I implemented a few days later.

Working on reset code for LinuxFailSafe. It uses the STONITH API below.

Designing, writing, implementing and changing a STONITH API. STONITH == Shoot The Other Node In The Head. Also called STOMITH, substituting Machine for Node. I like STONITH because of the similarity to stoning a person, which represents the ultimate rejection from the community. In any case, I've been designing the abstract API, and writing code to implement it for the BayTech RPC-5.

Designing and implementing an API for heartbeat. Heartbeat is pretty nice in several ways, but it is limited in what it can do. It does heartbeats better than any other open source product I know of, but doesn't integrate with other applications to speak of. The API will allow it to be easily used with lots of other applications, whether with FailSafe, or Piranha, or CXFS, or Kimberlite, or with Stephen's new cluster manager, or some newly designed cluster manager or whatever. It is nearly complete, but needs some minor redesign to eliminate certain security issues from it before people start using it. You can get the code for this and the Stonith API from the Linux-HA CVS repository.

Generally working on heartbeat. Fixing it up, etc.

Strategizing on how SuSE should promote and package Linux-HA. Generally worrying about what should be done, and puzzling over how to get it done. This activity overlaps with lmb.

General Notes
I just got a new user for heartbeat that I am absolutely sure will need some tech support this winter. Heartbeat is now running in Tahiti :-)

I just found out that a talk I gave back in April won an award for the best talk of the day at the Lucent Technologies Software Symposium. That was certainly nice.
