<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for alanr</title>
    <link>http://www.advogato.org/person/alanr/</link>
    <description>Advogato blog for alanr</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Wed, 19 Jun 2013 19:56:09 GMT</pubDate>
    <item>
      <pubDate>Sun, 11 Mar 2001 21:45:42 GMT</pubDate>
      <title>11 Mar 2001</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=6</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=6</guid>
      <description>I've been testing some Weird stuff.  Got a bug report from a
guy from Dell that it dies after a day - right about the
time the code decides to print all the processes' memory
stats.  He says it happens every time.

&lt;p&gt; He's running with stonith enabled (which I hardly ever do,
it's hard on the hardware).  I can reproduce his problem
quite readily - I set the mem stat interval down to 5
minutes, and boom every 5 minutes it dies.

&lt;p&gt; It needs stonith for it to fail.  I removed the stonith
option and it didn't fail.

&lt;p&gt; I can also make it happen when I send it a SIGUSR2.

&lt;p&gt; It looks like if I send a SIGUSR2 to the highest process id,
then it prints out the memory stats and dies -- or just
dies...

&lt;p&gt; I need to look at the config file when it comes back up...
I had 8 processes, one of them a zombie on sgi2.  I'm not
sure how many the config wanted...

&lt;p&gt; The config had two links: one serial and one udp.
The processes we create are:
&lt;pre&gt;
	control process (parent process: runs last)
	write process
	read process
	write process
	read process
	master status process
&lt;/pre&gt;

&lt;p&gt; But, I had 8 processes at that time...

&lt;p&gt; There seems to be something wrong here ;-)

&lt;p&gt; Here's the pids from the logs:
&lt;pre&gt;
310	Prints "configuration validated"
312	prints "udp heartbeat started on..."
320	Still running... Locked...
321	Still running...
322	Still running... Locked...
322	Still running... Locked...
323	Still running... NOT Locked...
324	prints local status now set to... and Heartbeat restart
on...
	prints "link sgi2:eth0 up"  Still running LOCKED
	prints "resource acquisition completed (none)"
	prints "Link sgi1:eth0 dead"
	prints "mach_down takeover complete"

&lt;p&gt; 644	defunct... prints "resource acquisition completed"

&lt;p&gt; 697	Control process...
697	Also prints heartbeat restart on... (?)
697	Also prints "link sgi2:eth0 up"
	
684	Control process
	prints "starting serial heartbeat on ..."
693	HBWRITE
694	HBREAD
696	HBWRITE
696	HBREAD
697:	writes all the messages ;-)
	Master status process.
	prints and Heartbeat restart on... but only for local node
&lt;/pre&gt;


&lt;p&gt; Found a fork in initiate_reset() with no exit at the end...
and in an error leg in req_our_resources(), and
giveup_resources()

&lt;p&gt; This appears to have been the bug.  After replacing the
implicit return with an explicit exit, and doing so in a
couple of other places in some funky error legs, I can't
reproduce the problem any more.
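
&lt;p&gt; A minimal sketch of this kind of bug and its fix, in Python
rather than heartbeat's C.  The helper names (run_in_child,
wait_status) are made up for illustration and are not from the
heartbeat source.  The point is only that the child branch of a
fork has to end in an explicit exit; otherwise it returns into the
caller and carries on as an extra copy of the parent:
&lt;pre&gt;
```python
import os


def run_in_child(task):
    """Fork, run task() in the child, and make sure the child never returns."""
    pid = os.fork()
    if pid == 0:
        # Child: an explicit exit keeps it from falling back into the
        # caller's code and acting like a second copy of the parent.
        status = 1
        try:
            task()
            status = 0
        finally:
            os._exit(status)
    return pid  # only the parent gets here


def wait_status(pid):
    """Reap the child (so it is not left as a zombie) and return its exit code."""
    _, raw = os.waitpid(pid, 0)
    return os.WEXITSTATUS(raw)
```
&lt;/pre&gt;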

&lt;p&gt; I had also gotten a bug report that the multicast
option-parsing code didn't work.  I had "broken" it by
fixing the ppp-udp code.  However, my change was correct,
and the multicast parsing code was incorrect.
So, I fixed the multicast parsing code.  In the process I
discovered that even with the bug fix in, it still didn't work,
because the install process didn't install the mcast code.
So, now I have both the mcast code working and this bizarre
Stonith bug fixed.
I've been running this test configuration with multicast
(which I'd never tested before), and the stonith fix (but
stonith turned off, because I suspect that the test code
won't deal well with the machines getting rebooted each time
they leave the cluster).  Guess I ought to run a hundred
iterations or so of that. (and fix the test code if it's
broken).
 
Robert_Macaulay@Dell.com (the original bug reporter) is
currently setting it up for testing on his machines.  It
seems pretty likely that it'll work just fine for him.

&lt;p&gt; I ran 1000 iterations of the test code.  The final results
are:
&lt;tt&gt;
2001/03/11_14:23:38     Running test Restart [1000]&lt;br&gt;
2001/03/11_14:24:26     Stopping Cluster Manager on all
nodes&lt;br&gt;
2001/03/11_14:24:31     ****************&lt;br&gt;
2001/03/11_14:24:31     Overall Results:{'BadNews': 0,
'success': &lt;b&gt;1000&lt;/b&gt;, 'failure': &lt;b&gt;0&lt;/b&gt;}&lt;br&gt;
2001/03/11_14:24:31     ****************&lt;br&gt;
2001/03/11_14:24:31     Detailed Results&lt;br&gt;
2001/03/11_14:24:31     Test Restart:{'success': &lt;b&gt;524&lt;/b&gt;,
'WasStopped': 156, 'node:sgi1': 253, 'calls': 524,
'node:sgi2': 271, 'skipped': 0, 'failure': &lt;b&gt;0&lt;/b&gt;,
'auditfail':&lt;b&gt;0&lt;/b&gt;}&lt;br&gt;
2001/03/11_14:24:31     Test flip:{'down-&amp;gt;up': 160,
'up-&amp;gt;down': 316, 'success': &lt;b&gt;476&lt;/b&gt;, 'started': 160,
'calls': 476, 'stopped': 316, 'skipped': 0, 'failure':
&lt;b&gt;0&lt;/b&gt;,
'auditfail': &lt;b&gt;0&lt;/b&gt;}&lt;br&gt;
2001/03/11_14:24:31     &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; TESTS COMPLETED&lt;br&gt;

&lt;p&gt; 435.94user 75.19system 13:28:48elapsed 1%CPU
(0avgtext+0avgdata 0maxresident)k&lt;br&gt;
0inputs+0outputs (6605568major+2597838minor)pagefaults
0swaps
&lt;/tt&gt;
&lt;p&gt;
This is great news.  I need to run another set of 1000, and
then some other tests (probably involving the stonith_host
option), and then we'll declare it stable I
think.  Many thanks to  Aaron Nienhuis and Robert Macaulay
for finding these bugs and saving our users from finding
them in a "stable" release.</description>
    </item>
    <item>
      <pubDate>Thu, 15 Feb 2001 16:12:18 GMT</pubDate>
      <title>15 Feb 2001</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=5</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=5</guid>
      <description>2001/02/04 (Sunday)
==========================================================
	Spent an hour or more unpacking from the trip, spread across
	various times of the day.  This isn't quite so time-consuming
	as packing, fortunately ;-)

&lt;p&gt; 1130	Am reconfiguring various computers in the network
today.
	Need to move the KVM switch down to the basement and put it
on
	the ones down there.  Hooked up one of the machines.  Need
2
	more cables to do them all.  Better go get them, I guess
;-)
	I converted my talk to HTML and pushed it and the
StarOffice source
	both out to the linux-HA site.  Did a little more general
updating
	of the site.  Much more still needs to be done,
unfortunately.
1230	This stuff took about an hour or so.

&lt;p&gt; 2100	Writing a script to update the lab machines with RPMS
automagically.
2145	Done.  
	Gonna try and get the lab machines updated and a set of
tests started
	to see if the fix I made in NYC works.
2200	Tests started successfully.  Bye for now.

&lt;p&gt; 2001/02/05
=================================================================
0645	Not much mail came in last night.  Better check my
fetchmail.
	Last night's test run of 500 iterations completed
successfully at 0630.
	So, the fix I made on the plane/hotel room during the LWCE
works.
	Good timing ;-)  I got some more mail.  Fetchmail must be
working OK.
	All 500 iterations succeeded. I also cut down the standard
failover
	time to 3 seconds.  Need to update the changelog.  The
viewgraphs
	I put up on the web point at the home page, which points
	at Red Hat software, because it's out of date.  I guess I'd
better
	fix that as soon as I fix the ChangeLog.

&lt;p&gt; 0730	OK.  Fixed the changelog.  Time to release the
development version 'k'.
	Need to find my freshmeat password, so I can follow the
"official"
	procedures I documented on the web.  Better go get my
PalmPilot.
	Freshmeat has changed a lot since I was there last.  I need
to
	announce it on freshmeat and on the lists.  Found it.  I
need to
	change lots of stuff to conform to the way freshmeat is now
set up.
	This may take a while.  I guess the freshmeat II rewrite is
pretty new
	(last of Jan), so I would have had to do this, and now
wasn't too bad
	a time.
0810	Now I need to announce it to the mailing lists ;-)
	Discovered a few minor glitches on the web page.  Netscape
crashed :-(

&lt;p&gt; 0832	Got the release notices out to SuSE internal sources
and the various
	external HA mailing lists.
	I keep a little whiteboard with my near-term TODO list on
it.
	Here's what's on it right now:
		Fix "sitemap" program
		Read CVS book
		Update TODO list on web
		Post Talk VGs (already did that)
		Test New version (already did that)
		Update home page
		Work more on test scripts
		Move disk drive to "servidor"
		Email:
			OSCAR folks
			Baytech weirdness

&lt;p&gt; 		Release Unstable version (already did that)
	I'll update the board.  Fortunately it's easy ;-)
	Done.

&lt;p&gt; 0838	I guess my next priority ought to be to fix the home
page, since
	potential SuSE customers will be reading it, and it says I
work for
	Lucent and I recommend Red Hat.
	It's more than a year out of date - a bit embarrassing :-(
	OK.  Updated the home page (a little).  That didn't take
long.
	Now I'll tackle the TODO list
	Then either sitemap or some CVS book reading.

&lt;p&gt; 0852	Done.  On to the todo list...
	Dropped a note to the Linux Weekly folks about their poor
choice in
	names since it conflicts with LWN - my favorite Linux
publication ;-)

&lt;p&gt; 0955	Finished the TODO list, and announced it.  Hmmm... What
next?
	Guess the CVS knowledge is pretty sorely needed at this
point.
	I'll go read for a while.  I need to know about
how/when/why to set
	up CVS branches.  I need to add some for linux-ha, I
think...
	Short-term todo list looks much nicer now ;-)
	Of course, to understand this, I have to know something
about CVS
	tags, too ;-).  I also wrote a little script which tags my
CVS
	tree with a tag derived mechanically from the release
number.
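
&lt;p&gt; 	A sketch of how such a tag could be derived, in Python.  The
	function name and the REL_ prefix are assumptions, not the actual
	script.  It relies only on the CVS rule that a tag must start with
	a letter and may contain only letters, digits, '-' and '_':
&lt;pre&gt;
```python
import re


def release_to_cvs_tag(release, prefix="REL_"):
    """Map a release string such as '0.4.8k' to a legal CVS tag."""
    # CVS tags must start with a letter and may hold only letters,
    # digits, '-' and '_', so every other character becomes '_'.
    body = re.sub(r"[^A-Za-z0-9_-]", "_", release)
    return prefix + body
```
&lt;/pre&gt;
	With these assumptions, release "0.4.8k" maps to the tag
	REL_0_4_8k, which could then be passed to "cvs tag".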

&lt;p&gt; 	Got some question email about heartbeat - answered.
	Got some question email about AutoMake/Build - answered.
	Got a suggestion about the ToDo list.  Incorporated it.
	Ate lunch.  Took about 10 minutes.

&lt;p&gt; 1140	Now on to "sitemap"...
	What are the symptoms?
		Directories with index.html in them aren't made into
			links.
		The directory LWCE-NYC-2001 is omitted from the
			directory name displayed.  The links under
			it are all fine.
		Sorting should be case-insensitive
		It treats some files as directories.
		Perhaps the dirname() function is screwed up?
		Seems so.  Sorting is still case-sensitive.
		Oh.  It's doing perl sorting. Fixed it.  fixed file
sorting, too.
		Somehow we're not picking up the title, etc. from some
pages.
		$Title and $X-Meta-Description are missing from them...
	Something appears to be wrong/changed with HTML::HeadParser
	It isn't always returning the info to us...
	It seems to have something to do with the DTD line netscape
	puts in.  It doesn't like it.  I need to remove it.
	Sigh...  25-30 page edits later...

&lt;p&gt; 1340	Got them all removed.  The index looks much better, but
	still isn't quite right...  Sorting is still off...
	Of course, all the modification times are all wrong :-(
	I should have tried updating to a newer version of the Perl
packages.

&lt;p&gt; 1355	Fixed sorting.  Now I know why I was avoiding this ;-)
	Site map all better now.
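
&lt;p&gt; 	For illustration, a case-insensitive sort like the one wanted
	here, sketched in Python rather than the Perl the sitemap program
	appears to use.  The function name is hypothetical.  Names are
	compared without regard to case, and the exact spelling is used
	only to break ties so the order stays deterministic:
&lt;pre&gt;
```python
def sort_entries(names):
    """Sort names ignoring case; ties fall back to the exact spelling."""
    return sorted(names, key=lambda name: (name.casefold(), name))
```
&lt;/pre&gt;
	A plain sort puts every upper-case name ahead of the lower-case
	ones, which is why LWCE-NYC-2001 would otherwise sort before
	index.html.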

&lt;p&gt; 	I'm worried about getting DSL service when I move.  It
seems
	that there will be a 2-week delay after moving in and
getting
	a basic phone line installed.  This would mean I'd have to
use dialup
	for about 2 weeks :-(

&lt;p&gt; 1425	OK...  Back to working on CTS...
	I think I'll add the "monitor" function to IPaddr next.
	The basic thing is to "ping" the address.
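
&lt;p&gt; 	A rough sketch of such a ping-based check, written in Python
	for illustration; it is not the IPaddr script itself.  The
	function names are hypothetical, and it assumes a Unix-style ping
	that accepts "-c" for the packet count.  The "run" parameter exists
	only so the check can be exercised without touching the network:
&lt;pre&gt;
```python
import subprocess


def ping_command(address, count=1):
    """Build the ping invocation used to probe the address."""
    return ["ping", "-c", str(count), address]


def monitor(address, timeout=5.0, run=subprocess.run):
    """Return True if the address answers a ping within `timeout` seconds."""
    try:
        result = run(ping_command(address), stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or ping could not be started at all.
        return False
    return result.returncode == 0
```
&lt;/pre&gt;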

&lt;p&gt; 1450	Done.  Committed to CVS.
	Now change the code to actually use it in the audits...

&lt;p&gt; 1520	It looks like the tests should be pinging the node to
make sure
	it's really serving the IP address as we go along.
	And, we should be verifying that all resources in a group
are
	being served by the same node.
	Oh...  Except I haven't put the latest version of the code
on
	the test cluster which means it ought to be failing (!?)
	OOPS.  It wasn't actually being called.  The if-condition
	was too complex.  It's a little simpler now, and now it
	fails like it ought to ;-)
	I distributed the new IPaddr script to the lab machines.
	It seems to work now.  I'll restore a little of the debug
logging
	to make sure...  Yep.  It's working.
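
&lt;p&gt; 	The "all resources in a group are served by the same node"
	audit could look something like this sketch.  The function name
	and the two dictionaries are assumptions made for the example, not
	the CTS data structures:
&lt;pre&gt;
```python
def audit_groups(groups, owner_of):
    """Return the names of resource groups that fail the audit.

    groups   -- dict: group name -> list of resource names (hypothetical)
    owner_of -- dict: resource name -> node currently serving it
    """
    split = []
    for group, resources in sorted(groups.items()):
        owners = {owner_of.get(res) for res in resources}
        # A healthy group has exactly one owner, and it is a real node.
        if len(owners) != 1 or None in owners:
            split.append(group)
    return split
```
&lt;/pre&gt;
	A group fails if its resources are spread over more than one node
	or if any of them is not being served at all.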

&lt;p&gt; 1600	I'll start a series of tests running.  They take around
8 hours IIRC.
	I need to check mail before quitting for the evening.
	Not a lot of mail.  Only 7 new emails.
	Martin Konold pointed out I forgot to mention the download
URL.

&lt;p&gt; 1610	It's now corrected both inside SuSE and outside.  Time
to quit
	for the evening.

&lt;p&gt; 2100	My freshmeat entry got thrown away.  I'll need to
resubmit
	it and the information on the main branch.  Sigh...
	Got an email from Volker.  It needed a reply.
	I sent out a Call for Refinements for heartbeat.
	I sent out a wish list for what apps people want to make
HA.
	I bought two more KVM cables.  Now I can hook all the
machines
	up to the switch.
	Wired up another computer to the KVM switch.  I'd wire up
the
	other two except I need to wait for the tests to finish.
	Speaking of tests, 380 or so have already succeeded.
	Only 120 more to go ;-)  Ted Ts'o sent me mail about the
Lucent
	winmodem problems, next chapter.  I sent him a brief reply.
Sigh...

&lt;p&gt; 2210	Bye for now.  395 tests run so far.

&lt;p&gt; 2001/02/06
=================================================================
0620	All 500 tests completed successfully.
	Looks like my mails to the list and to Volker have
generated some
	responses.  It'll take a while to go through them.  Most of
the
	responses were pretty much what was expected.  But, I'll
update the
	ToDo list with a couple of them anyway.
	Composed an email to send Volker and Markus about staffing.
	Responded to more email.  And more email, and more email.

&lt;p&gt; 0900	Time to process more email.
	Some from Lars, some from the ha list, some from others.
	Need to check the web stats and see how many downloads have
	occurred of the new code, but it's probably too soon to see
them
	in the reports yet.

&lt;p&gt; 1000	Time to finish hooking the cluster up to the KVM
switch.
	Done.  Now, what next?  I'm getting pretty close to being
happy
	with the test environment as it stands now.  But, I still
need
	the "environment" dimension.  I guess that's a good next
step.
	Also add "quorum" to the ClusterManager class.
	HasQuorum() added.  Looks like it works.

&lt;p&gt; 	Let's see if I can remember all the things we still need to
add to
	the test code.  I'll go reread the email on it...
	The main thing remaining was Scenarios.  Scenarios were the
idea
	that we might run a particular set of configurations like
	what kind of resources, or what kind of workload either
from the
	test machine, or workload running on the cluster machines.

&lt;p&gt; 	Need to drop Lars an email about the state of the test tool
and
	the HasQuorum member functions.  On second thought, I'll
save that
	until I'm done.  Otherwise too much time is lost.
	Now on the "Scenario" concept...
	Worked on it a while, went to lunch (took nearly an hour
today -
		getting out of the house was wonderful - a good break from
		the more-usual 10 minutes)
1225	Got back, got some detailed mail from SGI about CTS. 
Am writing
	a detailed response.  This is taking a while.
1345	Finished.  Now back to the scenarios...

&lt;p&gt; 1435	I now have the code for a basic, robust StartUp
scenario.
	Wonder if it works? ;-)
1530	It seems to work now. It's also integrated into the
RandomTest class.
	Hmmm...  It seems the Quorum changes didn't all make it
into CVS.
	I'm putting them back in.

&lt;p&gt; 1600	Bye for now.

&lt;p&gt; 1915	Got lots of email to respond to.  Looks like some folks
at HP
	may want to use heartbeat in a product.

&lt;p&gt; 2023	Bye for now.

&lt;p&gt; 2055	I just can't seem to stay away.  More email responses
(~15 mins).
	Back to the home network configuration ;-)
	Got an emergency request to make more free space on some
FAT partitions.
	I'm doing that now in the "background".
	Looks like 338 tests successful so far with new version.  I
now
	have CVS access from "servidor" too, so things are easier
to do
	right now ;-)

&lt;p&gt; 2230	We're now up to 430 tests successfully done.
	Tomorrow I need to:
		Do paperwork for LWCE/NYC trip :-(
		Attend the "All hands meeting" conference call at 1100
		Write some kind of nasty ScenarioComponent for something
			like web server traffic or memory hog or CPU hog
			or generic network traffic or swap hogs, or something.
			A flood pingfest comes to mind as being a good place to
			start &amp;lt;;-)
		Move big disk to backup machine
2300	Bye for now.
2330	Changed my mind.  Going to add a VerifyAllIdle action
to
	the ResourceManager script tonight and then invoke it from
the startup
	script.  This will give folks who make one of the two most
common
	errors a good clue that they made a mistake.  The guys from
HP
	made this common error, and I've had it with this problem!
2345	All 500 tests passed.
0016	The new code for the verifyallidle action is in, and
activated.  It
	seems to work just fine.  Now to update the ChangeLog.
0030	All put in CVS.  Send email to the HP guys ;-)
0040	Bye for now.

&lt;p&gt; 2001/02/07
=================================================================
0605	Checked email.  A number from Lars, a couple from HA
lists, Lenz.
	Sent replies, filed.
0710	Find receipts for LWCE.  Start expense report.  Process
more email.
0940	Paperwork done.  Need to send it out.  Now on to the
nasty pingfest
	ScenarioComponent.  Should be fun ;-)  Looks like the last
batch
	of 500 tests finished successfully at about 09:25.

&lt;p&gt; 1045	Looks like the PingFest flood ping test is working -
perhaps a
	little too well ;-) The tests are running really slowly -
but
	they're working!  The switch port lights are on pretty
nearly solid ;-)

&lt;p&gt; 1100	Went to the conference call.  SuSE is letting most
everyone go
	here in the US.  Looks like I get to find a new job ;-)
	Update resume, phone call, interview
	Repeat until new job.

&lt;p&gt; 2001/02/08
=================================================================
2001/02/09
=================================================================
2001/02/10
=================================================================
</description>
    </item>
    <item>
      <pubDate>Thu, 15 Feb 2001 16:08:19 GMT</pubDate>
      <title>15 Feb 2001</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=4</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=4</guid>
      <description>2001/01/28
=================================================================
0350  Had trouble sleeping.  Prayed for a few people.  Got
up.
	The "restart" test case for when only one machine is up
	looks like it had an obvious bug.  It said:
&lt;pre&gt;
		if node == self.CM.OurNode:
 			pat = self.uspat
&lt;/pre&gt;
	But it should have said:
&lt;pre&gt;
		if node == self.CM.OurNode or self.CM.upcount() &amp;lt; 1:
 			pat = self.uspat
&lt;/pre&gt;
	instead.  I applied the fix on "servidor".  Looks like it
was having
	some problems with X11 forwarding too.  I changed the rsh
command
	to suppress forwarding X11 ports (since I don't need them).
	Looks like there's also a bug in the Stonith test such that
it
	doesn't look for the right patterns if the other node is
down.
	Another wee bug, this one slightly more subtle:&lt;pre&gt;
	    if (self.CM.upcount() == 1 and
self.CM.ShouldBeStatus[node]
        	==      self.CM["up"]):
&lt;/pre&gt;
	Should have simply been:&lt;pre&gt;
	    if (self.CM.upcount() == 1):
&lt;/pre&gt;
	I decided the logging should be part of the CtsLab class.
	So, it's now in the (as yet unused) CtsLab class.
	Another wee bug, this time in the Stonith code:
&lt;pre&gt;
	   if (self.CM.upcount() == 1):
&lt;/pre&gt; 
	should have been
&lt;pre&gt;
	   if (self.CM.upcount() &amp;lt;= 1): 
&lt;/pre&gt;
	It's fixed now.

&lt;p&gt; 0615  Sent out CtsLab code to list. Probably ought to take a
nap before
	church ;-)

&lt;p&gt;       Somewhere along in here I spent an hour or two
packing.

&lt;p&gt; 1615  The entire 500 tests all went successfully. 
Definitely fixed the bug,
	since I ran it with the same random number set...
        Some other minor bugs having to do with reporting at
the end
        were introduced.  I think I fixed them.
1715  Continuing to write the Lab class.  Break time.
2000  Break over.  Hope to get the lab class integrated and
working tonight.
2300  Looks like they're working together fine.  Better quit
while I'm ahead
      after backing things up ;-).  G'night.
	
2001/01/29
=================================================================
0700  Today should be mostly a packing and preparing to
leave day.
      I have some loose ends to take care of before I leave,
but today I
      have a car, but of course it's snowing pretty nicely
outside ;-)

&lt;p&gt;       However, I'm going to have to look at the heartbeat
code anyway,
      because it looks like I triggered a bug in the
heartbeat code
      with the tests.  I guess that's what testing is
supposed to do ;-)
      The test code "hit the jackpot" . "Both machines own
foreign resources".
      The evidence should be in the logs.  I'll see.  The
error occurred
      about 3 hours into the test run.

&lt;p&gt;       The problem is caused by the machine which had just
come up (sgi2)
      failing to hear any heartbeats from the machine which
was up all along.
      Perhaps this is caused by a piece of the code in the
takeover sequence
      which waits for the takeover to complete, hence
keeping packets from
      being sent out.

&lt;p&gt;       Another possibility would be that it is a problem in
the receiving code
      startup.  This sounds more likely.  Perhaps the
startup code should
      be more synchronous.  This is what the timing looks
like:
	Jan 29 02:03:56 sgi2 Starting heartbeat 0.4.8k
	Jan 29 02:04:00 sgi2 UDP heartbeat started 
	Jan 29 02:04:01 sgi2 WARN: node sgi1: is dead   

&lt;p&gt;       OK, that's technically our deadtime (5 seconds), but
we didn't give the
      other guy much of a chance to give us a heartbeat,
because we had not
      been up very long yet.  With a heartbeat interval of only 1
second, this is
      almost impossible.  Under heavy load or with a 3
second dead time, I
      could imagine this being much worse.  I think I
remember wondering
      if this could happen before.

&lt;p&gt;       Sounds like we should start the timing of "dead" time
from the moment
      we receive an ack that all of the child read/write
processes are
      up and running.  I guess that means that the code
needs to send such
      ACKs and that the heartbeat core timing logic needs to
track them
      and modify its idea of the "epoch" accordingly.
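
&lt;p&gt;       A small Python sketch of that idea (not the heartbeat C
      code; the class and method names are invented for the example).
      Silence is counted from the later of the "epoch" (the moment the
      read/write children report ready) and the peer's last heartbeat,
      and nobody is declared dead before the epoch is set:
&lt;pre&gt;
```python
class DeadTimer:
    """Declare a peer dead only after `deadtime` seconds of silence,
    counted from when our own I/O children reported ready."""

    def __init__(self, deadtime):
        self.deadtime = deadtime
        self.epoch = None        # set once all read/write children ACK
        self.last_heard = {}     # node -> time of its latest heartbeat

    def children_ready(self, now):
        self.epoch = now

    def heartbeat(self, node, now):
        self.last_heard[node] = now

    def is_dead(self, node, now):
        if self.epoch is None:
            return False         # cannot hear anyone yet, so judge no one
        # Silence is measured from the later of the epoch and the last beat.
        since = max(self.epoch, self.last_heard.get(node, self.epoch))
        return now - since >= self.deadtime
```
&lt;/pre&gt;
      Taking the log above as an example, and assuming the children
      are ready at the "UDP heartbeat started" line: with a 5 second
      deadtime, sgi1 would not be declared dead at 02:04:01, only one
      second after we could first have heard it.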

&lt;p&gt;       I guess this is great progress!  I've moved from
debugging the test
      tool to debugging the thing it's testing!  Now I just
need to think
      carefully about how to fix this bug in heartbeat ;-)

&lt;p&gt; 0800	I sent out last week's journal, and saved a similar
email as a template
	to make sending it out in the future easier.  I'm going to
go finish
	packing now, and come back to the bug later.

&lt;p&gt; 1000	Finally finished packing!  Now to run errands and do
all the other
	things I need to do before leaving town for a few days.

&lt;p&gt; 1345	Got home and am doing a little more cleaning up,
reading email, etc.
1405	Gotta go get Laura from Mandalay (work).
1430	Went to go see the builder of our house and try and
straighten
	out some things in how the house is put together.
2000	Checking email, printing off schedule document.  Need
to stop this to
	order some Orinoco cards, and go to bed... (to about
2100).  Finally
	got to bed around 0000.

&lt;p&gt; 2001/01/30
=================================================================
0430	Today I leave for LWCE/NYC.  Expect less detail in the
	subsequent entries, since I'll spend most of my time away
from my
	laptop.  Better pack up the laptop, etc ;-)
	Got everything packed and made the plane on time, etc.
	Trip went without incident.  Coded a little  fix to the
timing bug
	I discovered with CTS.  Watched the movie.  It was
	"Remember the Titans".  I recommend it highly.
	Arrived in NYC a little later than planned.  Spent about a
half-hour
	trying to get my cell phone to work in NYC.  It was a pain,
my
	vendor needed to take some special security precautions to
keep my
	NAM, etc. from being stolen and someone from making calls
on it.
	Annoying.

&lt;p&gt; 	I took the shuttle to the hotel and checked in fine.  By
the time
	this happened it was a little too late to make it to the
Javits center
	to check in today.
	Worked a little on the code.  Got the timing fix "mostly"
working.
	Got a call from Horms, and went to dinner with him and his
buddies
	from VA.  Had a good discussion about where I want
heartbeat to go
	and what he wants to do with it also.  Ate an Aussie meat
pie.
	It was pretty good.  He said it was a little higher-class
pie
	than you'd often get in Australia.  Went home about 2300. 
Got to
	bed around 0000.
2001/01/31
=================================================================
0700	Really tired this morning.
	Made it to the Javits Center about 0830 or
	so.  Talked to LWN staff at the speakers room.  Got
registered
	both as speaker and as Exhibitor.  I did LOTS of
appointments today.
	Stacey Quandt from Giga didn't show, but everyone else
did.  I also
	spoke to a freelance journalist who had very similar ideas
about
	the "small" enterprise and what they need from HA.  He had
heard
	me speak when I was at Bell Labs in Naperville and dropped
by to see
	me.  Here was my agenda, which was mostly followed:
	1000 Ben Rafanello and friends, IBM
	1115 Stacey Quandt (no-show)
	1200 D. H. Brown
	1400 Jon Doyle &amp;amp; Compaq
	1500 Dean Pannell
	1500 Peter Badovinatz (IBM) at Developers' Den
	1830 IBM Party.  Spent a lot of time with Peter B (Wombat)
	Learned some very interesting things from Peter, what he
said, and what
	he didn't say.  Glad I spent the time with him.

&lt;p&gt; 	It was a long, busy, productive day, and I don't have much
voice left.
	I'm going to have to be careful, or I won't have any voice
left for
	my talk on Friday.  I'll take some throat lozenges with me
tomorrow.
	It seems to me that the show has been pretty good as far as
size and
	people coming by.  I also talked to Ted Ts'o about the
Lucent
	winmodem debacle, and also with someone from IBM (Frank
Novak
	fnovak@us.ibm.com) who will help ensure that Lucent does
the right
	thing.  I need to tell Ted about him.  I also met Patrick
Martel
	of MandrakeSoft.  Dan Cox of Compaq told me to contact
Wayne Opland
	about the HA disk (512) 432-8146.
	
2001/02/01
=================================================================

&lt;p&gt; 0630	Got up, pulled down email, finished the fix for the
timing bug.  It
	seems to work fine now.  Wrote a reply to Markus and Jay
asking that
	they tell me sooner rather than later if they have feedback
on how
	I spend my time :-)  Updated CVS with the timing fix.
	Getting ready to go to the Javits Center.  Maybe I'll have
a little time
	to look around on the show floor today :-)
	Spent about 20 minutes writing up the notes from the show
so far.
0810	Go to Javits Center.  Bye for now ;-)
2345	I spent the whole day at the show, mostly talking to
potential
	customers, suppliers, partners, etc.  My appointments today
were
	with Thomas Schaffner of Enterprise Linux, Mike McQuaid of
Winchester
	Systems, and Peter Badovinatz of IBM.  I talked to lots of
other
	people though, including one person from Lawrence Berkeley
	Laboratories who might be interested in having us provide
professional
	services to help him deploy a high-availability web
server.  I also
	talked to Oracle about HA issues, SGI, and various other
people
	whom I've forgotten.  I did finally get out of the
booth for an hour
	today to look around.  Bought a book.  Got a few goodies.
	I worked with Joshua Uziel (uzi) to fix a byte-ordering bug
that the
	findif.c code had.  He packaged it up in a patch and mailed
it to
	me.
	Other people I talked to:  Shane Painter of Dell (whom I
met in
	Austin), Eric Lam of Coventive (interesting hardware
model), Nate
	Perlstein of SGI (FailSafe support), Charlie Simpson of
Enterprise
	Linux, and Satoshi Kawata of Red Hat Japan.

&lt;p&gt; 	I stopped by the Mission Critical Linux folks and it sounds
like
	they may end up using our open source test tool to help
test
	their clusters.  Right now they test everything by hand.

&lt;p&gt; 	I bought my first meal since leaving home.  Everything else
has
	been freebies and a snack or two ;-)

&lt;p&gt; 	I had a great conversation with our IBM liason (Malcolm?). 
It seems
	that he didn't know that SuSE had any HA efforts.  I
corrected
	this misimpression.  It was a really good thing I think. 
	It sounds like he may have me go meet some IBM folks. 
	Apparently Malcolm has good news regarding our relationship
with IBM.
	Better go to bed now, and get up to work on my talk
tomorrow morning.

&lt;p&gt; 	An aside:  Apparently John Mehaffey mentioned us in one of
	his talks.   At least 2 or 3 people came by to see me as a
result.
	I'll drop him a thank you note.

&lt;p&gt; 2001/02/02
=================================================================
	Today I give my talk, and I return home.
0500	My stomach was a little unsettled, so I went ahead and
got up.
	I need to reread my talk and see if I can/need to add
anything
	regarding the various APIs to the talk.  Get dressed, do a
	little packing, etc.

&lt;p&gt; 0545	Begin rewriting talk to change emphasis to Linux-HA
APIs from
	being a heartbeat talk.

&lt;p&gt; 0645	Began a runthrough of the talk.  It took about 45
minutes.  It should
	fit in the time allotted.  I'm a little worried about it
being a little
	short.

&lt;p&gt; 0740	Start to pack up in earnest.  Am tired already.  Sad
state of affairs.
	Better locate my Penguin mints for later ;-)  Took a little
nap
	before leaving.

&lt;p&gt; 0910	Time to pack up the laptop and leave for the
conference. 

&lt;p&gt; 1950	Went over to the Javits center by cab.  Arrived about
10 AM.
	I run into Liz and Michael Hammell from the Linux Weekly
News.
	It turns out that Liz is returning to Denver on the same
flight
	I am.  We make arrangements to share a ride to the airport.

&lt;p&gt; 	Went by the booth.  Talked for quite a while with Anas
about
	clustering issues and then with Andreas Archangelli mainly
about
	debugging tools.  I'm glad he has a better attitude about
them than
	Linus does.  Maybe I ought to duplicate some of the "klog"
tools
	for Linux.  Wonder if Avaya would open source them?  Maybe
I should
	have Roger or someone send me some klog output (if he could
get
	some easily) so I could show it to Andreas.

&lt;p&gt; 	I went to go hear Dirk's talk - a little late.  Dirk seems
	well-prepared and has a good talk.  My PalmPilot alarm goes
off
	near the end.  It's time to go check out the room I'll give
my talk
	in, and run through a little of it.  I discover I'm more
nervous
	than I'd guess.  I wonder if anyone much will show up for
the
	last talk in the conference?  One couple shows up 30
minutes early(!). 
	Others show up shortly afterwards.  Doesn't sound like I
have much
	to worry about.  After a few minutes I sit down with the
people
	who've come in and talk to them.  It was nice - seems to
calm
	down my nerves.  I find that a guy from Bloomberg financial
	services that I met before is here.  He's a Russian (?)
guy.
	I get his card.  I'm supposed to send him a copy of the
slides
	from today.
	
	I don't know the routine here.  Will someone introduce me? 
When
	should I start?  About 2 minutes after, I decide that no
one will
	introduce me, and I'll start my talk now.  I don't find any
controls
	for the lights, but someone in the audience tells me and I
get
	the lights dimmed.  By a few minutes into the talk there
are 40-50
	people in the room.  Nice turnout.

&lt;p&gt; 	I get my first question.  It's very confusing.  It takes a
few
	minutes to figure out what he wants to know.  I'm about to
cut
	the discussion off when I figure it out and answer it. 
Now,
	more questions come.  I'm beginning to warm up, and my
sense
	of humor takes off and the audience laughs.  Now I'm having
fun,
	have lots to say, and they ask lots of questions.  The talk
finishes
	at almost exactly the right time!  It went very well.  They
were a
	good audience.  [I agreed to put the slides up on the
Linux-HA site].

&lt;p&gt; 	A fellow from LynuxWorks wants to talk to me.  He's on the
	mailing list (but I don't remember him too clearly).  He
thinks they
	might put some resources on the Linux-HA project.  He tells
me
	they are going to open up the Intel High-Availability forum
to
	other people - he implies that he means people like me,
perhaps
	me specifically. [I look up email from him later, and I
realize
	that he's a fellow I accidentally insulted on the list. I
guess
	he must have forgiven me].  Liz rings and says she wants to
say bye
	to folks and will call me a little later.

&lt;p&gt; 	I go to bag check to get my coat and bag, and go up to the
booth
	to talk to folks before Liz calls again.  I chat a bit, run
into a
	guy from Conectiva.  I get him some small SuSE souvenirs
for
	himself and my friends at Conectiva (Marcelo, Olive and
Luis Claudio).
	Olive runs SuSE on his machine ;-) My phone rings, and it's
Liz.
	Time to go.

&lt;p&gt; ~1530	We get a limo and ride to the airport.  It was a bit
more expensive
	than I'd like, but it was starting to rain and lots of
people are
	looking for rides, so we take it.

&lt;p&gt; 	There's another fellow in the car with us, so we all chat. 
Liz wants
	to know about his company.  He reads Linux Weekly News,
and seems
	to have heard of heartbeat.  So we all have something to
talk about.

&lt;p&gt; 	We arrive at the airport, in plenty of time.  All is well. 
We
	exchange travel horror stories.  It seems Liz has a bit of
a
	travel problem phobia, and has had a few experiences to
match.
	She's going to go to talk at LinuxWorld Expo in Singapore.
	She agrees to give me a ride home (it's not far out of the
way).
	The bus is fine, but being dropped off at home is nicer.  I
realize
	that I left my Minidisc player with Stephen Ing.  Oops! 
Liz also
	says that the LWCE audiences rate the speakers on a 5-point
scale.  I
	wonder how my talk was rated?

&lt;p&gt; ~1810	We load up on the plane.  After we're en route, the
pilot thinks
	we'll be in Denver 30 minutes early.  He seems skeptical of
his
	flight computer ;-)  So am I.  I nap until they turn off
the
	seat belts sign.  They bring dinner.  It's not too bad. 
The movie
	comes on, and I dig out my laptop for this report.  It took
me
	20 minutes or so to write up the part after 0910.

&lt;p&gt; 2033	I switch my watch to Denver time.  Now it's 1834 ;-)

&lt;p&gt; 1836	I decide to write Stephen an email, along with one to
the Russian
	fellow, and the one I need to send Ted Ts'o.  If I feel
like it,
	I'll try and catch up on the email from the list as well. 
I'll
	send John Mehaffey a note of thanks too.  I added Brian's,
	Alexender's, and John Mehaffey's info to my address book.


&lt;p&gt; 1917	I sent those emails.  Now I'll try and catch up on
other email.
	I applied Uzi's byte ordering patch.  I'll try it when I
get home
	and have a network.  I also need to send email to Rudy
Pawul about
	or the Enterprise Linux people.
2019	I got rid of around 100 emails, and replied to many. 
I've got about
	another half-hour to go on the flight.  Guess I'd better
figure out
	how/when to finish up.  Still need to email to/about Rudy.
2025	It's getting rough up here.  Better shut down and put
up the laptop.
	Bye :-)

&lt;p&gt; 	I had a most pleasant return trip with Liz and her family. 
They
	very kindly just dropped me off at home.

&lt;p&gt; 2001/02/03
=================================================================
0800	Downloaded, read and replied to a little email.  About
an hour I
	suppose.  Wife and I both tired, cranky :-(
	I tried to grab email mid-afternoon.  DSL down :-(
	Got it back up in about a half-hour of time with Qwest.
	Very tired after the show.  Zzzz.

&lt;p&gt; 2100	Read, replied to more mail.  Updated main and
commercial pages
	on linux-ha web site.  Thought some more about the upshot
	from my talk.  There is a lot of interest in HA things, and
in
	particular I MUST split out the core code from the cluster
	manager code.  This has to be a near-term development
priority.
	Users want it, Anas needs it, others too...  It just
becomes
	way more useful that way.  I believe development,
especially
	from others, is blocked because of this.  I VERY MUCH need
	to update the TODO list.  It's WAY out of date.
	Another thing to add to the TODO list:  Make the
configuration
	code plug-in modules, too...

&lt;p&gt; 2210	G'night.  It's 0010 East Coast time now.  No wonder I'm
tired.
	I need to update my personal todo list from this journal
next week.
	I'll send this out to my loyal readers ;-)
</description>
    </item>
    <item>
      <pubDate>Thu, 15 Feb 2001 16:00:42 GMT</pubDate>
      <title>15 Feb 2001</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=3</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=3</guid>
      <description>2001/01/27
===========================================================
0530  Woke up.  Decided that this is 0730 NYC time, so it
must be time to
	get up ;-)  Checked mail.  Went back to
restructuring/enhancing the
	test code.  Still having occasional problems with
python naming
	and modules, but now think I have a good strategy worked
out for
	using it.  Looks like the latest iteration of restructuring
is now
	working.  Guess I'd better go off and figure out what to do
next.
0800  Enough for now.  I made some pretty good progress on a
couple of fronts.

&lt;p&gt; 1055  Joined Joe Barr on his recording bridge.  He had 10
questions to ask,
	and I answered them as best I could.  He's now interested
in HA things
	and may write an article on it.  He'll likely give me
another call
	if he decides to do so.  Interview got over at about 1130. 
I think
	he got the sound bites he was looking for.

&lt;p&gt; 1240  Looks like the test scripts show some failures. 
Looking at the logs
	the heartbeat code is working right, but the test code
doesn't think
	it is.  The failing case is the "restart" test: when only one
machine is up,
	it looks for the pattern for "remote machine has joined" when
it should
	be looking for the pattern "local machine has joined"
instead.
	Don't know why yet.
1315  Gotta go to Castle Rock and then going-away party for
my cousin :-(
	Bye.

&lt;p&gt; 2001/01/26
=================================================================
0640  Really tired this morning.  Guess I'm getting too old
to get less
	than 6 hours sleep too often.  Laura is feeling a little
better this
	morning, so she went to school today.

&lt;p&gt; Surprisingly, no reply from Lars on the CTS code.  Ahhh...
It just came in,
in 2 parts.  I responded to one of them.  Decided to clean
up my Trash folder,
as it has over 10K unread messages in it.  I'll get rid of
all Trash from last
year.  Good to take out the trash once a year whether it
needs it or not ;-)

&lt;p&gt; 0800  Need to get dressed, etc.
0825  Back to work.  I need to look over the current copy of
"Enterprise Linux"
	it has a pretty cool cover article about the Weather
Channel.
	Didn't actually read the article yet.  Responded to Lars'
comments.
0940  Finished responding to Lars' comments about CTS. 
Started implementing
	some of them.  Splitting into multiple files, separating
out the
	audit class.
1030  I'm exhausted and have a headache.  Time to take a
decongestant, a break
	and maybe a nap.  Maybe I need to eat something?  I see
that we've
	done 602 iterations of the heartbeat code without any
errors.  This
	time I'm including the Stonith test in my set of tests to
run
	(it slows things down a lot).

&lt;p&gt; 1055  Back to the salt mines :-)  I feel a bit better.  I
see we're up to 655
	iterations.  I wrote the Audit class.  I guess I'll stop
the
	ongoing tests on 'servidor' and actually try the
restructured code
	and see if it works, as opposed to "just compiles"

&lt;p&gt; 1145  Headache is back.  Time for something stronger... 
Time for lunch...
	Went to lunch.  Received a few boxes full of hardware for
installing
	the network.  Spent about 45 minutes checking the stuff
out,
        making sure it was all there etc.  Laura came home
sick and exhausted,
        took her to lunch, since she hadn't eaten.
	Still don't feel right.  Took a half-hour nap.  Spent a
half-hour or
	so helping Amy get xawtv working on her PC, without much
success. 
	Having trouble importing some Python classes. 
	Learning curve, I guess... (could it be a Python bug?)

&lt;p&gt; 	Got email from Paddy about possible FailSafe meeting times.
        Replied, told him to avoid the CLIQ, 'cause I'm
running a BOF
        (and representing SuSE?) there.
	Got email from Mia with corrected arrival date for hotel.

&lt;p&gt; 1740  Time to call it quits for a while and get Laura (and
me) dinner.
2000  Called Joe Barr, and set up the appointment for the
interview
	tomorrow at 11 AM.  He seems like a really nice guy.  I'm
now writing
	the code for the CtsLab class.
2120  Tired. Going to bed.  But, I feel better than I did
earlier today.

&lt;p&gt; 2001/01/25
=================================================================
0525 Started work.  This will be an odd day.  Thursdays
always are. 
	Today a little more so than normal.

&lt;p&gt; I see the overnight run I made crapped out after about 15
minutes because I
had too many open files.  Hmmm...  Never saw that before...
Not surprisingly, it was in the new AuditResources code...
It was doing a popen for determining if the other node is
up.
I'm not waiting for the child process to finish before going
on.
I'll see if waiting for it to finish helps...
I see it's gone 130 iterations this time.  Before it only
went 60.
That's a good sign.  Looks like that fixed it.  It's been &amp;gt;
300
iterations.
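
&lt;p&gt; [A sketch of the fix described above, not the actual
AuditResources code.  It assumes Python's subprocess module (a later
replacement for a bare popen) and a hypothetical helper name,
command_succeeds.  The point is only that each probe waits for its
child to exit, so pipes get closed instead of piling up.]

&lt;pre&gt;
```python
import subprocess
import sys


def command_succeeds(argv):
    """Run argv, wait for it to exit, and report whether it exited with status 0."""
    # subprocess.run() waits for the child and closes its pipes before
    # returning, so repeated calls do not accumulate open file descriptors.
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for the real "is the other node up?" probe (an assumption).
    probe = [sys.executable, "-c", "import sys; sys.exit(0)"]
    for _ in range(20):
        assert command_succeeds(probe)
    print("20 probes completed")
```
&lt;/pre&gt;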

&lt;p&gt; I got email from Lars about the CTS.  I've been responding
to it.
He has some good comments.

&lt;p&gt; 0615 Need to get dressed to take Kathy to school so I can
have a car
	today.  My wife is sick, my mother-in-law has an infection
from
	her surgery and my father-in-law and I both have doctors'
appointments
	today...

&lt;p&gt; 0700 Back to work...

&lt;p&gt; I'm continuing to respond to Lars' email.  He made a couple
of good points,
	and some I don't care about.  Completing my reply took
exactly an hour.
	We're now up to 580 successful test iterations.

&lt;p&gt; 3-4 people subscribed to the linux-ha-dev list today. 
Replying to them took
	until 0920 or so.  More email, more travel planning...

&lt;p&gt; 1025 Time to go to Doctor's appt.

&lt;p&gt; Went to Doctor's, did about 15 mins of coding, went to lunch
with a good friend
who needed some time to talk.  Got done about 1400.  Picked
up Kathy from school at about 1440

&lt;p&gt; 1500 Started back to work.  Lots of email arrived while I
was gone. They
	changed my hotel reservation, so I have to print off new
stuff
	to carry with me and tell Wombat new hotel name.

&lt;p&gt; Included in the email was a VIRUS ALERT, TELL ALL YOUR
FRIENDS! ;-)

&lt;p&gt; Apparently disconnecting my laptop stopped the tests
running.  I had about
1100 iterations at that point.

&lt;p&gt; Got a subscription email from a commercial HA firm.  I sent
them the
same "welcome to the list, what brings you here?" note I
send everyone.
It'll be interesting to hear what they say.

&lt;p&gt; 1645 Need to go make preparations for dinner, etc.  Laura
stayed in bed all
day.  No word from my in-laws yet on how they did.

&lt;p&gt; 2330 Decided to check mail and read about the worm.  Took
about an hour.
	I see the tests I had started finished just fine.  G'night.


&lt;p&gt; 2001/01/24
=================================================================

&lt;p&gt; I started work this morning about 7:15.

&lt;p&gt; I spent the first two hours this morning dealing with email
and talking to
Lars on IRC.  He now knows my situation and a bit more about
the priorities
in SuSE, Inc.  I agreed to write up a few paragraphs on the
Cluster Test
System (CTS) for him.

&lt;p&gt; I made a doctor's appointment for tomorrow morning so I can
get some
prescriptions refilled before taking off to NYC. (about 15
mins)

&lt;p&gt; I spent about an hour or so writing up the CTS for Lars.

&lt;p&gt; I spent about 15 minutes explaining to MilesTek about the
troubles
I had ordering equipment from their web site.  I scanned in
some pages and
emailed them out.  Sigh...

&lt;p&gt; Responded to some email from MC Linux about Stonith. 
They're considering
adopting it, and had a few questions about the expect()
function in it.
My reply seemed to satisfy them.  Guess that's good.

&lt;p&gt; Took off for lunch at 12:17 PM, returned at 14:10.  Had to
make a trip by
the house and pick up Laura from work.

&lt;p&gt; Set up an appointment with horms for Wednesday dinner.

&lt;p&gt; I wrote the code to tell if some, all or none of the
resources in a group
are held by the current node.  Probably even works ;-)
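
&lt;p&gt; [Roughly what that check amounts to, as a sketch only.  The
function name, the plain list of resource names and the set of held
resources are all assumptions for illustration, not the real cts.py
interfaces.]

&lt;pre&gt;
```python
def group_status(resources, held_by_node):
    """Return "all", "some" or "none" for how much of the group this node holds."""
    held = [r for r in resources if r in held_by_node]
    if not held:
        # Also covers an empty group: nothing to hold, so nothing is held.
        return "none"
    if len(held) == len(resources):
        return "all"
    return "some"


if __name__ == "__main__":
    group = ["10.0.0.5", "httpd"]          # hypothetical resource group
    print(group_status(group, {"10.0.0.5", "httpd"}))
    print(group_status(group, {"httpd"}))
    print(group_status(group, set()))
```
&lt;/pre&gt;

&lt;p&gt; [A "some" answer is the interesting one for auditing: it
means a group is split or still in transition.]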

&lt;p&gt; Doing conference paperwork: Scheduling things, getting the
current schedule
for the conference room, etc.  This will probably take me an
hour to do.
Meeting with Horms (VA Linux), Ben Rafanello (IBM), Wombat
(Peter Badovinatz @ IBM), Thomas Schaffner (Enterprise
Linux),
Mike McQuaid (Winchester Systems).  I also talked to Jon
Doyle for a
half-hour or so somewhere in here.

&lt;p&gt; Sent some email about the heartbeat API to Ericsson in
Montreal.  Took about
15 minutes to write.

&lt;p&gt; 1645	quitting work for a while (Dogs are going nuts, and
wife is sick).
2112	back for a bit.  Gonna work on the resource stability
thing...
	Finally backed up the laptop ;-)
2200	Going to bed.  Got the new cts.py code working
including polling
	for resources to become acquired.


&lt;p&gt; 2001/01/23
=================================================================

&lt;p&gt; I started work this morning about 7:30.  I took about a
half-hour off for
lunch.  I stopped around 4:45 or so and put in a half hour
or so later
in the evening to catch up on email, etc.

&lt;p&gt; More updates to the test suite.
Basic Resource Auditing works!  It's now in CVS too.

&lt;p&gt; Need to get the CTS harness to not audit resources too soon.
It looks like the IP addresses aren't getting set up as fast
as the auditing
is taking place.

&lt;p&gt; Further examination seems to bear this out, but the
heartbeat code doesn't give any particular message when the
transition takeover scripts have completed.
I put in a little code to loop for a while re-auditing
things until they
get better.  They always seem to get better at least ;-)
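
&lt;p&gt; [The re-audit loop is nothing fancy.  A minimal sketch,
assuming a hypothetical audit() callable that returns True once the
resources look right; the name audit_with_retries, the attempt count
and the delay are made up for illustration.]

&lt;pre&gt;
```python
import time


def audit_with_retries(audit, attempts=10, delay=1.0, sleep=time.sleep):
    """Re-run audit() until it passes or the attempts are used up."""
    for attempt in range(attempts):
        if audit():
            return True
        # Give the takeover scripts time to bring resources up before
        # auditing again; no need to sleep after the final attempt.
        if attempt != attempts - 1:
            sleep(delay)
    return False
```
&lt;/pre&gt;

&lt;p&gt; [Bounding the number of attempts matters: if the resources
never show up, the audit should fail rather than hang the test run.]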

&lt;p&gt; There are at least four possible cases:
     A machine went down:
          It held resources - we will take them over
          It didn't hold resources - we won't take them over
     A machine came up
          It will request resources (only machine, not
nicefailback)
	  It won't request resources: it has none, or nicefailback

&lt;p&gt; Or maybe it's simpler than that?
	A machine came up - resource acquisition prints completion
msg in
				all cases
	A machine went down - takeover code prints msg when done in
all cases

&lt;p&gt; What this really is is looking for the completion of a
transition.
Right now the code doesn't really know when the resources
have been
fully acquired locally.  This is not a good thing.

&lt;p&gt; I suppose what I need is a message whenever it completes
acquisition of
	a set of resources, or when it decides it's not going to.

&lt;p&gt; I put in some new messages that indicate when acquisition of
resources
completes when done by heartbeat, but not for system
failover takeovers.
Those will have to go in the mach_down script or something
like that.
I'll try and get that later tonight.
My goal for tonight is to fix this resource auditing
problem.

&lt;p&gt; It appears that this will require a new script which
synchronously waits
for resources to become served.  It would be called by
mach_down.
Or, I suppose that mach_down could just do this itself, but
this
all sounds really hard, because of the messaging model used
by the scripts.
Maybe I could use a directory in "/var/lib/heartbeat" to
keep track of
what resources have been acquired.
Or, I suppose I could poll to wait for them to be taken
over...
Yuck...  Could be worse, I guess...  Either way I think I
get to poll...

&lt;p&gt; I guess I'll just change mach_down to poll for the resources
that we are
still waiting to acquire rather than add new scripts.  This
is best done
by enhancing ResourceManager to have a groupstat command or
something like
that, then mach_down can use that without duplicating a lot
of code.

&lt;p&gt; This item (the test harness) took by far the majority of my
time.  I suppose about 60-70%

&lt;p&gt; Dropped Lars an email telling him about the updates to the
test suite.
Emailed some guy in France about publishing a Stonith paper
for an IEEE journal.

&lt;p&gt; Updated the HA web site with several minor things including
stuff for
   Kimberlite, and the Open Cluster group (OSCAR).

&lt;p&gt; Talked for a half-hour or so to Winchester Systems about
getting an
    eval unit of their multi-interface RAID box.  Made an
appointment to
    talk at NYC.

&lt;p&gt; Minor updates to the HA thoughts doc about various concerns.

&lt;p&gt; I'm worried about Samba failover, and I'm worried about NFS
failover.
Jeremy Allison thinks Samba failover is hard, but it may be
mainly an app
thing.  MC Linux has done the NFS failover and thinks it's
hard.
This may be partly smoke screen.  Maybe we can get by
without lock failover?

&lt;p&gt; Started this Journal.

&lt;p&gt; Emailed Ibrahim the suggested new paragraph for the Linux
Journal.


&lt;p&gt; 2001/01/22
=================================================================

&lt;p&gt; Spent several hours struggling with fetchmail problems.
Finally got it working again with help from Chris Mahmood. 
Oakland had
  changed a bunch of things and they didn't take effect
until a reboot
  happened over the weekend.
Wrote a bunch of code associated with resource auditing for
the test suite.
  This includes the modification of the ClusterManager class
and the creation
  of the new Resource Class.
  Committed the changes to CVS.
Wrote the "HA Thoughts" document for where we're going with
HA in SuSE.
Spent a bunch of time trying to figure out what the Baytech
is doing.
   It seems to pause for a second every 3-4 seconds, but
respond OK otherwise,
   But more ominously it seems to give connection refused
for a second or two
      every so often at seemingly random times.

&lt;p&gt; Over lunch tried to call WebGear. They seem to be out of
business!

&lt;p&gt; Updated the HA thoughts doc.
</description>
    </item>
    <item>
      <pubDate>Wed, 17 Jan 2001 17:32:50 GMT</pubDate>
      <title>17 Jan 2001</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=2</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=2</guid>
      <description>I've decided to try keeping my journal up-to-date as a way
of tracking (and hopefully improving) my personal
productivity.  Unfortunately I had forgotten my Advogato
password.  But now I have it, so away I go...

&lt;p&gt; This entry is really for yesterday (2001/01/16).

&lt;p&gt; I spent a while trading in a plane ticket for COMDEX (which
I
didn't go to), for a ticket to LWCE at the end of the month.

&lt;p&gt; I continued a dialogue with lmb about changing the Stonith
API.  We both agree it needs to change, and I think we're
converging on how to change it.

&lt;p&gt; I integrated multicast support into heartbeat CVS.

&lt;p&gt; I integrated APC UPS support code into the Stonith
subsystem, and put it under CVS.

&lt;p&gt; Since other folks that I (mostly) don't know wrote these
pieces of code, the only conclusion that I can draw is that
this open source stuff must be working ;-).

&lt;p&gt; I wrote up some release procedures for heartbeat and posted
them &lt;a href="http://linux-ha.org/heartbeat/release.html" &gt;on
the web&lt;/a&gt;.

&lt;p&gt; I got the CVS version to build correctly again after all
these changes and put it on my test machines.

&lt;p&gt; (I'm trying to follow my own release procedures ;-))

&lt;p&gt; One thing I didn't expect to do was deal with a failure of the
black printhead on my HP 2000C printer (it failed about 30%
into its expected life).

&lt;p&gt; I dealt with some folks from Avaya, and bought someone lunch
who took me to CompUSA to get the print head.   I fixed the
stupid printer, and helped the fellow who took me to get the
print head a little as he repaired our vacuum cleaner.

&lt;p&gt; Somehow my ssh setup for my labs was broken, so I needed to
repair that for my test tools.

&lt;p&gt; All in all, a reasonably productive day.
</description>
    </item>
    <item>
      <pubDate>Fri, 4 Aug 2000 00:06:30 GMT</pubDate>
      <title>4 Aug 2000</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=1</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=1</guid>
      <description>I've been working on the heartbeat API.  It actually works
pretty well.  Marcelo found a couple of bugs in it, and
suggested restructuring a small piece of it.  So, I fixed
one of the bugs, will fix another, and let him do the
restructuring (he wanted to).  Marcelo's a good guy.  I work
closely with their people, and really anyone who's
interested.  I've issued specific invitations to everyone
who's active in this area, including those guys in NC. 
Right now, we have folks from many companies using and
contributing to heartbeat.  It's a blast!
&lt;p&gt;
Linux Fail Safe is nearing its open source release.  We're
getting pretty excited about it.  It's by far the most
powerful of the High-Availability products, open or closed
source.
&lt;p&gt;
I hope to position heartbeat to be able to do membership and
low-level communication for lots of different projects. 
We'll write a new simple cluster manager, and use the
heartbeat API.  There is a place for an HA batch queueing
system.  Of course, it could use heartbeat ;-)
&lt;p&gt;
I hope to change FailSafe to use it.  Perhaps even the folks
at Mission-Critical Linux could use it.  SGI is eyeing it
for things I don't think I'm free to talk about.
&lt;p&gt;
It's basic, but it works pretty darn well, and gets better
all the time ;-)
&lt;p&gt;
I got some nice feedback from Eric Ayers about my talk at
the ALE (Atlanta Linux Enthusiasts) meeting last month.  If
you want me to speak to your LUG or conference about
Linux-HA, let me know.  I like giving talks.
</description>
    </item>
    <item>
      <pubDate>Sun, 23 Jul 2000 05:52:24 GMT</pubDate>
      <title>23 Jul 2000</title>
      <link>http://www.advogato.org/person/alanr/diary.html?start=0</link>
      <guid>http://www.advogato.org/person/alanr/diary.html?start=0</guid>
      <description>I guess I ought to write at least &lt;i&gt;one&lt;/i&gt; journal entry.
&lt;p&gt;
Lately, I've been spending most of my time doing at least
six different things:
&lt;p&gt;
Promoting &lt;a href="http://linux-ha.org/" &gt;Linux-HA&lt;/a&gt;.  A
week ago last Thursday (whenever &lt;i&gt;that&lt;/i&gt; was), I spoke
to the Atlanta Linux Enthusiasts.  Going to Atlanta in July
wasn't my idea of good timing (it's hot and humid then), but
the audience was very interested, and quite well-informed. 
The talk was very well received, and I even got an idea for
a useful feature in heartbeat, which I implemented a few
days later.
&lt;p&gt;
Working on reset code for LinuxFailSafe.  It uses the
STONITH API below.
&lt;p&gt;
Designing, writing, implementing and changing a STONITH API.
STONITH == &lt;b&gt;S&lt;/b&gt;hoot &lt;b&gt;T&lt;/b&gt;he &lt;b&gt;O&lt;/b&gt;ther &lt;b&gt;N&lt;/b&gt;ode
&lt;b&gt;I&lt;/b&gt;n &lt;b&gt;T&lt;/b&gt;he &lt;b&gt;H&lt;/b&gt;ead.  Also called STOMITH,
substituting Machine for Node.  I like STONITH, because of
the similarity to Stoning a person, representing the
ultimate rejection from the community.  In any case, I've
been designing the abstract API, and writing code to
implement it for the &lt;a
href="http://www.baytechdcd.com"&gt;BayTech&lt;/a&gt; &lt;a
href="http://www.baytechdcd.com/products/rpc5.shtml"&gt;RPC-5&lt;/a&gt;. 
&lt;p&gt;
Designing and implementing an API for heartbeat.  Heartbeat
is pretty nice in several ways, but it is limited in what it
can do.  It does heartbeats better than any other open
source product I know of, but doesn't integrate with other
applications to speak of.  The API will allow it to be
easily used with lots of other applications, whether with
FailSafe, or Piranha, or CXFS, or Kimberlite, or with
Stephen's new cluster manager, or some newly designed
cluster manager or whatever.  It is nearly complete, but
needs some minor redesign to eliminate certain security
issues from it before people start using it.  You can get
the code for this and the Stonith API from the Linux-HA CVS
repository.
&lt;p&gt;
Generally working on heartbeat.  Fixing it up, etc.
&lt;p&gt;
Strategizing on how SuSE should promote and package
Linux-HA.  Generally worrying about what should be done, and
puzzling over how to get it done.  This activity overlaps
with lmb.
&lt;p&gt;
&lt;b&gt;General Notes&lt;/b&gt;&lt;br&gt;
I just got a new user for heartbeat that I am &lt;i&gt;absolutely
sure&lt;/i&gt; will need some tech support this winter.  Heartbeat
is now running in Tahiti :-)
&lt;p&gt;
I just found out that a talk I gave back in April won an
award for the best talk of the day at the Lucent
Technologies Software Symposium.  That was certainly nice.
</description>
    </item>
  </channel>
</rss>
