11 Mar 2001 alanr   » (Master)

I've been testing some weird stuff. I got a bug report from a guy at Dell that it dies after a day - right about the time the code decides to print all the processes' memory stats. He says it happens every time.

He's running with stonith enabled (which I hardly ever do, it's hard on the hardware). I can reproduce his problem quite readily - I set the mem stat interval down to 5 minutes, and boom, every 5 minutes it dies.

It needs stonith for it to fail. I removed the stonith option and it didn't fail.

I can also make it happen when I send it a SIGUSR2.

It looks like if I send a SIGUSR2 to the highest process id, then it prints out the memory stats and dies -- or just dies...

I need to look at the config file when it comes back up... I had 8 processes, one of them a zombie on sgi2. I'm not sure how many the config wanted...

The config had two links: one serial and one udp. The processes we create are:

	control process (parent process: runs last)
	write process
	read process
	write process
	read process
	master status process

But I had 8 processes at that time, and that list only accounts for 6...

There seems to be something wrong here ;-)

Here's the pids from the logs:

310	Prints "configuration validated"
312	prints "udp heartbeat started on..."
320	Still running... Locked...
321	Still running...
322	Still running... Locked...
322	Still running... Locked...
323	Still running... NOT Locked...
324	prints local status now set to... and Heartbeat restart
	prints "link sgi2:eth0 up"  Still running LOCKED
	prints "resource acquisition completed (none)"
	prints "Link sgi1:eth0 dead"
	prints "mach_down takeover complete"

644 defunct... prints "resource acquisition completed"

697	Control process...
697	Also prints heartbeat restart on... (?)
697	Also prints "link sgi2:eth0 up"
684	Control process prints "starting serial heartbeat on ..."
693	HBWRITE
694	HBREAD
696	HBWRITE
696	HBREAD
697:	writes all the messages ;-) Master status process. prints and Heartbeat restart on... but only for local node

Found a fork in initiate_reset() with no exit at the end of the child's code... and the same thing in an error leg in req_our_resources(), and in giveup_resources().

This appears to have been the bug. After replacing the implicit return with an explicit exit, and doing so in a couple of other places in some funky error legs, I can't reproduce the problem any more.
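
The general failure mode is the classic one: after fork(), a child that finishes its work and returns instead of exiting keeps running the caller's code as a second copy of the parent. Here's a minimal sketch of the fixed shape in C. It is not the real initiate_reset(); the names run_helper() and do_child_work() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for whatever the forked helper does. */
static int do_child_work(void)
{
    return 0;
}

/*
 * Fork a helper and wait for it.
 * In the parent: returns the child's exit status, or -1 on error.
 * The child never returns from this function; it always calls _exit().
 */
int run_helper(void)
{
    pid_t pid = fork();
    int status;

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /*
         * Child: do the work, then leave explicitly. Without this
         * _exit() the child would return into the caller's code and
         * carry on as an unwanted extra copy of the parent.
         */
        _exit(do_child_work() == 0 ? 0 : 1);
    }

    /* Only the parent can get this far. */
    assert(pid > 0);

    /* Reap the child so it doesn't linger as a zombie. */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The sketch uses _exit() rather than exit() so the child doesn't flush stdio buffers it inherited from the parent. Which of the two the real fix should use is a judgment call I'm not making here.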

I had also gotten a bug report that the multicast option-parsing code didn't work. I had "broken" it by fixing the ppp-udp code. However, my change was correct, and the multicast parsing code was incorrect, so I fixed the multicast parsing code. In the process I discovered that even with the bug fix in, it didn't work, because the install process didn't install the mcast code.

So, now I have both the mcast code working and this bizarre stonith bug fixed. I've been running this test configuration with multicast (which I'd never tested before) and the stonith fix (but with stonith turned off, because I suspect that the test code won't deal well with the machines getting rebooted each time they leave the cluster). Guess I ought to run a hundred iterations or so of that (and fix the test code if it's broken).

Robert_Macaulay@Dell.com (the original bug reporter) is currently setting it up for testing on his machines. It seems pretty likely that it'll work just fine for him.

I ran 1000 iterations of the test code. The final results are:

2001/03/11_14:23:38 Running test Restart [1000]
2001/03/11_14:24:26 Stopping Cluster Manager on all nodes
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Overall Results:{'BadNews': 0, 'success': 1000, 'failure': 0}
2001/03/11_14:24:31 ****************
2001/03/11_14:24:31 Detailed Results
2001/03/11_14:24:31 Test Restart:{'success': 524, 'WasStopped': 156, 'node:sgi1': 253, 'calls': 524, 'node:sgi2': 271, 'skipped': 0, 'failure': 0, 'auditfail':0}
2001/03/11_14:24:31 Test flip:{'down->up': 160, 'up->down': 316, 'success': 476, 'started': 160, 'calls': 476, 'stopped': 316, 'skipped': 0, 'failure': 0, 'auditfail': 0}
2001/03/11_14:24:31 <<<<<<<<<<<<<<<< TESTS COMPLETED

435.94user 75.19system 13:28:48elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6605568major+2597838minor)pagefaults 0swaps

This is great news. I need to run another set of 1000, and then some other tests (probably involving the stonith_host option), and then I think we'll declare it stable. Many thanks to Aaron Nienhuis and Robert Macaulay for finding these bugs and saving our users from finding them in a "stable" release.
