9 May 2008 apenwarr   » (Master)

2008-05-09: Great moments in probability

<!-- start of entry 200805/09 -->

Years ago (around 1999), when dcoombs and I were debugging the first versions of our "weaver" Linux-based server appliances from our apartment in Waterloo, we used to test on the cheapest hardware we could obtain.

One of these boxes absolutely refused to boot weaver, but the symptoms were strange. We had three ways of booting: boot from a CD, install an image on the hard drive and boot that, or load Etherboot from a floppy and use that to network-boot the kernel over TFTP.

The symptoms were as follows:

  1. Booting from CD worked fine.
  2. Installing from CD to the hard drive and booting that worked fine.
  3. Booting a weaver image from the hard drive (with a kernel downloaded via FTP) always gave a kernel decompression error.
  4. The Etherboot TFTP process would always abort with a timeout after a few packets. (Etherboot of the era would do that occasionally even on a good day, but here it happened every time.)

The obvious conclusion here was that our weaver kernel image was broken, because you could boot the Debian kernel from either CD or hard disk without a problem. Right?

Well... as it turned out, no. The actual problem was a horribly broken network card that would randomly corrupt bits. About 9 out of 10 packets would be corrupted. You'd think that would be obvious, right?

Well, no. In fact, TCP/IP is specifically designed to deal with the occasional corrupted packet. TCP and UDP carry a 16-bit checksum on every packet, and if it doesn't match, the receiver simply throws the packet away; with TCP, the sender is supposed to resend (and it does!).
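To make the checksum idea concrete, here is a minimal sketch of a 16-bit ones' complement sum in the style of RFC 1071. It is a simplification: real TCP and UDP checksums also cover a pseudo-header and the protocol header, and `flip_bit` and the sample payload are hypothetical names made up for illustration. It shows a single flipped bit being caught, and one kind of damage (two 16-bit words swapped) that this style of checksum cannot see.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


payload = b"weaver kernel image chunk"  # made-up sample data
good = internet_checksum(payload)

# A single flipped bit changes the sum, so the receiver would drop the packet.
single_flip = flip_bit(payload, 13)
print("single bit flip detected:", internet_checksum(single_flip) != good)

# Swapping two 16-bit words leaves the sum unchanged, so this damage passes.
swapped = payload[2:4] + payload[0:2] + payload[4:]
print("word swap detected:", internet_checksum(swapped) != good)
```

With only 16 bits of check value, some damaged packets are bound to look valid; the only question is how often.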

I had noticed the FTP transfers were surprisingly slow, but not *that* slow, and back in those days, you could never quite remember if your network card was 10 Mbit or 100 Mbit. This happened to be a 100 Mbit card, but 9/10 packets were getting thrown away, so we got around 10 Mbit performance from FTP.

But here's what killed us: a 16-bit checksum only has 65536 possible values, so a randomly corrupted packet has roughly a 1 in 65536 chance of passing the check anyway. A 9/10 error rate means you're sending 10x as much data as you think you are, so a 12MB kernel+rootdisk package is actually about 120MB of packets; that is, about 80000 packets at 1500 bytes each, of which about 72000 arrive corrupted. On average, then, about one corrupted packet per transfer slips past the checksum, so a typical transfer was destined to have a tiny number of incorrect bytes! Ha!
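The arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a measurement: it assumes every corrupted packet independently has a 1 in 65536 chance of passing the checksum, which is an idealization of how real bit errors interact with the checksum. The constant names are made up for illustration; the figures come from the paragraph above.

```python
PACKAGE_BYTES = 12_000_000   # the 12MB kernel+rootdisk package
PACKET_BYTES = 1500          # packet size used in the estimate
CORRUPT_FRACTION = 0.9       # 9 out of 10 packets corrupted
CHECKSUM_VALUES = 2 ** 16    # possible values of a 16-bit checksum

# Packets that must arrive intact, and packets sent to get that many through.
packets_needed = PACKAGE_BYTES / PACKET_BYTES
packets_sent = packets_needed / (1 - CORRUPT_FRACTION)
corrupted = packets_sent * CORRUPT_FRACTION

# Assumption: each corrupted packet passes the check with probability 1/65536.
p_slip = 1 / CHECKSUM_VALUES
expected_undetected = corrupted * p_slip
p_at_least_one = 1 - (1 - p_slip) ** corrupted

print(f"packets sent: {packets_sent:.0f}")
print(f"corrupted packets: {corrupted:.0f}")
print(f"expected undetected bad packets per transfer: {expected_undetected:.2f}")
print(f"chance of at least one per transfer: {p_at_least_one:.2f}")
```

Under these assumptions the expected number of undetected bad packets comes out slightly above one per transfer, and the chance of at least one is about two in three. A single wrong byte is enough to break a compressed kernel image.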

Of course TFTP is extra dumb and doesn't deal well at all with packet loss, so it would just time out. But I remain very impressed at how well TCP managed to paper over a 90% broken network. That's the power of the Internet for you, right there.

(Thanks to jwz for having a hopefully-unrelated problem that reminded me of this.) <!-- end of entry 200805/09 -->

Syndicated 2008-05-07 18:22:52 from apenwarr - Business is Programming
