Older blog entries for Bram (starting at number 4)

31 Jul 2002 »

Stick a fork in it

BitTorrent's first mature release is up. There is also up-to-date documentation of the finalized protocol.

Defcon

I'm going to be at Defcon this year. In addition to pumping BitTorrent, I'm going to have the CodeCon 2003 Call For Papers out.

Nilsimsa

Nilsimsa has an interesting problem of, given a bunch of bit vectors, figure out which ones have a 'very close' neigbor using the hamming distance (the number of bits in common).

I figured out the following very odd randomized algorithm for this -

First, we divide the nodes into n ** (1/5) buckets of equal size. We will find the nearest neigbor of each node in its own bucket and every other bucket, and the minimum of those is its nearest neigbor globally. (Assuming it has a very close nearest neigbor, this algorithm often misses nearest neigbors if none happen to be particularly close by.)

For each bucket, we build a data structure in which each node has a 'neigborhood' of nodes it's bilaterally connected to. We define 'walking' the data structure as the process of finding a likely nearest neigbor to node x by first starting at an arbitrary node and calculating the distince from all of that node's neigbors and their neigbors to x, then hopping to the nearest of those, and repeating until we can't get any closer. In practice, walks are never more than a few hops. This data structure can be built inductively by, for each new node, walking from random different existing nodes to the new one and adding that node to the neigbor list repeatedly until there probably aren't any more neigbors of the new node to be found. This will result in a neigborhood of size about the fourth root of the number of nodes in the bucket (determined empirically).

Note that walking is somewhat nondeterministic, and doesn't always return the best value. It's a good idea to walk two or three times, to make it unlikely that a very close neigbor will be overlooked.

I believe this algorithm uses memory n ** (6/5) and runtime n ** (8/5), which seems reasonably practical.

Memory used is (number of buckets) * sizeof(data structure per bucket) = n ** (1/5) * (n ** (4/5)) ** (5/4) == n ** (6/5)

Time used for building data structures is (number of buckets ) * (time used per bucket) == n ** (1/5) * (n ** (4/5)) ** (7/4) == n ** (8/5)

Time for building is computed by x ** (1/2) per walk, times x ** (1/4) for neigborhood size, times x for the number of nodes, hence the 7/4 exponent.

Time for finding neigbors is (number of nodes) * (number of buckets) * (walk time) == n * n ** (1/5) * n ** (2/5) == n ** (8/5)

26 Jul 2002 »

Load Testing

BitTorrent continues to shrug off every load test we throw at it. I thought a 700 meg file would get more concurrent downloaders, but instead it's just resulted in more people cancelling before download completes. The 140 meg file had an 80% completion rate, the new one is more like 35%.

I'd really like to get a load test with 200+ concurrent downloaders since that makes the connectivity graph more sparse than densely connected, but it's looking very difficult to make that happen. Earlier versions got that kind of load and didn't have artifacts related to large networks, so I may just give up and declare it final without another big deployment. The network is a manually constructed to be random, rather than following a power law, so it's unlikely for anything weird to happen. Still, another test would be nice.

In any case, recent tests have gone unbelievably smoothly. The only artifact of note is that when the load suddenly jumps two orders of magnitude it takes a few minutes to get warmed up, which is hardly alarming or even surprising.

Diary Ratings

Current diary ratings seem to conflate two concepts - your estimated quality of a diary, and your confidence in that estimation. That said, I'm impressed with how well the numbers reflect my own own opinions.

Diaries which talk about the author's life seem to rate worse (at least for me) than ones which emphasize technical topics. I wonder if this is a stylistic difference of opinion, and whether other peoples's estimated values have the same phenomenon or possibly even the opposite one.

Ugly Hacks

wardv just got me to throw up on my keyboard. That's more a possible new source of security problems than a useful technique. I wonder if parseargs has buffer overflows.

Picking on Wolfram

Wolfram presents a turing machine which he claims is universal because it emulates rule 110. That assertion is just plain wrong. The turing machine in question doesn't halt in response to a computation ending, it merely stabilizes, a clear violation of the ground rules for how turing machine encodings work. It also requires an infinite pattern on the starting tape, rather than a finite number of cells representing the problem encoding and the rest in a uniform 'blank' state. This is an equally clear, although less egregious, violation.

22 Jul 2002 »

Dillo

Dillo is a lightning fast, well written browser with a small, maintainable codebase. Merit, sadly, is not the primary determiner of funding.

Free Manure

Mozilla and FreeS/WAN are both projects which have by their existence discouraged people from working on alternatives but barely shipped themselves. Criticism can't be brushed off with the excuse 'I'm not getting paid for this, buzz off'. Often we'd be better off without free contributions.

People are too politic about starting or contributing to new free software projects which compete with an existing one. Leadership incompetence and unpleasantness are valid reasons for starting an alternative, and we shouldn't be afraid of giving the real reasons for such splits. Pretending problems don't exist doesn't make them better, and the nice part of not getting paid is the emperor can't fire you for criticising his wardrobe.

Load Testing

BitTorrent shrugged off another load test. We need to try distributing a gigabyte-size file.

Only four lines of code had to be changed after this deployment. It looks like 3.0 is really going to happen.

18 Jul 2002 (updated 7 Jul 2003 at 16:27 UTC) »

A New Kind Of Plagiarism

Wolfram sued to keep the proof, written by someone else, of the keystone result in his book from being published before the book came out!

I think Wolfram's understanding of universality is weak. He completely glosses over the issues of encoding, which would be excusable for the purposes of exposition, but he commits a horrible technical gaffe created by that confusion. Specifically, he claims that because rule 110 is universal, there must be a straightforward encoding in it to compute a single generation of any of the other cellular automata. This is simply untrue, given the very non-straightforward encoding used in the proof of universality. Whether there is a proof of universality which uses a straightforward encoding is unclear.

Wolfram is clearly violating the spirit, if not the letter, of the rules of attribution, and his innovative use of the legal system to do so has chilling consequences.

New Axioms

I think this posting says there is now a known theorem which can be expressed using basic arithmetic but requires an axiom even stronger than basic set theory to prove. Fascinating.

Pipelining

Jeff Darcy correctly reconstructs our earlier conversation about pipelining. Pipelining is an important concept in networking, so I'll explain it here.

Let's say you have some data being transferred over a TCP connection in pieces. The simplest way is for the sending side spew everything, but that leaves no control on the receiving end other than to close the connection completely. In many protocols, the receiving side must explicitly ask for a piece before it is sent. The most straightforward implementation of this is for the receiving side to request a single piece, and thereafter request a new piece every time the last requested one arrives.

Unfortunately, that approach causes the sending side to pause after every piece sent, thus reducing throughput. Turning off Nagle will keep the congestion control damage from being too bad, but it still won't get rid of the pause. The Right Way is for the receiving side to send out requests for several things, then send out an additional request every time a piece comes in. This technique is called pipelining, and it ensures that the sending side always has something in queue.

Distributed Proofs

Some notes in my ongoing discussion of automated proof checkers with Raph -

Math papers generally contain coherent self-contained proofs, which result in one or more theorems. The self-contained part of the proof is straightforward, the subtlety comes in with what the proof imports and exports.

Exported theorems should be separately referenceable, since there is often more than one proof of a theorem.

I'm not sure whether an axiomatic system, such as peano's axioms, needs to be handled any differently from a theroem. Certainly, peano arithmetic can be constructed in a multitude of ways using ZFC. Whether to believe a theorem on its face, or require a proof based on something else, should be left up to the client proof checker, which is free to believe whatever it wants.

Namespacing of imported theorems is a subtle issue. Clearly clients must notice when a proof tries to use theorems about commutative groups on axioms which don't state commutativity. Also, it is important to be able to combine theorems which are both based on the axioms of arithmetic, but refer to cosmetically different statements of them. Requiring everyone to use the same encodings would likely prove unworkable in practice.

17 Jul 2002 »

Distributed Proofs

As Raph mentioned, we've discussed how to build a distributed proof system. I believe this has potential not just to be an interesting experiment, but a tool which changes how working mathematicians go about publishing their work.

A friend of mine described the way you write formal math papers is you take a proof, remove all the insight, and write that up. Mathematicians are rightfully paranoid about human error even with extensive peer review, and everything is designed around avoiding it. A practical formal proof checker would completely automate this onerous task, and allow for math papers to be insightful, perhaps even poetic, commentary on the concepts involved in a proof, with a link to the completely formal machine-checkable version.

Which is all a great vision, but to implement it I first have to understand how the universal algorithm works.

TCP

Jeff Darcy has another entry about TCP. I feel obligated to defend TCP, since it's what BitTorrent uses, with great effect. Granted, BitTorrent is doing high throughput, high latency bulk data transfers, which is exactly what TCP was designed for.

Jeff points out a very interesting article, and uses it as an example of problems in TCP. My reading of the article is very different. It's actually talking about the difficulties in congestion control and how TCP is handling them.

In a nutshell, the current implementation of TCP (Reno) is a bit brutish when it comes to doing backoff, so multiple transfers happening over the same net link have to leave a fair amount of unutilized bandwidth as safety margin to not step on each others's toes. A new implementation (Vegas) is much less brutish, making it waste much less capacity as a safety margin. The problem is that when mixing Reno and Vegas, Vegas gets much less transfer rate, because it's more polite. How to phase out Reno and phase in Vegas is a difficult problem, possibly an impossible one.

Writing your own congestion control won't solve these issues. If you even manage to write something less brutish than Reno (doubtful, a lot of work has been put into it) it will have the same meekness problems as Vegas. Any research into how to make congestion control gentle but firm is cutting edge stuff, and if you're really interested in it you should be working on TCP itself, not wasting time on some even more experimental new protocol.

So if you want congestion control, TCP is the only way to go. If you want both congestion control and low latency, I recommend using more than one connection, with the second one probably based on UDP.

I have to tease Jeff a bit for having suggested that I fix a transfer rate problem I had by turning off Nagle. I instead implemented pipelining, and the problem immediately disappeared. If a monkey is having trouble learning a trick, brain surgery is usually the last resort.

Cellular Automata

I initially thought that all linear cellular automata whose output is 'messy' must be universal, but given that Presburger arithmetic is decidable, I'm not so sure.

Jun	JUN	Jul
	30
2006	2017	2018