31 Jul 2002 Bram   » (Master)

Stick a fork in it

BitTorrent's first mature release is up. There is also up-to-date documentation of the finalized protocol.


I'm going to be at Defcon this year. In addition to pumping BitTorrent, I'm going to have the CodeCon 2003 Call For Papers out.


Nilsimsa has an interesting problem of, given a bunch of bit vectors, figure out which ones have a 'very close' neigbor using the hamming distance (the number of bits in common).

I figured out the following very odd randomized algorithm for this -

First, we divide the nodes into n ** (1/5) buckets of equal size. We will find the nearest neigbor of each node in its own bucket and every other bucket, and the minimum of those is its nearest neigbor globally. (Assuming it has a very close nearest neigbor, this algorithm often misses nearest neigbors if none happen to be particularly close by.)

For each bucket, we build a data structure in which each node has a 'neigborhood' of nodes it's bilaterally connected to. We define 'walking' the data structure as the process of finding a likely nearest neigbor to node x by first starting at an arbitrary node and calculating the distince from all of that node's neigbors and their neigbors to x, then hopping to the nearest of those, and repeating until we can't get any closer. In practice, walks are never more than a few hops. This data structure can be built inductively by, for each new node, walking from random different existing nodes to the new one and adding that node to the neigbor list repeatedly until there probably aren't any more neigbors of the new node to be found. This will result in a neigborhood of size about the fourth root of the number of nodes in the bucket (determined empirically).

Note that walking is somewhat nondeterministic, and doesn't always return the best value. It's a good idea to walk two or three times, to make it unlikely that a very close neigbor will be overlooked.

I believe this algorithm uses memory n ** (6/5) and runtime n ** (8/5), which seems reasonably practical.

Memory used is (number of buckets) * sizeof(data structure per bucket) = n ** (1/5) * (n ** (4/5)) ** (5/4) == n ** (6/5)

Time used for building data structures is (number of buckets ) * (time used per bucket) == n ** (1/5) * (n ** (4/5)) ** (7/4) == n ** (8/5)

Time for building is computed by x ** (1/2) per walk, times x ** (1/4) for neigborhood size, times x for the number of nodes, hence the 7/4 exponent.

Time for finding neigbors is (number of nodes) * (number of buckets) * (walk time) == n * n ** (1/5) * n ** (2/5) == n ** (8/5)

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!