19 Jan 2004 error27   » (Journeyer)

I wrote some neat code last week for my unreleased Vantu p2p project.

The vision is that eventually, massive databases of Creative Commons media will be created. These databases will have an md5 hash of the best/official version of the media and p2p users will use this hash to specify that they want the official version.

The problem is, users can't know until the end of the download whether the md5 hash matches.

Instead of taking the md5 hash of the whole thing, Vantu takes the md5 hash of every 4k chunk. These hashes are grouped into pairs and we take the hash of each pair. This results in another set of hashes half the size of the first. We repeat the pairing and hashing on each new set until only one hash remains. That final hash is the file signature.
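To make the pairing scheme above concrete, here is a rough Python sketch. It is not the Vantu code. The function names are made up for illustration, and three details are assumptions that the description above does not settle: a pair is hashed as the concatenation of its two raw digests, a leftover hash with no partner is carried up to the next level unchanged, and an empty file is treated as a single empty chunk.

```python
import hashlib

LEAF_SIZE = 4 * 1024  # 4k chunks, as described above


def md5(data):
    """Raw 16-byte md5 digest of data."""
    return hashlib.md5(data).digest()


def leaf_hashes(data, leaf_size=LEAF_SIZE):
    """md5 of every leaf_size chunk of data."""
    if not data:
        # Assumption: an empty file counts as one empty chunk.
        return [md5(b"")]
    return [md5(data[i:i + leaf_size]) for i in range(0, len(data), leaf_size)]


def next_level(hashes):
    """Group hashes into pairs and hash each pair."""
    out = []
    for i in range(0, len(hashes), 2):
        pair = hashes[i:i + 2]
        if len(pair) == 2:
            # Assumption: a pair is hashed as left digest + right digest.
            out.append(md5(pair[0] + pair[1]))
        else:
            # Assumption: an unpaired hash moves up unchanged.
            out.append(pair[0])
    return out


def file_signature(data, leaf_size=LEAF_SIZE):
    """Repeat the pairing until only one hash remains."""
    level = leaf_hashes(data, leaf_size)
    while len(level) > 1:
        level = next_level(level)
    return level[0]
```

Calling file_signature(data).hex() on the bytes of a file gives a printable signature. Under these assumptions, a file of 4k or less has a signature equal to its plain md5 hash.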

At the application level, it rarely makes sense to use 4k hashes. An application could use 4k, 8k, 16k, 32k, 64k, etc. chunk sizes. Vantu will probably only use 64k chunks. When the user wants to download a file, the program asks the sharer for the hashes of the 64k chunks. We can check whether these hashes are correct by continuing the pairing and hashing on them until one hash remains and comparing it to the file signature. We can also check the data in each chunk as we download it. This way, if someone is sharing bogus data, we can find out after the first chunk rather than waiting for the entire download to complete.
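The two checks described above could look something like the following Python sketch. Again, this is not the Vantu code: the function names are hypothetical, and it assumes that pairs are hashed as concatenated raw digests and that an unpaired hash is carried up unchanged. With those assumptions, the hash of a 64k chunk is just the top of the small tree built over its sixteen 4k pieces, so the 64k hashes sit four levels above the 4k hashes in the same tree.

```python
import hashlib

LEAF_SIZE = 4 * 1024    # 4k pieces at the bottom of the tree
CHUNK_SIZE = 64 * 1024  # 64k chunks used for transfer


def md5(data):
    """Raw 16-byte md5 digest of data."""
    return hashlib.md5(data).digest()


def tree_root(hashes):
    """Pair up and hash repeatedly until one hash remains."""
    level = list(hashes)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # Assumption: pairs are concatenated; a leftover moves up as-is.
            nxt.append(md5(pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]


def chunk_hash(chunk):
    """Hash of one 64k chunk: the root over its 4k pieces."""
    pieces = [chunk[i:i + LEAF_SIZE] for i in range(0, len(chunk), LEAF_SIZE)]
    return tree_root([md5(p) for p in pieces] or [md5(b"")])


def hashes_match_signature(chunk_hashes, signature):
    """Check the sharer's list of 64k hashes against the file signature."""
    return bool(chunk_hashes) and tree_root(chunk_hashes) == signature


def chunk_is_valid(index, chunk, chunk_hashes):
    """Check one downloaded chunk against the already verified hash list."""
    return 0 <= index < len(chunk_hashes) and chunk_hash(chunk) == chunk_hashes[index]
```

A downloader would call hashes_match_signature once, right after receiving the hash list, and then chunk_is_valid on every chunk as it arrives, dropping the peer on the first failure.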

It would be awesome if Vantu-style file signatures caught on in other applications as well.

Here is the code to generate file signatures.
