Quick File Distribution Challenge

Posted 4 Oct 2002 at 22:15 UTC by ncm Share This

Suppose you need to copy a gigabyte-sized file to a hundred other machines. How long should it take? On an optimal 100BaseT LAN, the naïve approach takes over two hours. What if you need to distribute several times a day?

It turns out you can do it in under two minutes with just a shellscript and common utilities -- i.e., no multicast/broadcast foolery, no hardware upgrades. Can you figure out how?

I will post the shellscript after everybody gets a chance to post their own ideas. The first to match the optimal solution gets co-inventor recognition (assuming this isn't really a familiar technique that I just haven't been able to find any reference to). If you post, please note whether you just figured it out for yourself or saw it used somewhere. Prior Art is good, these days.

This technique, BTW, is equally useful for copying ISO CD images or big tarballs to archive mirrors without jamming the pipes. (Integrating it into rsync ought to be pretty easy.)

tree / p2p distribution, posted 4 Oct 2002 at 22:41 UTC by splork » (Master)

Distributing to N different hosts? break the file into N similar sized chunks and send one chunk to each host along with a map of which host got which chunk and which part of the file that chunk is... now with coordinated communication between hosts they can talk amongst themselelves to get their missing parts (ie: each host should open up connections to the N-1 other hosts as soon as the originator says "go"; some random delay will greatly help network stacks deal with the sudden connection load)

(you could use the originating host to coordinate, ala bittorrent or pre determine all coordination at the start, handing out instructions with the original N chunks as directed above)

This assumes a switched full duplex 100mbit/sec network with a backplane on the network switch that can handle the full amount of traffic between all hosts...

this is nothing new in my mind. got to send X to N things, why transmit it N times over your own pipe when many of the N you're sending X to have pipes of their own that could be helping out. that's a fundamental idea of p2p data distribution.

rsync?, posted 4 Oct 2002 at 23:15 UTC by jstraw » (Journeyer)

all you need is ssh and rsync... the first time will take a little time, but every time past that is to send just the changes...

if it is a single file it will be even faster then if it is hundreds, cause it will only be 1 file to get a check sum on.

A stab, posted 4 Oct 2002 at 23:19 UTC by raph » (Master)

Ok, I'll take a stab at this. The solution isn't fully worked out.

The basic idea is to set up a binary tree. At each node, the script takes a list of hostnames. It invokes a left and right branch, each with approximately half the hostnames in the list.

After setting up the branches, you need to take the input, cat it to a file, and send the output to the two branches. Since you said "shell utilities", I'm thinking tail -f, but this doesn't really have the desired takeoff and landing semantics.

It's easier for me to imagine doing this with a real programming language rather than shell. In real life, I'd want to have a negotiation by which you contact a node and determine whether it's already participating. That way, a node can fail in isolation. Even more advanced is a "retry" mechanism in case a node goes down in the middle.

BitTorrent, posted 5 Oct 2002 at 01:13 UTC by Bram » (Master)

Fire up BitTorrent. All you need to run that is Python, and it works on any type of network, even if the machines are all on unreliable DSL lines.

Glad I didn't shower this morning..., posted 5 Oct 2002 at 03:49 UTC by kbob » (Master)

I was thinking about this for a while, then I decided to break for a shower. In the shower, a truly trivial solution presented itself.

Rather than post my solution and spoil it for everyone, I mailed some scripts (18 lines total) to ncm. Maybe he'll let us know whether we had the same idea.

Showering. Oops, I forgot to run naked through the town shouting, "Eureka!" (-:


How about a chain?, posted 5 Oct 2002 at 04:11 UTC by wmf » (Master)

Swarming and binary trees are fun, but a chain seems much simpler. If there are n hosts, any optimal solution will saturate the inbound link of n-1 hosts. The chain satisfies this (assuming you aren't CPU-limited), thus the fancy techniques won't do any better.

cat file | rsh host1 "tee file | rsh host2 \"tee file | rsh host3 ...\""
(I probably have the quoting wrong since I don't do much shell scripting.)

I have heard of this before, but I don't remember where.

Good response, posted 5 Oct 2002 at 04:18 UTC by ncm » (Master)

splork's answer is pretty good, and potentially optimal, but also a lot more complicated than need be. What would the shellscript look like?

Thanks for the pointer to BitTorrent. If I had known about that, I might not have discovered this. But I'm looking at a shellscript where all the real work is done in 10 non-comment lines. (BitTorrent is over 5K lines (with hardly any comments!), but it seems to solve a bigger problem.)

raph's concern about fault-tolerance is well-taken. Any solutions should work around failed nodes. Oh, and just to make it a little harder :-), you don't know how big the file is; you're reading from stdin (e.g. "tar czf - * |"). Writing input to a file and then reading it back to send it counts against your two minutes!

kbob got it. His solution is basically the same as mine, albeit rather more elegant. Thanks, Bob, for letting others have a go. (I, too, wanted to run around naked and shouting.)

Closer, posted 5 Oct 2002 at 04:36 UTC by ncm » (Master)

Wes's (wmf's) reasoning is admirably direct, and in fact my solution involves a "chaind" daemon. His construction isn't practical like kbob's was, though. (Bob, you might as well post yours.)

I'm hoping Wes (or somebody) will remember and post where he saw this before. Prior Art has got pretty important to keep track of.

How about a chain?, posted 5 Oct 2002 at 04:37 UTC by wmf » (Master)

Swarming and binary trees are fun, but a chain seems much simpler. If there are n hosts, any optimal solution will saturate the inbound link of n-1 hosts. The chain satisfies this (assuming you aren't CPU-limited), thus the fancy techniques won't do any better.

cat file | rsh host1 "tee file | rsh host2 \"tee file | rsh host3 ...\""
(I probably have the quoting wrong since I don't do much shell scripting.)

I have heard of this before, but I don't remember where.

Not bit torrent but ..., posted 5 Oct 2002 at 04:52 UTC by garym » (Master)

The use of BitTorrent goes outside the original challenge to use only shell scripts and typical unix utilities. The basic idea, though, and assuming you don't run into a network jam (ie you have a passive network hub, not a switch), is obviously a division of the problem into parallel transports.

My first-crack guess is yes, take the list and split it in N pieces, then fork N processes to take the first machine on each list and (a) send it the list subset and the script to do the transmissions (b) rsync the file to the recipient machine then (c) ssh to the receiving machine to run the transmission script.

I'm still puzzled: Each machine has only a capacity for emitting 10MB/sec, which, allowing for network effects, is about 1000/10 or two minutes per file copy transmitted times N files to transmit from each host; this is not the target of only 2 minutes for the entire 100 node process. At N=4, that's a split of 4->4->3 or 22 minutes (unless you can start the rsync send before the whole file is received) ... not even close to the two minute limit, so I won't be the least surprised, though, if there exists a totally obvious, far more efficient and blazingly elegant solution.

the time limit in 100Mb/sec (=10MB ie about 100 sec/GB) implies a cascade where every machine pulls only one file in a serial fashion A->B->C->... but doesn't that linear graph imply that B must both send and receive and thus only has 5MB/sec available for each direction? It's late and my math is probably way off.

This is a good puzzle.

Swarmcast too, posted 5 Oct 2002 at 05:07 UTC by garym » (Master)

The old swarmcast.sourceforge.net would also solve the robustness problem and had the potential to also account for weak network segments; Justin has moved on now as onionnetworks.com.

Full Duplex, posted 5 Oct 2002 at 05:16 UTC by ncm » (Master)

garym points out an interesting detail. Yes, in specifying an "optimal" 100BaseT network, I meant to imply full-duplex network cards throughout (along with good switches). Therefore, each host may be assumed to have 10MB/sec capacity in each direction.

Of course it's easy to find NICs that don't support full-duplex operation, or not by default. They may seem as fast as a real NIC if you only test them in one direction. For most uses you mightn't notice the difference. That's no reason to buy them, though.

My version of the solution, posted 5 Oct 2002 at 05:52 UTC by kbob » (Master)

Here it is, a complete application. I'll have to post to Freshmeat now. (-:


# invoke as: originator.sh recipient_hosts < bigfile
until [ -z $# ]
do  first="$1"
    rsh "$first" recipient.sh ${1+"$@"} && break


# invoked on a recipient host via rsh by originator.sh
if [ $# == 0 ]
    cat > bigfile
    tee bigfile | rsh $first recipient.sh "$@" || exec originator.sh "$@" < bigfile

This uses rsh, but it could and should use ssh.

It's untested. Probably contains serious errors that will become glaringly obvious as soon as I post it. (-:

This is a neat hack. Is it a major advance in the art? Not hardly. I have a feeling that it would be way too fragile to use in real life.


just use tee, posted 5 Oct 2002 at 10:55 UTC by walken » (Master)

I'd do somethink like


#! /bin/sh
next_peer=$1; shift
if x"$next_peer" != x""; then
  ssh next_peer "tee distrib_file | distrib.sh $@"

usage: cat distrib_file | distrib.sh peer1 peer2 peer3 peer4

peer1 uses tee to simultaneously store the file and distribute to the other peers.

data flow is a continuous stream, host->peer1->peer2->peer3->peer4, using the full available bandwidth between two peers.

I did not test it, but it should mostly work, I think.

Of course doing this means you're limited by the lowest-bandwidth peer, so it's still a good idea to use bittorrent instead (well, I've actually never used it myself, but I talked with Bram at a party few monthes ago and I thought he was really smart, so there)

damn, kbob beat me to it, posted 5 Oct 2002 at 10:58 UTC by walken » (Master)

and worse, I dont even really understand his solution.

Tee, posted 5 Oct 2002 at 14:15 UTC by garym » (Master)

tee ... yes, of course. How stupid of me. (the sound of one head whacking)

I still think, in a real network, you'll exceed 2 minutes, but I expect you'll beat my 22 minutes. I hope you'll let us know how the trial goes.

distrib and originator, posted 5 Oct 2002 at 14:22 UTC by garym » (Master)

  • If originaltor does the until loop, isn't it sending one file (one instance of recipient is run) for every host?
  • in distrib, which is wonderfully recursive and my best bet for success, shouldn't that be $* and not $@? Now it's not late, it's way too early to be thinking straight.

Next Phase, posted 5 Oct 2002 at 17:22 UTC by ncm » (Master)

From here on out I'd like to have posted solutions tested carefully. I suspect quoting anomalies in some of the code above.

The next winning entry will (besides bypassing downed hosts) ship itself to the target hosts along with the file and the host list, not assume it is already there. This should mostly be an exercise in extreme quoting and quining.

kbob's observation about fragility, and raph's about using a real language, are not to be ignored. The true expression of this technique probably should be a mode in rsync. Anybody care to code that?

p.s. people confused about kbob's solution should be careful to note the "&& break" at the end, which helps it to tolerate downed hosts. BTW, his construction ${1+"$@"}, he says, works around a bug in old /bin/sh. It may take a close reading of the man page to discover what it does. (In buggy shells, an empty "$@" evaluates as "".)

Does this work with Ethernet?, posted 6 Oct 2002 at 19:36 UTC by ciphergoth » (Journeyer)

Maybe I'm missing something important, but I would have thought that on a typical Ethernet network, you can only have one packet being broadcast at a time. If all hosts are on the same network and you don't use broadcast/multicast, you have to send every byte once for every host, so you have to send 100 gig, and having multiple hosts doing the sending doesn't help much. Are people assuming a different network topology?

Ethernet ? Sure, posted 6 Oct 2002 at 22:59 UTC by chbm » (Journeyer)

Given a good switch with plenty of backplane bandwith and full duplex links.

NetCat ?, posted 7 Oct 2002 at 00:20 UTC by Malx » (Journeyer)

search for NetCat

It is very usefull tool for network games :)
I have tested it. You could use it to send file via multicast UDPs.

cat file | nc -u -s (local interface IP) 5555

on other side -
nc -l -u -p 5555 -s > file

Netcat: Good call, sort of, posted 7 Oct 2002 at 04:55 UTC by ncm » (Master)

In fact, my own script does use netcat instead of rsh or ssh, but not to do multicasting. Multicast, by itself, isn't a reliable transport mechanism. Stuff layered on top can make it reliable, but that gets complicated. (Anyway the original challenge said no multicast, so that would be cheating.) It's true that multicast transport would reduce the total network load by half, which could be a good thing if you were using it for something else besides copying the one file.

Most IPV4 backbone routers filter out multicast traffic, so multicast is not a suitable distribution mechanism for updating mirror archives, as useful as it may be on a LAN. When we have IPV6 backbones (Speakeasy routers support IPV6 now!) then multicast will become a lot more useful, because an IPV6 router isn't allowed to filter out multicast traffic.

My script uses netcat mainly for its reliable failure modes and timeouts. The Debian version has a "-q" option, which makes it respond to EOF, that is lacking on most others. (Can anybody identify other distros that have netcat with "-q"?) The Debian version also doesn't blurt "Punt!" when it quits, another good thing.

Netcat appears to have been orphaned -- its last official release was in 1996 -- and its codebase is showing its age, depending on lots of nonportable compiler features. Somebody needs to adopt it, clean it up, and release a version 1.11 under a defensible license. I was surprised to find that its original author works in the building right next door to my office.

one ring to rule them all..., posted 7 Oct 2002 at 21:36 UTC by splork » (Master)

kbob's code seems to represent what I initally thought of when presented with this problem. It sets up the hosts in a ring fashion (anyone remember FDDI or token ring network layouts ;) such that each one takes the data it receives and saves a copy at the same time as passing it on to the next host down the line. impressive tiny shell implementation!

After having that thought i convinced myself that a ring was too pretty and having every host talk to every other host was more fun ;). I'm not about to implement that using a shell script. (though i believe with proper use of dd and netcat it could be tiny, though not robust)

I'll have to agree with raph that this is something best written in a real language to deal with errors, flakey hosts/network connections and the likes as well as doing about 10x less data copies on each host through pipes and other shellisms.

As for the multicast idea: that is the best solution on a local area network that won't consume the insane amount of network resources that the others will. however as multicast is unreliable you need to send extra data using forward error correction until all recipients have responded saying that they have the entire thing; that requires a custom app, not netcat. google for reliable multicast and dig around.

Parallel Transfers and FEC required to approach channel capacity., posted 8 Oct 2002 at 04:17 UTC by orasis » (Journeyer)

First, due to the throughput innefficiences of TCP in the face of high latency or packet loss over very high speed network, either a UDP or parallel TCP-based approach is required to fully utilize a single pipe.

Furthermore, the only known way to approach the theoretical transfer limits on a multi-path network is via the use of Forward Error Correction (FEC) via a swarming peer-to-peer protocol.

Swarming w/o the use of FEC may work well in practice, but suffers from the inneffiency that the utility of a random packet of data in the network goes down as a particular node retrieves more data.

For instance, if a file is split into 100 packets, and a node has received 99 of them, only 1 specific packet out of 99 is useful for reconstructing that file, and that 1 specific packet may not be available from an optimum node.

In the case of FEC, you expand the space of useful packets, with some implementations allowing an expansion factor up to 64000x. So, with an expansion factor of 100, lets assume we have 100 source packets, expanded to 10000 encoded packets. Now once we have received any combination of 99 packets, there are still 99901 useful packets available in the network, with one of them bound to be available from an optimum node.

This is the system that we more or less implemented in Swarmcast, though the code is now old and moldy. If you want to play around with FEC, get it here.

The new stuff that we're working on is called the "Content-Addressable Web", and you can find information about this at the Open Content Network.

One ethernet per machine?, posted 8 Oct 2002 at 14:40 UTC by abraham » (Master)

When you say it works on ethernet, am I right that you are actually talking about each host having its own ethernet, all connected on a super-switch?

So it would not work on _an_ ethernet.

Re: Parallel Transfers and FEC required to approach channel capacity, posted 8 Oct 2002 at 15:27 UTC by lukeg » (Master)

    First, due to the throughput innefficiences of TCP in the face of high latency or packet loss over very high speed network, either a UDP or parallel TCP-based approach is required to fully utilize a single pipe.

To dig up an old thread: which inefficiencies are you referring to?

Separately: does anyone know of a reliable multicast/broadcast implementation that can be downloaded and used, and would solve this file-distribution problem without the switch?

Ethernet, posted 8 Oct 2002 at 19:13 UTC by ncm » (Master)

The definition of "ethernet" has been changed by standards bodies over the years. All that's left of the original conception is found in the occasional hub in somebody's office serving as a collision domain. In an "optimal 100BaseT LAN" as found in the problem spec above, all the connections are full-duplex and those connections are to switches, and there are no collisions (really, simulated collisions or dropped frames) as long as total traffic to any endpoint is within the capacity of the channel, and the aggregate traffic is within the capacity of the switch.

What we use nowadays is switched star-topology LANs built of point-to-point links that use ethernet-style signaling. Collisions physically cannot occur, although frames may get dropped. Of course, constructing point-to-point links using ethernet hardware is terribly inefficient, but economies of scale make it both unavoidable and not correspondingly expensive. (Also, you still get to plug in a hub wherever you don't care about performance.)

If you can find a hundred-host webfarm connected to "an ethernet", chaining (like bittorrent and swarmcast) won't work well, and you may demand a full refund. Good luck.

Re: ethernet, posted 9 Oct 2002 at 15:11 UTC by lukeg » (Master)

The solution above is very neat for the problem as stated. I wonder how well it would hold up if we add some more "real-world" factors, just for fun?

For instance, do many switches actually behave "optimally"? I mean, if you try to push them to their theoretical limits by pumping data at full speed in both directions on each port, will real switches keep up? I'd be surprised if the cheapo one on my desk would, but I don't know.

Another factor is contention on the network - what if one of the machines is already doing something that uses up some of its bandwidth? With TCP, this machine will become a bottleneck for the whole chain: it can only deliver data to the next machine with the bandwidth it has, and its blocking-writes to the bottleneck will propagate the problem all the way back to the source via TCP flow-control. So the transfer rate between all pairs of machines will be reduced to the speed at the bottleneck.

NB: The bottleneck problem could be gotten around with a special version of 'tee' that reads and saves to the file as fast as possible, and separately does non-blocking writes to the pipe. But GNU tee doesn't behave like that, and if it did it'd presumably be limited to working with regular files (so it can later read-back the data for where the pipe is up to.)

For a really fast and robust solution, my money's on a fancy reliable broadcast/multicast implementation. I'm eager to see how it could be done, if any advogato hackers are into that sort of thing.

Clarification of Re: ethernet,, posted 9 Oct 2002 at 15:19 UTC by lukeg » (Master)

I should clarify what the special 'tee' would buy: it would prevent flow-control from propagating the bottleneck speed back to the source, but the bottleneck would still dictate the speed between all machines going forwards in the chain.

dump the file on DVD, posted 12 Oct 2002 at 13:42 UTC by sye » (Journeyer)

and shuffle DVDs to 100 machines. But my question is if all machines are all on the network, why do you want to have duplicated data source of over 1 G in size and over 100 in duplicated copies?

DVDs, posted 14 Oct 2002 at 19:32 UTC by ncm » (Master)

Just burning 100 DVDs (using 100 DVD burners? in parallel? connected to how many machines? how did they all get the copy to burn?) uses up your two-minute budget many times over, never mind humping the discs out to the server farm and loading them all in the little trays. Do you want to do that eight times a day, when you can do it quicker with just a script, and make cron run the script for you?

How much does the 100GB of disk space (1GB per server) cost? Incrementally, probably nothing -- the machines already have it. Even pro-rating it, it's way less than the cost of a hundred DVD drives.

Local networks have replaced sneakernet for sound economic reasons. Disk sharing remains defensible only for its secondary benefits (backup, fewer moving parts, lower power) and not absolute cost. If you want an alternative to copying the file to all the machines, installing a couple of gigabit ethernets and some file servers would be more practical. Then each server might run diskless, at lower power and with better reliability.

It's traditional in beginning network classes to compute the bandwidth of a station wagon full of tapes. (These days that would be a briefcase full of DVDs.) It's instructive to consider where you would be better off using DVDs: Bad bandwidth, static data, massive duplication, manual on-site mounting tolerable. E.g. the local video rental store. Is it any surprise that the DVD medium is tuned for movie distribution? What's surprising is how rarely conditions appropriate for DVD distribution are found otherwise.

how rarely platform gives control to DVD media?, posted 17 Oct 2002 at 19:19 UTC by sye » (Journeyer)

If there is a big red button on a wide range of hardware which says once this is pressed, DVD medium tray takes the control from all other resources. Then the table shall be turned. The problem is not that DVD rarely meets the general conditions of distribution but because DVD was created as a data media for movie industry. But if DVD has fourth bootloader on it, with device control bytes and bits and data stores. I can't see why vast space on DVD can't be made more valuable and more transparent than the fate of becoming mug pads.

Finally Got Permission, posted 5 Dec 2002 at 00:07 UTC by ncm » (Master)

I finally got permission to release my original solution to the problem under the GPL.

The send script is left as an exercise for the reader, but here's a fragment that takes the file from stdin:

{ echo "$filename"; cat; } |
   /usr/local/bin/nc -w 10 -q 10 "$remotehost" "$port"

The daemon below is less elegant than some of those above, but more robust against down nodes.

# chaind: save stdin to a file and pass it along to another host.
# Copyright 2002 by ITA Software, Inc.
# Released under the GNU Public License Version 2.
# Copy standard input to a file in /var/spool/chaind, and also to
# the next working host we find after our own hostname in the config
# file. Report number of bytes written, error messages, and any
# reports from subsequent hosts, to standard output. Read the
# file name as the first line from standard input, followed by the
# file contents.
# The config file lists hostnames. Each host opens a connection to
# the host listed after itself. Comments begin with a "#" character
# anywhere. Blank lines and whitespace are discarded.
# Dependencies:
# Assumes the utility called netcat, invoked as "nc", is on its path.
# Assumes that netcat supports "-q" (as implemented in the Debian version).
# Assumes a "chaind" is declared in /etc/services, e.g.:
# chaind 8998/tcp # chain file distributor
# Should have a userid named chaind.
# Needs a directory /var/spool/chaind, on a file system big enough
# to hold the copied file, and owned by user chaind.
# For inetd, needs a line in /etc/inetd.conf like
# chaind stream tcp nowait chaind /usr/local/sbin/chaind.sh
# (Note: inetd rereads its config when it gets SIGHUP.
# Try "kill -HUP $(</var/run/inetd.pid)".)
# For xinetd, needs a file placed (probably) in /etc/xinetd.d like:
# service chaind
# {
# server = /usr/local/bin/chaind
# socket_type = stream
# protocol = tcp
# user = chaind
# wait = no
# disable = no
# }
# (Note: xinetd rereads its config when it gets SIGUSR2.
# Try "kill -USR2 $(</var/run/xinetd.pid)".)

trap "" SIGPIPE


# ulimit -f 2000000000
umask 007

if ! which nc >/dev/null 2>/dev/null; then
  # grrrr

if [ ! -r "$configfile" ]; then
  echo "$hostname: cannot open hosts file $configfile"
  exit 1

tab=' ' # a real tab character, \t.
white=" $tab"

read filename
if [ $? -ne 0 ]; then
  exit 1
case filename in ( */.* | .*)
  echo "$hostname: bad file name $filename"
  exit 1

trap "rm -f $tmpfile" EXIT

{ # gather up stderr to pipe to stdout
  { echo "$filename"; tee "$tmpfile"; } | {
    # Find hosts listed next after us in the config file (skipping comments).
    for remote in $( \
      sed -n -e 's/#.*//' -e "s/^[$white]*//" -e "s/[$white]*$end//" \
             -e "/^$hostname$end/,$end{/^$hostname$end/d;p;}" \
      # pass along stdin, gather stdout
      PATH="$NEWPATH" /usr/local/bin/nc -w 10 -q 10 "$remote" "$port" && break
    cat >/dev/null
} 2>&1 | sed -e "s/^tee:/$hostname:/"

if [ -f "$tmpfile" ]; then
   mv "$tmpfile" "$outfile"
   trap "" EXIT
   echo "$hostname: wrote $(wc -c < $outfile | tr -d ' ') bytes."
   echo "$hostname: failed."
   exit 1
exit 0

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

Share this page