When it was first published (in RFC 977), the NNTP protocol was quite a
striking advance in Internet protocols. News
articles propagated in an entirely distributed, decentralized fashion.
The design of the protocol saved bandwidth when there were lots of
people at a site reading the same news articles, and also tolerated
failure of individual nodes. With these advantages, NNTP became one of
the most popular Internet protocols (along with email and FTP), and
for a time was the mechanism of choice for participating in Internet
"communities."
Now, 14 years later, NNTP is a protocol in serious trouble. Its
complete openness and lack of centralized control made it vulnerable
to spam, abuse, and other nastiness. While I used to "read news"
almost every day, the quality has sunk to the point where it's just
not worth it.
NNTP's popularity has been largely displaced by Web-based message
boards. These systems have many serious disadvantages compared with
NNTP, including much poorer utilization of bandwidth, the forced use
of clunky "one size fits all" HTML-based interfaces, and vulnerability
to the failure or compromise of the individual sites that host the
content. Nonetheless, people do find them significantly more useful on
balance.
In spite of the mass migration from Usenet to the Web, NNTP does
maintain a few strongholds, especially pr0n, mp3z, and warez. In
addition, a number of new protocols with peer-to-peer transmission of
files are coming down the pike, of which Napster is surely the most
popular. A lot of people seem to be working on variations of Napster,
including the Gnutella project, started by a couple of employees at
Nullsoft (the people who make WinAmp) and promptly shut down by their
corporate overlords at AOL.
Since the issues are disparate, we'll look at them one at a time, by
category.
Bandwidth
One of the main technical issues of all these protocols is bandwidth.
In the classical setup, your school has an NNTP server that talks over
the outside network to a few other NNTP servers. Within the school,
the clients talk to the news server over a very fast local network.
NNTP doesn't have any concept of loading files remotely only on
demand, so the total bandwidth tradeoff depends on the average number
of people accessing any one file (average, in this case, being mean
weighted by file size). When this number is above 1, you win. When
it's a lot more than 1, you win big.
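As a rough sketch of that tradeoff: if every file is pulled to the site
server once (NNTP-style) versus once per reader (on demand from a remote
server), the ratio of the two traffic totals on the outside link is
exactly the size-weighted mean number of readers per file. The file
sizes, reader counts, and function names below are made up for
illustration.

```python
def weighted_mean_readers(files):
    """Size-weighted mean number of local readers per file.

    `files` is a list of (size_bytes, local_readers) pairs.
    """
    total_size = sum(size for size, _ in files)
    return sum(size * readers for size, readers in files) / total_size


def outside_link_bytes(files):
    """Bytes crossing the outside link under each scheme.

    flood: every file is pulled to the site server exactly once.
    on_demand: every reader pulls each file from the remote server.
    """
    flood = sum(size for size, _ in files)
    on_demand = sum(size * readers for size, readers in files)
    return flood, on_demand


# Hypothetical site: (size in bytes, number of local readers)
files = [(2_000, 10), (50_000, 3), (1_000_000, 0)]
flood, on_demand = outside_link_bytes(files)
print(weighted_mean_readers(files))  # below 1 here: the big unread file dominates
print(on_demand / flood)             # the same ratio, computed from the totals
```

In this made-up example the small articles are widely read, but one large
file that nobody reads drags the weighted mean below 1, so pulling
everything to the site loses.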
With the Web, you basically load files directly from the central
server to the clients (this is certainly how mp3.com works). In some cases,
especially when there's a local network with a lot more load than the
connection to the Internet really can support, it makes sense to add a
caching proxy server such as Squid. Web caching has its own
set of issues, though, and basically doesn't work well unless the
server cooperates.
What's possible, of course, is a hybrid that is optimized for the
heterogeneous networks common in schools and companies. Basically,
you need the protocol to be sensitive to the relative capacities of
the networks, and try to share files between multiple clients inside
the local network, rather than duplicating their transfer from the
outside. This is one of the goals of Gnutella, and it's easy to
imagine that it will continue to be an area of active work on the part
of distributed protocols.
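A minimal sketch of the local-sharing idea, assuming a single shared
cache on the local network that fetches each file from outside at most
once. The SiteCache class and its fetch_remote callable are hypothetical
and are not taken from Gnutella or any other real protocol.

```python
class SiteCache:
    """Toy shared cache for one local network (hypothetical sketch)."""

    def __init__(self, fetch_remote):
        # fetch_remote: callable taking a name and returning bytes;
        # it stands in for a transfer over the slow outside link.
        self._fetch_remote = fetch_remote
        self._store = {}
        self.outside_bytes = 0

    def get(self, name):
        """Return the file, using the outside link only on the first request."""
        if name not in self._store:
            data = self._fetch_remote(name)
            self.outside_bytes += len(data)
            self._store[name] = data
        return self._store[name]


# Five local clients ask for the same 1000-byte file.
remote = {"song.mp3": b"x" * 1000}
cache = SiteCache(remote.__getitem__)
for _ in range(5):
    cache.get("song.mp3")
print(cache.outside_bytes)  # 1000
```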
Note that it is impossible to optimize bandwidth in this way in a
fully centralized protocol. Also note that in some networking
environments, such as people connected through a cable modem or ADSL
line, the optimization doesn't do much.
Complexity
Let's face it, centralized systems are easier to manage and deploy
than distributed ones. In a distributed protocol, you have to worry
about consistency of namespaces, make sure the propagation algorithm
works properly, and deal with things like partition of the network,
failure of remote peers, and so on. Many of these problems pretty much
go away in a centralized system.
Further, to really take advantage of the added robustness
possible in a fully decentralized system, you need client support to
browse different servers and select the one with the best
availability. This is quite a bit harder than just doing a DNS lookup
on the server's domain name, then connecting on a socket.
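To give a feel for the extra client work, here is a minimal sketch of
picking the most responsive server from a list of candidates. The
function names and the use of TCP connect time as the availability
measure are assumptions for illustration; a real client would also have
to discover the candidates in the first place.

```python
import socket
import time


def connect_time(host, port, timeout=2.0):
    """Seconds taken to open a TCP connection, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def pick_server(servers, probe=connect_time):
    """Return the reachable (host, port) with the lowest probe time, or None."""
    timings = []
    for host, port in servers:
        elapsed = probe(host, port)
        if elapsed is not None:
            timings.append((elapsed, (host, port)))
    if not timings:
        return None
    return min(timings)[1]


# Usage with a fake probe, so the example runs without touching the network.
fake_times = {
    ("a.example", 119): 0.30,
    ("b.example", 119): None,  # unreachable
    ("c.example", 119): 0.10,
}
best = pick_server(list(fake_times), probe=lambda h, p: fake_times[(h, p)])
print(best)  # ('c.example', 119)
```

The centralized equivalent is a single create_connection call on one
known host name.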
Yet, the added complexity shouldn't be overwhelming. NNTP, after all,
has lots of clients and servers by now.
Control
Here, I think, is the crux of the distinction between centralized and
distributed protocols. In a centralized system, there is a single
point of control for things like controlling access, blocking and
removing spam, etc. In a distributed system, this kind of control is
difficult or impossible.
The lack of controllability is both a good thing and a bad one. While
nobody likes spam and other forms of abuse, the anarchic nature of the
Internet is one of its more appealing features. In particular,
decentralized systems seem to be particularly resistant to censorship,
both blatant and the more subtle forms resulting from economic
pressures.
From a censorship point of view, content lies on a spectrum from
official propaganda and corporate-sponsored messages to flatly illegal
stuff, with a lot of the interesting stuff in between. Thus, it's not
surprising to see that a lot of the less "official" stuff, such as
copyrighted music, pr0n, and warez, gravitates to the more
decentralized forms, while e-commerce takes place entirely with
centralized servers.
Note that censorship and resource utilization have been linked for a
long time. Schools all over the world are now banning Napster because
of the intense network utilization. Back in the good old days, the
protocol of choice for warez and similar stuff was FSP,
which had the major property that it degraded gracefully under load,
simply throttling the transfer speed rather than killing the network.
What next?
The success of Napster is fueling a renaissance in distributed
protocols for file distribution. While a lot of the development is
currently ad hoc, it should be possible to learn from the successes
and failures of systems which have gone before, and systematically
design new stuff that works pretty well.
In the 14 years since NNTP was specified, a number of techniques have
come to light which can help fix some of its limitations. These
include:
- Protocols such as rsync and xdelta for
synchronizing remote systems.
- The use of hashes to define a collision-free global namespace.
- Public key cryptography, particularly digital signatures for
authentication.
- Systems such as PolicyMaker and KeyNote for
implementing policies.
- A ton of academic research on special problems within distributed
systems.
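As one illustration of the hash-based namespace item above: naming each
file by a cryptographic hash of its contents lets independent nodes
agree on names without coordinating, and lets a receiver check what it
was sent. This is a sketch only. SHA-256 is my choice here (the article
does not name a hash function), the ContentStore class is hypothetical,
and "collision-free" holds only in the sense that collisions are
computationally infeasible to find.

```python
import hashlib


def content_name(data: bytes) -> str:
    """Name a blob by the hex SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy node-local store keyed by content hash (hypothetical sketch)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        name = content_name(data)
        self._blobs[name] = data
        return name

    def get(self, name: str) -> bytes:
        data = self._blobs[name]
        # Re-hash on the way out, as a receiver would for data from a peer.
        if content_name(data) != name:
            raise ValueError("content does not match its name")
        return data


# Two nodes that never talk to each other still assign the same name.
node_a, node_b = ContentStore(), ContentStore()
print(node_a.put(b"hello") == node_b.put(b"hello"))  # True
```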
Further, there are a bunch of exciting new things that might just nail
the spam and abuse problems that seem to be endemic to distributed
communications. These include the existing work from people such as
SpamCop and NoCeM, as well as the trust metric work being pioneered on
this very website.
Advogato modestly predicts a renaissance in distributed protocols. The
next few years seem like a very exciting time for new work in this
area.
An interesting distinction to make in system design is the difference
between "distributed" and "decentralized". It's useful to reserve the
word "distributed" to talk about the fact of moving bits from place to
place. Pretty much any system on the Internet is distributed. The
question is how they're distributed.
Some systems on the Internet are fully decentralized - the Web is the
premier one. Some systems are centralized, such as a single Web
discussion board. In between are hierarchical systems: DNS falls in
this category, where there is a single tree of authority but plenty of
caching along the way.
NNTP and the current Internet backbone architecture both fall in a
different category. Neither system is fully decentralized: there's
still a strong tree shape to the network, where leaf sites get feeds
from upstream. However, neither system is fully hierarchical either:
at the highest levels, traffic is mutually peered and shared between
sites; there is no root authority like InterNIC.
Each type of design - centralized, hierarchical, semi-hierarchical, or
fully decentralized - has its advantages. Centralized is the easiest
to understand, but the least scalable and the least fault tolerant.
Hierarchical has done well - the success of DNS over the last 20 years
is nothing short of phenomenal. But hierarchical implies a root
monopoly, and we've seen those disappear over time with things like
the current Internet route peering architecture.
What's less clear is how fully decentralized systems will fare. Full
decentralization works very well on the Web, but only because we have
full text search engines to knit things together. I think the most
exciting area of
future Internet research lies in this regime. The payoff could be
huge, building truly scalable and self-healing systems. But the
complexity is very difficult to manage.
The World Wide Web wasn't the only contender for a successful
decentralised hypertext system. The HyperGratz system from Austria was
(is?) another, and for a while was more widely deployed.
HyperG/HyperGratz had a distribution and cache mechanism that was
vastly more efficient than the WWW. It also had the idea that to run a
server, you simply filled in a form and applied to someone in Austria
to be added, and they'd tell you where your content fit into a global
hierarchy.
This sort of bureaucratic centralisation is probably as "unAmerican"
as you can get. I've portrayed it a little brutally to try and make
clear how it might sound in North America.
The political advantage of WWW is that anyone can set up a server right
away, with no need to interact with anyone. You can do that with
netnews too, both with NNTP and with the older uucp transport and
B-bnews distribution methods. The technical advantage is simplicity.
Another distinguishing factor is the document life cycle. Usenet
articles last anywhere from hours to weeks; web pages last from hours
to years, or to years after they are out of date. AIM messages last
seconds, and unless someone saves a log, IRC messages last a few
minutes, or the length of your scrollbar.
Bandwidth is less important in IRC than minimising interruption of
service, for example (which is why the minimum spanning tree routing
is inappropriate), whereas a two-hour interruption in a Usenet feed
might not be noticed.
The technology and the politics have to work together, and have to be
appropriate for the content, the users and the way the content is
used.
One of the important areas of concern between distributed and
centrally managed systems is, as was mentioned by jwz, anonymity and
privacy versus control and responsibility. Within any centrally
managed system, it is a trivial exercise to implement safeguards to
limit abuse. The drawback is the inherent limit this would impose on
privacy. The users have little choice beyond trusting that the central
authority will not do "Bad Things" with the information it tracks.
Unfortunately, there are numerous examples of companies that will sell
all the personal information they can find, because someone is willing
to buy and subsequently abuse it.
The picture within a truly distributed system is actually somewhat
worse (IMHO). With no controls on user activity because of complete
anonymity, there is no longer a limit on the irresponsibility of the
users. I might choose to steal a "respectable" online identity and
post bogus stories to the effect that VALinux is about to report
losses triple analysts' estimates, just to see what happens to the
stock price. In a completely anonymous internet, who could stop me?
Only I can, assuming I have some sense of ethics that identifies such
behaviour as unacceptable.
I think we've all seen what happens within such a large community
where the only restraints are personally supplied ethics - the level
of online crime/abuse grows daily. The problem is that the concerns
about abuse of personal privacy are just as valid. It's a question of
who do you trust with what - the large organizations with some amount
of information or individuals with some amount of anonymity.
In the real world, this issue is dodged with the concept of
"reasonable limits", both on an individual's right to privacy and an
organization's ability to compromise that privacy. It is an offence,
to use an example, for law enforcement to listen to your phone
conversations unless there is compelling evidence that you have
committed a crime and the only way to legally prove that crime is
through a wire-tap. In that case, a tap warrant can be issued, at
which point you lose the element of privacy you believe you have with
your phone conversations. The reasonable limits are on law enforcement,
to prevent trolling for criminals with taps, and on your privacy in
the face of evidence of a crime. How many people have heard of cases
where this law has been abused? This type of activity works only
because there is a limit on the level of anonymity achievable in the
"real world" as opposed to the "electronic world" - I might not know
who you are, but I can remember your face, so you can still be
identified. There is no comparable limit in cyberspace at this time.
I have yet to see a reasonable response to this set of problems from
either online communities or governments. There has to be a middle
ground between reasonable privacy and reasonable control to limit
abuse (abuse can only be eliminated completely when privacy is
eliminated completely, and even in that case, it's a questionable
call). I honestly don't know what it is, but I hope more discussion
and possibly some undreamt-of technology will get us there. The answer
is, to the best of my ability to guess, going to require some option
between distributed and centrally managed systems. Maybe
community-managed nodes within a distributed system? It comes back to
who can you trust with what.