Distributed Search Engines

Posted 11 Aug 2000 at 04:26 UTC by billgr

Wrestling good results out of search engines is as common today as waiting for pages to download. Is it too much to ask for searches to be complete and accurate? Is the centralized nature of search engine databases part of the problem?

In the old days, before the web alone cleared a billion pages, Archie or Veronica (and later Lycos) were pretty good at giving up-to-date, complete results for their domains (FTP file titles, gopher resources, and the early web). As the Internet grew in size and complexity, search engines began serving stale results, incomplete coverage, and hits drowned out in noise. Even Google, which in my opinion is the best current option, struggles with these issues.

How much of this weakness is related to the use of centralized search databases? A centralized system (even with several hubs) has severely limited bandwidth compared to the size of the net. It can only crawl so fast. Another problem is dealing with structured content. Complex websites may have database records instead of HTML pages as their basic unit, but this is lost on search engines, which then serve stale results. And of course, robots.txt barriers are insurmountable for those sites that want to control their own entry points. Not insignificantly, centralized repositories which channel huge quantities of traffic form important targets for search engine spam: pages designed specifically to mislead search engines.

An alternative (which would certainly have its own problems) would be a distributed search results model. This network would be structured something like DNS, with local ISPs and big data providers running their own servers, whose results get cached by use frequency in other search servers around the world. A distributed model would alleviate several of these problems (a rough sketch of such a per-site server follows the list):

  1. Shared bandwidth: It is fast to crawl the LAN.

  2. Know your own content: specialized URIs can index the true data objects of a given site, and entry controls can be established if desired.

  3. Lower stakes. Spamming 10 big search engines is worthwhile. Spamming 10 little search engines (out of 10,000) is useless. Also, the local net can be better policed by people who care deeply about their reputation with other search services.
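To make item 2 a bit more concrete, here is roughly what one of those per-site servers could look like. This is a toy sketch in Python; the class name, methods, and example URLs are invented, not an existing protocol:

    # Sketch of a per-site search node: it indexes only its own pages
    # (crawled over the LAN) and answers queries for that slice of the web.
    from collections import defaultdict

    class SearchNode:
        def __init__(self, site):
            self.site = site                  # e.g. "www.example.edu"
            self.index = defaultdict(set)     # term -> set of local URLs

        def add_page(self, url, text):
            # A real node would crawl its LAN and honor the site's own
            # entry points; here we just tokenize whatever text we're handed.
            for term in text.lower().split():
                self.index[term].add(url)

        def query(self, terms):
            # Return local URLs containing every query term.
            hits = [self.index.get(t.lower(), set()) for t in terms]
            return set.intersection(*hits) if hits else set()

    node = SearchNode("www.example.edu")
    node.add_page("http://www.example.edu/physics", "quantum optics lab pages")
    node.add_page("http://www.example.edu/admissions", "undergraduate admissions")
    print(node.query(["quantum", "optics"]))

A real server would obviously store more than bare URLs and speak some query protocol over the network, but the point is that each node only ever has to know its own corner of the web.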

There are big problems to be addressed, of course. How would routing and caching of search queries and results work? How would credibility be maintained and bogus search servers excluded? Who would be responsible for maintenance on all the distributed servers?

Here are some preliminary possibilities. Something like the well-tested DNS model could be used for routing. Obviously much larger caches would be needed, and it would be a much bigger affair to set up a search results server than a DNS node. For maintaining credibility, something like the Usenet 2 system, or an Advogato-inspired trust network, could maintain the integrity of the core system. Maintenance could be enforced by the same system. Presumably, local ISPs and big data centers would be eager to maintain their high-credibility rankings and so maintain their links and enforce relevance by pulling the plug on spammers, whose access they control.
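As for caching by use frequency, one plausible reading is a least-frequently-used cache of query results at each server. The sketch below is only that reading, in Python, with an arbitrary size limit; none of it is mandated by the proposal:

    # Frequency-based result cache: popular queries stay, rarely repeated
    # ones are evicted first. Purely illustrative.
    from collections import Counter

    class QueryCache:
        def __init__(self, max_entries=100_000):
            self.max_entries = max_entries
            self.results = {}        # normalized query -> cached result list
            self.hits = Counter()    # how often each cached query is asked

        def _key(self, query):
            return " ".join(sorted(query.lower().split()))

        def get(self, query):
            key = self._key(query)
            if key in self.results:
                self.hits[key] += 1
                return self.results[key]
            return None

        def put(self, query, results):
            key = self._key(query)
            if key not in self.results and len(self.results) >= self.max_entries:
                # Evict the least frequently asked entry.
                coldest = min(self.results, key=lambda k: self.hits[k])
                del self.results[coldest], self.hits[coldest]
            self.results[key] = results
            self.hits[key] += 1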

Most of the code is written already, posted 11 Aug 2000 at 06:37 UTC by doobee » (Journeyer)

I have read several articles about distributed searching in the last few days. They are all in the spirit of the p2p hype.
Most of the functionality the authors are thinking of is already implemented in real code (TM). From the Harvest site:

The Harvest indexer offers a distributed solution to the problems of indexing data made available on the web. With each web server running a local Gatherer feeding into a central Broker many of the problems of web crawling are avoided. The Harvest Indexer can fetch and index data made available by HTTP, Gopher, FTP, or NNTP. It has summarisers capable of indexing data in a wide variety of file formats.
To most people Harvest's offspring, the distributed web cache Squid, is better known than its father.
Harvest already supports distributing load and knowledge between gatherers and brokers. It for sure needs some polishing, but it can easily be used as groundwork for distributed searching that will work tomorrow, not next year.
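For readers who haven't met Harvest, the gatherer/broker split is the key idea: a gatherer runs next to the data and ships compact summaries, and a broker merges summaries from many gatherers and answers queries. The toy Python below is not Harvest's actual interface or record format, just the shape of the pattern:

    # Toy gatherer/broker pattern (NOT Harvest's real code or record format).
    class Gatherer:
        """Runs next to a site's data and produces compact summary records."""
        def __init__(self, site):
            self.site = site

        def gather(self, pages):
            # pages: {url: full text}; summarize locally, ship only summaries.
            return [{"url": url, "site": self.site,
                     "terms": set(text.lower().split())}
                    for url, text in pages.items()]

    class Broker:
        """Collects summaries from many gatherers and answers queries."""
        def __init__(self):
            self.records = []

        def feed(self, summaries):
            self.records.extend(summaries)

        def query(self, term):
            return [r["url"] for r in self.records if term.lower() in r["terms"]]

    broker = Broker()
    broker.feed(Gatherer("ftp.example.org").gather(
        {"ftp://ftp.example.org/README": "mirror of gnu source code"}))
    print(broker.query("gnu"))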

BTW: I wonder why nobody took the GPLed ftpsearch engine to build something cool.

Meta, posted 11 Aug 2000 at 13:40 UTC by xach » (Master)

Please, please, please avoid this style of random bold text. It makes the article quite difficult to read.

Getting good search results is the result of good searches, posted 11 Aug 2000 at 19:13 UTC by stripling » (Apprentice)

Most people seem to type in a query without giving it careful thought and then get several thousand "answers," look at the top 10, then move on to another search engine. People don't use the advanced features of the usual search engines to do boolean searches, and people don't use the unusual search engines, such as InvisibleWeb, AlphaSearch, Northern Lights, Complete Planet, and many others which provide free information not ordinarily available to web crawlers.

Having a Napsterized search engine may just provide another mishmash of thousands of results. Some people use robots.txt to avoid having their server overwhelmed by the hundreds of spiders now crawling the net, not just to keep their data "secret." My small Web site had 3,751 page requests for the first three weeks of July, and 1,371 of those were for robots.txt. My big web site served over 850,000 files with 1,016 requests for robots.txt. It's the small servers that get hurt by crawlers, not the big ones. I'm not sure that adding ubiquitous crawlers helps me on my small site at all.

I would rather people learn to formulate a search request and fine tune it to eliminate more and more spurious answers from Google or Alta Vista rather than generate viral spiders. I would rather people use more specifically helpful search engines tuned to their search requests than uncontrolled spiders that ignore robots.txt in the hope of ferreting out "secret" information.

Phil

Harvest engine, posted 11 Aug 2000 at 22:19 UTC by billgr » (Journeyer)

doobee, thanks for the link! I did a search (of course) while thinking about this, but (oddly enough :-) Harvest didn't show up anywhere near the top of the half-dozen queries I tried. Another quote from the docs:

Our measurements indicate that Harvest can reduce server load by a factor of over 6,000, network traffic by a factor of 60, and index space requirements by a factor of over 40 when building indexes compared with other systems, such as Archie, WAIS, and the World Wide Web Worm.
So the benefits are indeed non-trivial. This system assumes a single entity running the whole thing, however. It seems to me that the kind of distributed search that would be implementable on today's internet would have to rely on various entities, requiring some sort of trust network to govern it.

Phil, the idea isn't a million spiders crawling the whole web, the idea is a million spiders crawling their local network. So your site blocks all but your local spider, and voila, your search engine bandwidth drops by an order of magnitude. Will people still use sucky queries? Of course. :-) This is a baseline problem not amenable to attack by present technology, unfortunately.
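"Blocks all but your local spider" really is just two stanzas of robots.txt. Here is such a file, with a made-up user-agent name, checked with Python's standard robots.txt parser so a site can confirm the policy does what it thinks:

    # A robots.txt that admits only a (hypothetical) local spider.
    from urllib import robotparser

    robots_txt = """\
    User-agent: local-campus-spider
    Disallow:

    User-agent: *
    Disallow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    print(rp.can_fetch("local-campus-spider", "/research/index.html"))  # True
    print(rp.can_fetch("SomeGlobalCrawler", "/research/index.html"))    # False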

[Nielsen on writing for the web] ... I thought I would try it out. :-)

So go for it!, posted 11 Aug 2000 at 23:24 UTC by Ankh » (Master)

The Dublin Core is an example of metadata aimed at helping precision and recall; right now, you can only rarely say "find me all the web pages written by Liam Quin," because a web page might not contain its author's name. The Dublin Core addresses that issue.
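As a concrete example, a page can declare its author with a Dublin Core meta element, and a crawler can pick it up with a few lines of Python. The sample page below is invented; "DC.creator" is the usual HTML convention for the Dublin Core creator element:

    # Extract Dublin Core metadata from a page's <meta> elements.
    from html.parser import HTMLParser

    page = """<html><head>
    <title>Notes on XML</title>
    <meta name="DC.creator" content="Liam Quin">
    <meta name="DC.date" content="2000-08-11">
    </head><body>...</body></html>"""

    class DCMeta(HTMLParser):
        def __init__(self):
            super().__init__()
            self.dc = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                if a.get("name", "").lower().startswith("dc."):
                    self.dc[a["name"]] = a.get("content", "")

    p = DCMeta()
    p.feed(page)
    print(p.dc)   # {'DC.creator': 'Liam Quin', 'DC.date': '2000-08-11'}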

Tools such as Metacrawler are a first step towards distributed searching. One difficult question is how they are to be paid for. On a small scale, a tool such as Metacrawler can carry adverts, and links to the feeder sites. As you scale up, each feeder gets less and less advertising space. Of course, as the demands reduce on the individual feeders, it matters less, but there's a minimum cost associated with running a server.

If all the feeders use the same search engine, you can use ranking in the style of Salton et al., although that breaks down as the material becomes more disparate. For example, I've been working on putting a set of 30 encyclopaedias on the web. The full text is available by subscription only, but search engines are given access to the abstracts. Free advertising for the publisher (my client), but aggravating for people who get hits on them and can't read the text of the article. Worse, these articles are almost always "highly relevant" by Salton's metric.

So there are possible misuses, as there are with all technology.

I think this is an area where research is needed, and maybe some good papers at WWW9 or 10 or ACM SIGIR could come out of it. And maybe some useful software too!

Some numbers..., posted 12 Aug 2000 at 08:43 UTC by tetron » (Journeyer)

According to the Internet Software Consortium, there are roughly 72 million hosts on the internet, and according to Netcraft approximately 18 million of those are web servers.

Let's say that there are 50 million users on the internet at any given minute, and at any given minute 1% of those users are doing a web search. This means there are 500,000 requests going through per minute, or roughly 8,300 per second. Feel free to dispute these numbers because I'm more or less pulling them out of my arse :-) My intuition is that they are probably higher, and will certainly be much higher in the future.

Second, the number of web pages out there is in the billions. Google lists just over a billion, but once you consider pages that are blocked, dynamic, or were just plain missed, the actual number of pages could easily be several times that.

A "pure" distributed search might involve contacting every web site and submitting your search directly. The up side would be that (assuming each site kept recent indexes) you would have up-to-the-minute results, no dead links. The downside is that this obviously impossible, as there is neither the bandwidth to contact 18 million web servers nor the CPU power for each of those servers to handle 8000+ requests per second. This is a lot like how Gnutella works, from what I understand.

The other end of course is to have a single index maintained at a single site. All search requests go to that index, and a single computer (or cluster) serves that 8000 requests each second. Very fast and bandwidth friendly, but also very prone to going out of date, since it must index the _entire_ several billion pages of the web on a regular basis. To index everything in a month would require processing 1000000000/(24*3600*30) = 385.80 pages per second. Although that's do-able now, the number of pages on the web is growing with a steeper _exponential_ curve than Moore's law...
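Here is the back-of-the-envelope arithmetic in a form you can poke at; every input is an assumption carried over from the paragraphs above:

    # Rough numbers from the estimates above; adjust freely.
    users_online = 50_000_000      # users on at any given minute
    searching    = 0.01            # fraction of them searching in that minute
    pages        = 1_000_000_000   # rough size of the indexable web
    reindex_days = 30              # target: reindex everything monthly

    queries_per_second = users_online * searching / 60
    pages_per_second   = pages / (reindex_days * 24 * 3600)

    print(f"{queries_per_second:.0f} queries/sec")   # ~8333
    print(f"{pages_per_second:.1f} pages/sec")       # ~385.8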

The current search system is actually between these two extremes. We have many redundant search engines, each of which attempts to keep a centralized database. To some extent you get the worst of both worlds: the stale page problems of centralized systems and the bandwidth problems of distributed systems. On the other hand, because the system is redundant (don't like what Yahoo gives you? Try Lycos. Don't like what Lycos gives you? Try Google. Don't like what Google gives you? Try Altavista... etc) you are both protected from server crashes and from deficiencies in any one database. The value of redundancy in this case should not be underestimated.

What I would be interested in seeing would be a _partially redundant_ distributed system. What I mean by this is that any one node would not be expected to know the entire web. Rather, each node specializes in a certain domain (perhaps a certain set of DNS domains) which is small enough that it can be scanned on a regular basis (once a week, or even every few days). However, the set of domains for each node overlaps - meaning the loss of a single node doesn't affect the system, because the entirety of its database can be reconstructed from the databases of several other nodes which share parts of the domain set.

Searching can now proceed by submitting one's query to each of these servers, of which there might be a couple dozen or so. To minimize network bandwidth, we could even use multicast routing :-)
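One simple way to get that overlap is to hash each domain onto k of the n nodes, so losing any single node still leaves k-1 copies of its slice. A toy Python version follows; the node names and the replication factor are made up:

    # Assign each DNS domain to k of the index nodes ("rendezvous" style),
    # so every domain survives the loss of any one node. Toy example.
    import hashlib

    NODES = [f"index{i}.example.net" for i in range(12)]   # hypothetical nodes
    K = 3                                                  # copies per domain

    def nodes_for(domain, nodes=NODES, k=K):
        # Rank nodes by a hash of (domain, node) and take the top k.
        ranked = sorted(nodes,
                        key=lambda n: hashlib.md5(f"{domain}|{n}".encode()).hexdigest())
        return ranked[:k]

    for d in ["caltech.edu", "kernel.org", "advogato.org"]:
        print(d, "->", nodes_for(d))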

To take this to the extreme, one could extend HTTP to allow for page subscriptions - meaning that subscribing nodes are pushed some kind of update notification when a specific page changes. Search engines could then maintain totally up-to-date indexes and minimize load on both the server and the search engine.
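No such subscription mechanism exists in HTTP, but the shape of it is easy to sketch: the server keeps a list of interested index nodes per page and pushes each one a notice when the page changes. Everything below (the registry, the callback URLs, the JSON notice) is invented:

    # Hypothetical page-change subscriptions: a web server keeps a list of
    # interested index nodes per page and POSTs them a notice on change.
    import json
    import urllib.request
    from collections import defaultdict

    subscribers = defaultdict(set)   # page path -> set of callback URLs

    def subscribe(page, callback_url):
        subscribers[page].add(callback_url)

    def page_changed(page, new_checksum):
        notice = json.dumps({"page": page, "checksum": new_checksum}).encode()
        for callback in subscribers[page]:
            req = urllib.request.Request(callback, data=notice,
                                         headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=5)   # best-effort push
            except OSError:
                pass   # subscriber unreachable; a real system would retry or expire it

    subscribe("/research/index.html", "http://indexer.example.net/notify")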

Somehow I doubt we're going to see anything like this, though. It would require either that a single company set up a whole lot of geographically and networkologically disparate servers, or that multiple search engine companies somehow collaborate (yeah, right). This is the problem with most real-world distributed systems: they assume all the parties involved actually WANT to cooperate. In the case of companies out to make a buck, that's not necessarily the case.

There is also a basic mismatch between the bandwidth and CPU that enough random hackers could muster and what a viable distributed network needs - you might be able to get together a ton of small nodes, but the fact of the matter is that the optimal distribution (and in particular bandwidth utilization) is probably going to lean more towards a few dozen bigass computers than thousands of small ones. I won't mind being proven wrong though :-)

Multicast would make this possible... , posted 14 Aug 2000 at 19:28 UTC by bbense » (Journeyer)

From a reply above:

A "pure" distributed search might involve contacting every web site and submitting your search directly. The up side would be that (assuming each site kept recent indexes) you would have up-to-the-minute results, no dead links. The downside is that this obviously impossible, as there is neither the bandwidth to contact 18 million web servers nor the CPU power for each of those servers to handle 8000+ requests per second. This is a lot like how Gnutella works, from what I understand.

I have long maintained that the "killer app" for multicast IP is distributed searching, or more generally the problem of resource discovery. Multicast would allow you to make this kind of search without chewing up enormous amounts of bandwidth. The problem is not asking the question, it's getting the answer. The big problems are:

1. How do you deal with the resulting deluge of answers?

2. How do you deal with "bad actors" that send "false positives" (i.e., the sex site that sends its URL for every search)?

It's really interesting to me to see the huge industries that have grown up around fundamental design flaws in http.

It seems to me that a middle way that had a centralized search site, some caching of results and a standardized way of doing a multicast search request would stem the tide somewhat. There is no way the current model of a centralized index of all web pages can scale.
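The querying side of such a multicast search is only a few lines of Python; the group address, port, and message format below are invented, and the deluge and bad-actor problems are exactly as unsolved as described above:

    # Send a search query to a (hypothetical) multicast group and collect
    # whatever answers arrive before a deadline.
    import socket
    import time

    GROUP, PORT = "239.192.0.42", 8765   # arbitrary org-local multicast group
    DEADLINE = 2.0                       # seconds we are willing to wait

    def multicast_search(query):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
        sock.sendto(query.encode(), (GROUP, PORT))

        results = []
        sock.settimeout(0.2)
        stop = time.time() + DEADLINE
        while time.time() < stop:
            try:
                data, server = sock.recvfrom(64 * 1024)
            except socket.timeout:
                continue
            results.append((server, data.decode(errors="replace")))
        sock.close()
        return results   # the "deluge" is handled by simply stopping

    print(multicast_search("distributed search engines"))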

- Booker C. Bense

Re: Meta, posted 18 Aug 2000 at 13:32 UTC by matt » (Journeyer)

I don't know, the random bold text kinda makes it look like a Google cached page with hits highlighted. :-)

More Numbers:, posted 18 Aug 2000 at 19:49 UTC by billgr » (Journeyer)

tetron,

a _partially redundant_ distributed system. What I mean by this is that any one node would not be expected to know the entire web. Rather, each node specializes in a certain domain (perhaps a certain set of DNS domains) which is small enough that it can be scanned on a regular basis

This is pretty much exactly what I'm suggesting, as well, with the addition of caching so that when search requests come in to, say, the caltech.edu search server, it first looks in its cache for hits, then goes out to its peers and asks them. At some point, I get bored of waiting for thousands of results to crop up, or get satisfied with what has come over already, and stop the dig. An added benefit is that if I'm really patient, I can just wait for days until the search process grinds through every single server on the net. (Of course, this is glossing over a lot of issues...)
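That cache-first, ask-peers-until-satisfied loop might look something like the sketch below. The peer objects, the cache interface, the "enough" threshold, and the patience limit are all assumptions for illustration:

    # Check the local cache, then fan out to peer search servers in parallel,
    # stopping as soon as we have enough hits or run out of patience.
    from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                    TimeoutError as FuturesTimeout)

    def search(query, cache, peers, enough=50, patience=5.0):
        hits = list(cache.get(query) or [])
        if len(hits) >= enough:
            return hits                       # satisfied straight from the cache

        with ThreadPoolExecutor(max_workers=max(1, len(peers))) as pool:
            futures = [pool.submit(peer.query, query) for peer in peers]
            try:
                for done in as_completed(futures, timeout=patience):
                    hits.extend(done.result())
                    if len(hits) >= enough:
                        break                 # got satisfied (or bored)
            except FuturesTimeout:
                pass                          # patience ran out; keep what we have
            # A real implementation would also cancel the stragglers here.

        cache.put(query, hits)
        return hits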

Local servers can be expected to keep up-to-speed on their corner of the web. And of the 10k queries per second, probably 95% are cacheable -- I've done some search engine voyeur watching to look for what search terms people use, and there is more redundancy than one might expect.

Another benefit to such a system would be that even if it didn't work, and we still wanted a few big centralized repositories, they would benefit enormously from the distributed "tendril" search servers. Instead of having three dozen robots crawl my site, my search server subscribes to page diff notices from me (or crawls once a week), and then communicates results in compressed form over high speed lines to the three dozen centralized repositories. Or heck, they can make tapes and fed-ex them. :-) (Unattributed quote: Never underestimate the bandwidth of a station wagon stuffed full of tapes flying down the freeway at 80mph :-))

What would make the search engines cooperate? robots.txt files excluding all but one server from practically every website would do it. ISP blocks at the router level on traffic from search engines would do it.

Booker,

1. One answer is to stop the search process when you either get bored or when you have what you need. Hopefully what you need comes out of the local cache, so most of the net doesn't even know you asked for it.

2. This is a big problem. Trust networks, a la Advogato, are really needed for any kind of distributed venture like this. Would the Advogato method work exactly? I don't know. That's what this place is about, though: finding out. :-)
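Advogato's real metric is computed with network flow, which is more than a comment deserves, but the flavor of "only talk to servers that people you trust will vouch for" fits in a few lines. The certification graph, server names, and hop limit below are all invented:

    # Crude stand-in for a trust metric: accept a peer search server only if
    # it is reachable from our trusted seeds within a few certification hops.
    from collections import deque

    # who-certifies-whom: seed servers vouching for ISP and campus nodes
    certs = {
        "seed.example.net":  ["isp-a.example.net", "campus.example.edu"],
        "isp-a.example.net": ["isp-b.example.net"],
        "isp-b.example.net": ["spammy.example.com"],   # a bad cert someone should revoke
    }

    def trusted(server, seeds, max_hops=2):
        seen, queue = set(seeds), deque((s, 0) for s in seeds)
        while queue:
            node, hops = queue.popleft()
            if node == server:
                return True
            if hops < max_hops:
                for nxt in certs.get(node, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, hops + 1))
        return False

    print(trusted("campus.example.edu", ["seed.example.net"]))   # True
    print(trusted("spammy.example.com", ["seed.example.net"]))   # False: too far from the seeds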

Some More Technical Issues with Distribution, posted 21 Aug 2000 at 14:15 UTC by harrisj » (Master)

Well, this issue is of a somewhat professional interest to me, since my company is also in the search engine business. It's intriguing, but I think you should definitely look into what Harvest does first before you start writing any code. The Dublin Core is also fascinating, but it is difficult to find many web pages that comply.

I think the distributed model is a nice future for the web, but there are some other technical issues that also complicate things. Tetron has already noted the need for multiple redundancy, since there can be a problem with nodes going down. The multiply redundant search engine he describes seems similar to what all of the major search engines are doing internally already. The only difference is that he provides a self-spidered index of the site to the remote engine, rather than having the engine spider his site.

This is not a bad start, and I think it's what Harvest is shooting for. Of course, there are some issues that need to be ironed out. Many people here have already noted the "trust" issue, so I won't go into that. However, limiting your searches to trusted sites can cause problems because it hinders discovery of new useful sites that have not received certification yet. Another problem is consistency. For starters, it makes it difficult to make any improvements to the search engine, since you need to propagate the new code out to all sites and wait for them to reindex before you can use any of the changes. You will then still have stragglers whose information is inconsistent with your main index and may affect the quality of search results. Sites will also be internally indexed at different times, making it impossible to make any guarantees about the freshness of your index (e.g., Google says it rebuilds its index every month). Indeed, there is nothing to guarantee that an internal index is maintained at all, and this could actually increase the number of dead links for your search.

Incidentally, there are some other approaches to improving the quality of web searches that I feel like mentioning. The simplest of these are the "popularity" metrics that engines like Google, Clever, DirectHit use in which they attempt to assess the popularity of pages and increase the relevance of more popular hits. Some groups are experimenting with improving on the inverted indexes used by such searches. Some groups working with Natural Language parsing or Latent Semantic Indexing have reported promising results, but it's unclear on how well their performance scales. Some sites (like our product) have followed a hybrid approach, limiting their indexing to classified sites within a specific domain (eg, business sites), to produce more specific results for users inside their domain.

Well, let me think about this a bit more...
