Distributed search engines: search.minty, Dowser and Larbin

Tech 2004-12-22 10:23


search.minty.org: Open, Distributed Web Search

Dowser:
Dowser is a research tool for the web. It clusters results from major search engines, associates words that appear in previous searches, and keeps a local cache of all the results you click on in a searchable database. It helps you keep track of what you find on the web.

Larbin: crawler

Open, Distributed Web Search
Murray Walker, search@minty.org

Consider this an RFC (Request For Comments).

Please add your comments to the wiki.


--------------------------------------------------------------------------------

Abstract
This article discusses the concepts, goals and ideas for an open source, community-run (distributed) web search service that allows anyone to search, and thus navigate around the internet. A key tenet is impartiality and objectivity - avoiding bias, corruption and "spamming" of the service.

An outline of one possible technical architecture is presented. This does not cover the details of any specific algorithms or search techniques. No working code or implementation exists.


Introduction
The ability to find things on the internet, and particularly the world wide web, is often taken for granted by the vast majority of people who (often many times a day) are "searching the web".

Whilst there are a huge number of tools and technologies available to do this, a tiny handful dominate the market:


Google
Yahoo
Microsoft [1]
They, along with many other similar services, share a number of key features:


Their algorithms, or rules for deciding which results to display, are "closed" and often tightly guarded secrets.
They generate money, selling various forms of advertising around the search results.
They are commercial companies whose first priority is to generate value and profit for their shareholders.
There are no obligations or requirements for them to be objective, impartial, comprehensive or even accurate. And there is very little anyone can do to measure them against these metrics.
Given the power and influence these services have over everyone that uses them, it could be argued there is scope for abuse and at least the possibility for a conflict of interest.

This leads to the idea of an "open" [2] web search service that is free to use - both financially and (more importantly) in the sense that it can be used by anyone, for any purpose. This means that the technical code and information needed to run such a system are "open sourced", but also that the service is provided by the community of users without imposing any restrictions on how the service is used.

It is not known if this idea would work, or if it is simply a description of a hypothetical, impractical or impossible scenario. Hopefully by reading and adding to the debate, it will be possible to find out.

The rest of this article discusses some issues that such a project might encounter, and outlines very briefly one possible technical architecture that might be used to implement such a system.

If you would like to comment, you can do so on the accompanying wiki.


Assumptions and Scope

This article will discuss only a plain text web search service, including HTML files. This excludes features like image search for the sake of simplicity.
Many existing search services, including the "big three" mentioned above, have paid advertisements around the search results. They also all have separate, custom technology for delivering such advertisements [3]. The ability to do, or technology of, "ad serving" is not discussed here.
Custom search services, such as searching for books on Amazon or auctions on ebay, are out of scope. A generic web search service must be able to handle unstructured, plain text data.
The primary focus is a search service that requires no manual intervention or classification on a site-by-site or page-by-page basis. Namely, once it is set up and running, it is fully autonomous.
Advanced features of existing services, such as viewing cached versions of pages, spell checking, etc. are also out of scope.

Resources : Bandwidth, Disk space and CPU
Consider an index of around 4 billion items, roughly the size of Google's current web index. Taking an average of 30Kb per item or url (only looking at the plain text - including html source), it would take around 111 Terabytes of data being downloaded to consider each document exactly once.

Then consider updates, which could range from quarterly or monthly for resources infrequently used or rarely changed, to weekly, daily or even hourly in the case of very popular or highly dynamic sites, such as news sites.

It is not the goal, at least initially, to store complete copies of everything downloaded. Thus, if a 30Kb item could be reduced to 300 bytes of information, disk space requirements would be around 1000 Gigabytes. It is also possible that in order to produce fast results when a query is run, it would require storing more than 30Kb per item. If it was a ten fold increase, it would put disk storage requirements at close to 1 Petabyte.
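
These figures are simple to reproduce. The short Python sketch below merely restates the assumptions above (4 billion items, roughly 30Kb of text per item, 300 bytes per item for a compact index, or a ten-fold increase for a faster one); the constants are illustrative, not measurements of any real system.

    # Back-of-the-envelope estimate of the bandwidth and disk figures above.
    # All constants are the article's illustrative assumptions.
    ITEMS = 4_000_000_000             # size of the index (urls)
    PAGE_BYTES = 30 * 1024            # ~30Kb of html/plain text per url
    COMPACT_BYTES = 300               # per-item size of a compact index
    FAST_BYTES = 10 * PAGE_BYTES      # ten-fold increase per item

    GB, TB = 1024 ** 3, 1024 ** 4

    print(ITEMS * PAGE_BYTES / TB)     # ~111 TB downloaded for one full crawl
    print(ITEMS * COMPACT_BYTES / GB)  # ~1100 GB for a compact index
    print(ITEMS * FAST_BYTES / TB)     # ~1100 TB, i.e. close to 1 Petabyte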

CPU or raw processing power is even harder to predict, but it breaks down into two distinct areas. First, there is the task of crawling and indexing pages. This can be done "offline", in a batch approach. For the sake of argument, if a typical 2GHz Pentium 4(r) computer could process 100 urls per second, that computer could do just over 8.5 million urls in a day (assuming the data could be downloaded this fast). It would take that single computer roughly 15 months to process the 4 billion urls. A hundred such computers could do it in just under 5 days.

Then there is the task of returning answers to queries in real time. Again, for the sake of argument, assume our typical 2GHz Pentium computer could deliver 100 requests per second. Assuming it can scale close to linearly, the CPU resource required here would then become a function of how popular the service became.

Considering a worst case, where our benchmark machine is capable of crawling only 1 url per second and answering queries at a rate of 1 request per second, the service could easily require thousands of computers.
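
The same kind of back-of-the-envelope arithmetic gives the crawling and query-serving figures. The sketch below uses the illustrative rates quoted above (100 urls or requests per second per machine, dropping to 1 per second in the worst case); it is not a benchmark of any real hardware.

    import math

    ITEMS = 4_000_000_000
    SECONDS_PER_DAY = 24 * 60 * 60

    def crawl_days(urls_per_second, machines=1):
        """Days for one full pass over the index at the given per-machine rate."""
        return ITEMS / (urls_per_second * SECONDS_PER_DAY * machines)

    def query_machines(queries_per_second, per_machine_rate):
        """Machines needed to answer queries in real time, assuming linear scaling."""
        return math.ceil(queries_per_second / per_machine_rate)

    print(crawl_days(100))                # ~463 days (roughly 15 months) on one machine
    print(crawl_days(100, machines=100))  # just under 5 days on a hundred machines
    print(crawl_days(1, machines=1000))   # worst case: ~46 days even with 1000 machines
    print(query_machines(500, 1))         # worst case: 500 machines for 500 queries/sec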

It is clear that the resources required are considerable.

Historically, approaches to this problem used a small number of massively powerful computers (the Mainframe approach). More recently, the approach used by Google, Yahoo and most likely Microsoft involves many hundreds or thousands of "commodity" computers (cheap, standard, generic machines) working with some degree of parallelism and using the network to communicate.

Taking this model one step further, it would appear that a distributed, "peer-to-peer" model could potentially be used to provide these resources. There is the potential for hundreds of thousands of computers, distributed across the internet, all working together.
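
Purely as an illustration of how such a peer-to-peer network might divide the work, the sketch below hashes each url onto a ring of participating peers, so that every peer owns one slice of the crawl and index. The peer names and helper functions are hypothetical; this is not how any existing network is implemented.

    import hashlib
    from bisect import bisect

    def ring_position(key):
        """Map an arbitrary string to a point on a 0..2^32 hash ring."""
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % 2**32

    class Ring:
        """Consistent-hashing ring assigning urls to peers."""
        def __init__(self, peers):
            self._points = sorted((ring_position(p), p) for p in peers)

        def peer_for(self, url):
            """Return the peer responsible for crawling and indexing this url."""
            keys = [pos for pos, _ in self._points]
            i = bisect(keys, ring_position(url)) % len(self._points)
            return self._points[i][1]

    ring = Ring(["peer-a.example", "peer-b.example", "peer-c.example"])
    print(ring.peer_for("http://www.example.org/index.html"))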

Existing (working) examples of peer-to-peer networks:


Traditional file-sharing networks such as Kazaa, which already allow millions of users to search across millions of files, representing terabytes of data, and then download those files simultaneously from multiple sources.
Internet Telephony networks, like Skype, which use peer-to-peer networks to route calls for free around the world in real time.
Less traditional (newer?) file-sharing networks, such as BitTorrent, which implements an "open" service similar to the one traditionally offered by Akamai. Also Freenet, which implements an entirely decentralised hosting network where both publishers and consumers are anonymous.

Spamming, corruption and abuse
Because it will drive more people to a web site, many people and organisations go to considerable lengths and cost to try and ensure their sites appear as high as possible in search results. This is commonly referred to as "Search Engine Optimisation".

An example might be trying to ensure your web site appears above those of your competitors - even when the search query was for your competitor's name or product.

However this is only the tip of the iceberg when it comes to bias and corruption in search results. For instance, a technique often referred to as "Google Bombing" resulted in Microsoft's website appearing at the top of the results when someone searched for "more evil than satan himself" towards the end of 1999.

And then, of course, there are laws, lawyers, legal systems and governments. Consider libel and propaganda, to name but two.

There are a number of ways to address these problems, which are typically coded into the algorithms and rules used by the search service and are often highly guarded secrets. Although the specific implementation details are not public knowledge, many of the concepts are generally well understood and documented. PageRank(tm), as used by Google, is a good example of this. Various forms of hashing used to identify similar or near-duplicate documents are another.
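
The article does not say which hashing scheme it has in mind. Purely as a generic example, the sketch below uses word "shingling" and Jaccard similarity, a textbook technique for spotting near-duplicate documents; it is not the method used by Google or any other particular engine.

    def shingles(text, k=4):
        """Hashes of every k-word window in the document."""
        words = text.lower().split()
        return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

    def similarity(a, b):
        """Jaccard similarity of two documents' shingle sets (0.0 to 1.0)."""
        sa, sb = shingles(a), shingles(b)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    # Pages whose similarity is close to 1.0 can be collapsed into one result.
    print(similarity("the quick brown fox jumps over the lazy dog",
                     "the quick brown fox jumped over the lazy dog"))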

Discussion of these techniques, while a critical and very complex part of any web search solution, is beyond the scope of this document. There are, however, two problems specific to a distributed
