Search Engine Myths
Web search engines are shrouded in hyperbole and unsubstantiated claims. This article attempts to distinguish between search engine reality and myth.
One common myth about web search engines is that the number of hosts or the number of web pages in an index is a good measure of its comprehensiveness. Neither is a good indicator, for several reasons; I'll briefly describe a few. The number of hosts says nothing about how many pages on each host are indexed. The number of pages is inherently flawed because it is currently impossible to determine the number of unique normalized URLs: the same file can often be reached by more than one path, since file systems are not acyclic. Another problem is that CGI programs can generate an unlimited number of unique HTML pages. Yet another difficult, if not unsolvable, problem is that the WWW is dynamic: pages are created and destroyed continuously. Measuring the size of the web might be as difficult as measuring the size of the universe.
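
To make the normalization problem concrete, here is a minimal sketch in Python of purely syntactic URL normalization. The function name `normalize_url` and the example.com URLs are hypothetical, chosen only for illustration; this is not a description of how any particular search engine counts pages. The sketch lowercases the scheme and host, drops default ports and fragments, and collapses `.` and `..` path segments.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource addressed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a syntactically normalized form of url (illustrative only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Collapse "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash, so put it back if the original had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"

    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


# Hypothetical URLs. The first two differ only in spelling; the third uses a
# different path that the server might still map to the same file.
urls = [
    "HTTP://Example.COM:80/docs/../docs/./index.html#top",
    "http://example.com/docs/index.html",
    "http://example.com/alias/index.html",
]
unique = {normalize_url(u) for u in urls}
print(len(unique), sorted(unique))
```

The first two URLs collapse to one normalized form, but the third remains distinct. If `/alias` happens to be a link to `/docs` on the server, nothing in the URL string reveals it, which is why a count of "unique" URLs produced this way can still overstate the number of unique pages.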

Regarding the comprehensiveness of the web search engines, the content of the web is composed of text, images, sound and video. It's discouraging to realize that none of the current search engines allows the user to search on anything but text. Thus, the vast majority of the WWW is completely ignored. Within the universe of text, most popular search engines let the user search the web and the newsgroups. Some search engines, such as Infoseek, also let you search current news wires, which gives you up-to-date information on world news.

In summary, no search engine can claim to search the entire WWW, nor can any search engine even know how many unique pages it covers. The numbers they list are popularly believed to be lower bounds, but even this is not necessarily true. A search engine can assert that it covers a certain number of URLs, but it can never be certain that these URLs are unique, due at least in part to the fact that the local file system on the remote server may provide multiple links to the same file.

