Internet Search Engines

The major search services on the Internet are essential starting points for users seeking information, and as such they are routinely among the most visited locations on the Web. Search services can be divided into two groups: commercial and non-commercial. Commercial search services go to the effort of cataloguing information on the Internet in order to attract attention and advertising revenue; non-commercial services exist for many different reasons. A related industry, search engine optimisation, has grown up around making sites more visible to these services for online advertising and marketing purposes.

There are more than 2,500 search services presently on the Web, a dozen or more of them big, major Internet search services. There are also ‘metasearch’ services that provide a central access point to several of these services.

Search services on the Internet come in two main flavours: ‘search engines’ that index words or terms in Internet documents; and ‘directories’ that classify Web documents or locations into an arbitrary subject classification scheme or taxonomy. Most of the major services are examples of the former; Yahoo and LookSmart are examples of the latter.

Search engines use ‘spiders’ or ‘robots’ to go out and retrieve individual Web pages or documents, either because they have found them themselves or because a Web site has asked to be listed. Search engines tend to “index” (record by word) all of the terms on a given Web document, or they may index only the terms within the first few sentences, the Web site title, or the document’s metatags. Because of the ever-changing nature of the Internet, the services must re-sample their sites periodically; some re-sample only weekly or even less frequently.
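The indexing step described above, recording each word against the documents it appears in, can be sketched as a simple inverted index. The page names and text below are made-up illustrations, not any real engine's data:

```python
from collections import defaultdict

def build_index(pages):
    """Map every lowercased word to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical fetched pages (a real spider would retrieve these over HTTP).
pages = {
    "a.html": "Bald eagles nest along the river",
    "b.html": "The river floods every spring",
}
index = build_index(pages)
print(sorted(index["river"]))  # ['a.html', 'b.html']
```

A real engine would also strip markup and punctuation, and would re-crawl pages periodically to keep the index current, as the text notes.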

Precision, recall and coverage are limiting factors for most search engines. Precision measures how well the retrieved documents match the query; recall measures what fraction of relevant documents are retrieved. Coverage refers to what percentage of the potential universe of relevant documents is catalogued by the engine. For example, consider a search engine that has indexed 10 documents, five of which mention eagles, out of a total universe of 50 documents mentioning eagles (45 of which are not indexed by that engine). A query on eagle that returned four of those relevant documents plus two others would have a precision of 0.67 (4/6), a recall of 0.80 (4/5) and coverage of 0.10 (5/50).
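The arithmetic in the worked example can be made explicit. The numbers below come from the text; the function names are just illustrative labels for the three ratios:

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, relevant_indexed):
    """Fraction of the engine's relevant documents that were retrieved."""
    return relevant_retrieved / relevant_indexed

def coverage(relevant_indexed, relevant_universe):
    """Fraction of all relevant documents the engine has indexed."""
    return relevant_indexed / relevant_universe

# The example: 4 relevant hits among 6 results; 5 relevant documents
# indexed; 50 relevant documents exist in total.
print(round(precision(4, 6), 2))   # 0.67
print(round(recall(4, 5), 2))      # 0.8
print(round(coverage(5, 50), 2))   # 0.1
```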

Precision is a problem because of the high incidence of false positives. (That is why you get so many seemingly irrelevant documents in your searches.) Contributing factors include mismatches between query and document terms (searching on eagle when the document says only eagles), indexing mistakes by the engine, and keywords entered by the Web document developer that do not actually appear in the document. Coverage is a problem for all engines: even the largest cover at most one sixth to one third of publicly available documents.
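The eagle/eagles mismatch mentioned above is the kind of problem that word stemming addresses. The crude suffix-stripping "stemmer" below is a made-up illustration of the idea, not any real engine's algorithm:

```python
def stem(word):
    """Naive illustrative stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

# Hypothetical document containing "eagles" but not "eagle".
doc_words = ["bald", "eagles", "soar"]

exact_hit = "eagle" in doc_words                     # exact match misses
stemmed_hit = "eagle" in [stem(w) for w in doc_words]  # stemmed match hits
print(exact_hit, stemmed_hit)  # False True
```

Real stemmers (e.g. the Porter algorithm) handle far more suffix patterns, and over-aggressive stemming creates false positives of its own.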

Search directories operate on a different principle. They require people to view each individual Web site and determine its placement within a subject classification scheme or taxonomy. Once classified, keywords associated with those sites can be used to search the directory’s data banks for Web sites of interest.

These distinctions by search service are not clean in all cases. The Excite search engine, for example, uses ‘morphological analysis’ for determining its keyword matches. While construction of the index is more akin to a search engine, in operation Excite can work like a directory. As other search engines begin classifying information into directory-like clusters, these distinctions are likely to continue to get fuzzier.

For searches that are easily classified, such as vendors of sunglasses, the search directories tend to provide the most consistent and well-clustered results. This advantage is generally limited to those classification areas already present in the service’s taxonomy. Yahoo, for example, has about 2,000 classifications in its current taxonomy (excluding what it calls ‘Regional’ ones, which duplicate the major classification areas by geographic region). When a given classification level reaches 1,000 site listings or so, the Yahoo staff split the category into subcategories. If a given topic area has not been specifically classified by a directory, finding related information on that topic is more difficult. Another disadvantage of directories is their limited coverage, a consequence of the cost and time of assigning sites to categories by hand.
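The splitting policy described for Yahoo can be sketched as a simple data structure with a listing threshold. The class, category names, and threshold handling below are illustrative assumptions, not Yahoo's actual system:

```python
SPLIT_THRESHOLD = 1000  # roughly the point at which editors split a category

class Category:
    """A node in a directory taxonomy: sites plus optional subcategories."""
    def __init__(self, name):
        self.name = name
        self.sites = []
        self.subcategories = {}

    def needs_split(self):
        # Editors would then move listings into new subcategories by hand.
        return len(self.sites) >= SPLIT_THRESHOLD

# Hypothetical category that has just reached the threshold.
cat = Category("Shopping/Sunglasses")
cat.sites = [f"site{i}.example.com" for i in range(1000)]
print(cat.needs_split())  # True
```

The manual step, deciding which subcategories to create and reassigning each site, is exactly the cost-and-time burden the text identifies as limiting directory coverage.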

Most searches of a research or cross-cutting nature tend to be better served by the search engines. That is because there is no classification structure behind the listings; what matters is only whether the requested keywords appear in that search engine’s index database.

The flexibility of indexing every word to give users complete search control, such as that provided by AltaVista or OpenText, is now creating a different kind of problem: too many results. In the worst cases, submitting broad query terms to such engines can identify literally millions of potential documents. Since the user is limited to viewing potential sites one by one, too many results can clearly be a greater problem than too few.

Increasingly, the growth of the Internet is causing the specialization or balkanization of search services. Lawyers, astronomers or investors, for example, may want information specifically focused on their topics of interest. By cataloguing information in only those areas, such services help interested users keep their search results bounded. Such specialization can also lead to more targeted advertising on those search service sites. Again, though, like the directories, specialization can limit search results to the boundaries chosen by the service, which may or may not conform to the boundaries sought by the user.

The ultimate challenges to any of these centralized search services, therefore, are to: 1) keep pace with explosive document growth; 2) understand the “boundary” needs of their user communities; 3) provide sufficient “intelligence” to infer what users are really asking for even when their queries don’t specify it; and 4) ensure sufficient coverage to provide one-stop searching. In the race for eyeballs, user retention and repeat visits are key.