Types of Information Collections

The behavior of a search system depends on the kind of information the collection contains. Looking for a web site about Star Wars is not the same as looking for a book on Star Wars. It is not just that you are looking in different places; the difference derives from the different types of information involved. Most people on the web are probably familiar with looking up web pages: you go to a search engine such as Google or AltaVista, type in some words that you expect to appear on the page, and click the submit button, and the engine does its magic and returns what seems like several billion lines of results.

Document Collections

When you search against web pages, you are performing a search based on text. The search engine may take the string you typed in and use text retrieval logic to look for that exact sequence of characters in the web pages it has indexed, or it may process your string to derive words or word stems and then look for those values. Web searches usually let you specify the relationship between the words, such as their proximity to one another, as well as their required location on a page, such as whether they must be in the page title. Some web page search engines even try to interpret the meaning of your search terms.
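
The difference between exact-string matching and stem-based matching can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample pages, the function names, and especially the crude suffix-stripping stemmer, which stands in for the far more sophisticated logic a real search engine would use.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Hypothetical, very rough stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def exact_match(query, page):
    """True if the query appears as an exact character sequence in the body."""
    return query.lower() in page["body"].lower()

def stem_match(query, page, field="body"):
    """True if every stemmed query word occurs among the stemmed words of the field."""
    field_stems = {crude_stem(w) for w in tokenize(page[field])}
    return all(crude_stem(w) in field_stems for w in tokenize(query))

# Made-up sample pages, each with a title and a body.
pages = [
    {"title": "Star Wars fan site", "body": "News about the star wars films."},
    {"title": "Astronomy notes", "body": "A war among the stars is fiction."},
]

for page in pages:
    print(page["title"],
          exact_match("star wars", page),
          stem_match("star wars", page),
          stem_match("star wars", page, field="title"))
```

In this sketch the second page fails the exact match but passes the stem match, because "stars" and "war" reduce to the same stems as the query words. Passing field="title" restricts the match to the page title, which is one simple way to model a required location on the page.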

Library science deals extensively with strategies and methodologies for information retrieval within document collections, and a collection of web pages is just another type of document collection.

Many users employ search as a mode of navigation, rather than purely as a means of information retrieval. According to Jakob Nielsen,

Our usability studies show that more than half of all users are search-dominant, about a fifth of the users are link-dominant, and the rest exhibit mixed behavior. The search-dominant users will usually go straight for the search button when they enter a website: they are not interested in looking around the site; they are task-focused and want to find specific information as fast as possible.

This use of search for navigating a site's information space is a source of many criticisms of a site's usability, as shown by Jared Spool's findings:

Using an on-site search engine actually reduced the chances of success, and the difference was significant. Overall, users found the correct answer in 42% of the tests. When they used an on-site search engine (we did not study Internet search engines), their success rate was only 30%. In tasks where they used only links, however, users succeeded 53% of the time.

Product Catalogues

Product catalogues, on the other hand, are not document collections, and searching them requires a different understanding. Looking up book titles about Star Wars at an online book store is different from looking for Star Wars web sites because the collection of information about books, the product catalogue, is very different from a collection of web pages. Some of these differences include:

the architecture of the information storage
Product catalogues are best kept in databases, which allow for more efficient and accurate maintenance of product information and relationships. In addition, product information is an integral part of corporate data warehouses that track historical information on products, including sales and ordering patterns, and predict future trends.
scope of the catalogue
Whereas any web search engine will only have indexed a portion of existing web pages, any product catalogue is by definition a complete listing of every product in the particular system.
objects, not information
A product catalogue usually treats the products as objects, with each object having a set of characteristics and attributes. Moreover, a site that has a catalogue of products has the critical priority of making the products easy to find.
search against characteristics
When you search a product catalogue, you aren't searching the objects themselves; you are searching against their characteristics and attributes. If you go to an online clothing store and search for "khaki pants", you aren't searching against a collection of actual pants; you are searching against a set of fields that contain descriptions of products.
important information is not text searchable
Product information must help guide customers' purchasing decisions, so its emphasis is rarely on textual content, such as what a text _means_.
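
As a rough illustration of searching against characteristics rather than against the objects themselves, the following Python sketch models a tiny catalogue. The Product fields, the SKUs, and the search_by_attributes helper are all hypothetical; a real catalogue would normally live in a database and be queried there.

```python
from dataclasses import dataclass

@dataclass
class Product:
    """One catalogue entry: an object described by a fixed set of attributes."""
    sku: str
    category: str
    color: str
    description: str

# Made-up sample catalogue for a clothing store.
catalogue = [
    Product("P-001", "pants", "khaki", "Flat-front cotton trousers"),
    Product("P-002", "pants", "navy", "Pleated wool trousers"),
    Product("S-001", "shirt", "khaki", "Short-sleeve camp shirt"),
]

def search_by_attributes(products, **criteria):
    """Return the products whose fields equal every given attribute value."""
    return [
        p for p in products
        if all(getattr(p, name) == value for name, value in criteria.items())
    ]

# A search for "khaki pants" becomes a match on two attribute fields.
matches = search_by_attributes(catalogue, category="pants", color="khaki")
print([p.sku for p in matches])
```

Note that the free-text description field plays no part in this match: the khaki shirt and the navy pants are excluded purely on the basis of their structured attributes.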

Users are less likely to use a search against a product catalogue as a means to navigate the site; for example, my research of Borders.com's logs shows that users don't typically use a book keyword search form as a way to locate the site's help files or information on shipping. Users do use searches to locate categories of information, such as topic or subject sections.

The Convergence of Product and Document Collections

Digital products are blurring some of the distinctions between document collections and product catalogues. The observation that product characteristics -- and not the products themselves -- are searched against is less valid when the product itself is textual. For example, if a company sells text reports, and provides a mechanism to perform textual searches against the report itself, then the rules for textual information retrieval would seem to apply.

Many commerce sites have information about their products that goes beyond the basic product catalogue. For example, commerce sites may have reviews of products and articles on using them. A music seller may have articles about artists, articles about genres of music, comparisons of musical works, and so on. This information covers a middle ground between products and text documents, whether it is stored in documents or in a database.

Even if the site decides to link this product-related content closely with products, in effect making content such as reviews a characteristic or attribute of the product, it makes sense to make this content text-searchable.
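
One way to picture this is a product record that carries its reviews as an attribute, with a single search covering both the structured fields and the review text. The Python sketch below assumes a hypothetical Album record and made-up sample data; it is not meant to describe how any particular site implements this.

```python
from dataclasses import dataclass, field

@dataclass
class Album:
    """A product whose reviews are stored as one of its attributes."""
    title: str
    artist: str
    reviews: list = field(default_factory=list)

# Made-up sample data.
albums = [
    Album("Album A", "Artist X", ["A relaxed, modal jazz record."]),
    Album("Album B", "Artist Y", ["Loud guitars and grunge hooks."]),
]

def search_albums(items, term):
    """Return titles of albums matching the term in a field or in any review."""
    term = term.lower()
    hits = []
    for item in items:
        in_fields = term in item.title.lower() or term in item.artist.lower()
        in_reviews = any(term in review.lower() for review in item.reviews)
        if in_fields or in_reviews:
            hits.append(item.title)
    return hits

print(search_albums(albums, "jazz"))      # matched through review text only
print(search_albums(albums, "artist y"))  # matched through the artist field
```

A user who knows only that they want "jazz" can reach a product this way even though no structured field contains that word.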

Providing a mechanism for searching against content about products will aid users in tasks ranging beyond the simple "I'm looking for this product". The chances of successfully completing a constellation of tasks increase, as do the possible routes to specific product information.