Shawn P. McCarthy

COMMENTARY

Why you should know the difference between search tools and discovery tools

Search, information discovery and e-discovery seek and display information in different ways

Government information technology workers might have heard the following three phrases used interchangeably: search tools, information discovery tools and e-discovery tools.

Depending on your definition, there is some overlap among the concepts. But there also are significant differences. It’s important to understand those differences, some subtle and some not, especially as government agencies enter more information into sprawling storage and data archiving systems.

All three terms relate to seeking information across multiple data archives. But the three concepts are differentiated by the way searches are conducted and the presentation of results.

Search tools. This term often is used generically to refer to many types of internal or external search engines, directories and information archives. Most search tools are designed to interact with a computer program (often a crawler, spider, indexing bot or similar system) that was created to retrieve documents or data. The crawler and its associated search tools can be set up to interact with one specific database, a set of databases, a single computer network or even the full Internet. Searches with such tools often are based on a keyword, a set of keywords or a phrase contained in one of the files that the crawler indexed.
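As a rough illustration of keyword search over indexed files, here is a minimal sketch in Python. The documents, the function names and the simple word-splitting rule are all assumptions made for this example; a production search engine does far more.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "rings." matches "rings".
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents standing in for files a crawler retrieved.
DOCS = {
    "doc1": "Saturn is a gas giant with rings.",
    "doc2": "The Saturn car brand sold sedans.",
    "doc3": "Jupiter is the largest gas giant.",
}
INDEX = build_index(DOCS)
```

Calling `keyword_search(INDEX, "saturn")` returns both the planet document and the car document, which is exactly the ambiguity a bare keyword cannot resolve.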

Doing a simple keyword search can be useful, unless there is ambiguity about the meaning of the term. For example, if you search for the word "Saturn," do you mean the planet, car, rocket, or old Sega Saturn game console?

To help resolve ambiguity, some search engines also collect information from a file's metadata fields. Metadata can be useful for setting the context of a keyword. If metadata indicates that a file contains information about the solar system and planets, a good search engine would assume that any matching keywords in that file refer to Saturn the planet, not Saturn the car brand.
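One way to picture metadata setting the context of a keyword is the sketch below. The file records, the `subject` field and the function name are hypothetical, chosen only to show the idea.

```python
# Hypothetical indexed files: each has text plus a metadata "subject" field.
FILES = [
    {"name": "rings.txt", "subject": "solar system",
     "text": "Saturn has bright rings."},
    {"name": "sedan.txt", "subject": "automobiles",
     "text": "The Saturn sedan seats five."},
    {"name": "console.txt", "subject": "video games",
     "text": "The Sega Saturn is a game console."},
]

def search_with_subject(files, keyword, subject=None):
    """Return names of files whose text contains the keyword.

    When a subject is given, the file's metadata must also match it.
    """
    keyword = keyword.lower()
    hits = []
    for f in files:
        if keyword not in f["text"].lower():
            continue
        if subject is not None and f["subject"] != subject:
            continue
        hits.append(f["name"])
    return hits
```

Without a subject, a search for "Saturn" matches all three files. With `subject="solar system"`, only the file about the planet remains.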

But what if someone searching for Saturn the planet doesn’t remember the name of the planet? Or what if they are looking for information about planets in general, and they simply enter the name Saturn as one example? What they really need is more guidance built into their search results.

Information discovery tools. Some types of information discovery tools are simply multiple search results presented in a logical way to help users make additional choices. Some of the results are just interfaces to secondary search tools, arranged to help guide an evolving search.

A basic example of a discovery tool is the “Did you mean” feature that Google presents if you misspell a search term. Besides executing a keyword search, Google's search system also looks through a database of common misspellings. If it finds a match, the search results page helps you discover a correct spelling. But it doesn’t automatically assume you meant the correct spelling, so it still offers keyword matches for the misspelled version of your word.
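The behavior described above can be sketched in a few lines of Python. This is not Google's implementation; the misspelling table, the documents and the function name are assumptions for illustration. The point is that the search runs on the term as typed, and the correction is offered alongside rather than substituted.

```python
# Hypothetical table of common misspellings; a real one would be far larger.
COMMON_MISSPELLINGS = {"satrun": "saturn", "jupitor": "jupiter"}

def search_with_suggestion(term, documents):
    """Search for the term as typed and, separately, suggest a correction."""
    typed = term.lower()
    matches = [d for d in documents if typed in d.lower()]
    # None when the typed term is not a known misspelling.
    suggestion = COMMON_MISSPELLINGS.get(typed)
    return {"matches": matches, "did_you_mean": suggestion}
```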

Discovery tools can help refine your search or ask questions to help you make additional search decisions. Two excellent examples are the Recent Activity boxes on eBay and the “People who bought this book also bought” links on Amazon.com or Barnes and Noble's Web site. By tapping other databases and not just their own index of keywords and matches, those sites make fairly accurate predictions about other things that you might be looking for.
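A simple way such "also bought" links could be produced is by counting which items appear in the same orders. The sketch below assumes a made-up order history and a hypothetical `also_bought` function; the retailers' actual methods are not described in this article.

```python
from collections import Counter

def also_bought(orders, item, top_n=3):
    """Rank other items by how often they share an order with the given item."""
    counts = Counter()
    for order in orders:
        if item in order:
            counts.update(other for other in set(order) if other != item)
    # Sort by count (highest first), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical order history.
ORDERS = [
    ["planets guide", "telescope", "star chart"],
    ["planets guide", "telescope"],
    ["planets guide", "star chart", "red flashlight"],
    ["cookbook", "telescope"],
]
```

The recommendation comes from the purchase database, not from the keyword index, which is what separates it from plain search.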

Information discovery should not be confused with semantics. In general, semantics means identifying the meaning of a word or phrase, and the Semantic Web efforts championed by the World Wide Web Consortium have made great strides in helping people understand this issue.

But the semantic approach is not a perfect solution when the people doing the searching don't know the specifics of what they are looking for, much less the exact word. Thus information discovery comes down to three things: available paths, context and pattern matching.

Available paths can be represented through additional line items that offer parallel choices from other databases. This is similar to Google’s “Did you mean” choice but significantly expanded to many different conduits of information. When a good information discovery interface is used to search for Saturn, you might receive a straight set of search results that is complemented by other options. Sometimes, they are presented as small search results boxes with two or three matching choices, plus a link that will take you down that particular result’s path. Other sets of search results might include links from a database on the solar system, a few documents on gas giants, a handful of pictures of planets with rings, and so on. The results help you discover other paths and encourage you to refine your choices. Following one of those paths in turn takes you to other search tools and resources.
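The small results boxes described above can be sketched as one query sent to several sources, with a few matches kept from each. The source names and contents below are hypothetical, and the sketch uses plain keyword matching per source, which is simpler than the related-topic links a real discovery interface might offer.

```python
def discover_paths(query, sources, per_source=2):
    """Return up to per_source matches from each source, keyed by source name."""
    q = query.lower()
    paths = {}
    for name, items in sources.items():
        matches = [item for item in items if q in item.lower()]
        if matches:
            # Keep only a few entries; the source name is the path to follow.
            paths[name] = matches[:per_source]
    return paths

# Hypothetical databases behind the discovery interface.
SOURCES = {
    "solar system database": [
        "Saturn: orbital data", "Saturn: moons",
        "Saturn: rings", "Mars: orbital data",
    ],
    "image archive": ["Photo of Saturn and its rings", "Photo of Jupiter"],
    "vehicle registry": ["Ford F-150", "Saturn Ion"],
}
```

Each key in the result is a path the user can choose to follow, and each value is the short preview shown in that path's box.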

Context comes into play when the search system already knows one thing about you or your search. It uses that knowledge to limit search results based on what it already knows. One great example of context is location. If you use your mobile device to search for, say, gas stations, the context can be limited to a 10-mile radius of your location. This function lets you discover nearby resources. Likewise, you also can find restaurants, ATMs or known criminals.
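Location as context amounts to a distance filter applied to the results. The sketch below uses the standard haversine formula for great-circle distance; the station names and coordinates are made up for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius of the Earth

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearby(places, here, radius_miles=10):
    """Keep only the places within radius_miles of the searcher's location."""
    lat, lon = here
    return [p["name"] for p in places
            if distance_miles(lat, lon, p["lat"], p["lon"]) <= radius_miles]

# Hypothetical gas stations with made-up coordinates.
STATIONS = [
    {"name": "Station A", "lat": 38.90, "lon": -77.04},
    {"name": "Station B", "lat": 38.95, "lon": -77.10},
    {"name": "Station C", "lat": 39.29, "lon": -76.61},
]
```

With the searcher at `(38.90, -77.04)`, the default 10-mile radius keeps Stations A and B and drops Station C, which is more than 30 miles away.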

By limiting our example to, say, police needs, the various pieces might combine this way: Your context is where you are. Your search is for people who own blue Ford trucks. Your discovery tools present the various paths that have been enabled for you in the search results. Possible examples include pull-down menus that filter by the age of the truck, by whether the owner lives in an apartment building, by the age of the owner, and so on. Truly flexible discovery tools let you follow one path and then adjust settings without needing to start your search over again, such as expanding your search to 20 miles or limiting results to just Ford F-150 pickups.
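Adjusting settings without starting over can be sketched as filters applied to an existing result set. The records, field names and `refine` function below are hypothetical, invented only to mirror the truck example.

```python
def refine(records, **filters):
    """Return records matching every filter.

    Each filter value is either an exact value or a predicate function.
    """
    def matches(record):
        for field, wanted in filters.items():
            value = record.get(field)
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True
    return [r for r in records if matches(r)]

# Hypothetical registration records with made-up values.
TRUCKS = [
    {"owner": "A", "make": "Ford", "model": "F-150", "color": "blue", "miles_away": 4},
    {"owner": "B", "make": "Ford", "model": "Ranger", "color": "blue", "miles_away": 15},
    {"owner": "C", "make": "Ford", "model": "F-150", "color": "red", "miles_away": 2},
]

# The original search: blue Ford trucks.
base = refine(TRUCKS, make="Ford", color="blue")
# Each adjustment reuses the base results; nothing is searched again.
within_10 = refine(base, miles_away=lambda d: d <= 10)
within_20 = refine(base, miles_away=lambda d: d <= 20)
f150_only = refine(within_20, model="F-150")
```

Because each refinement returns a new list and leaves `base` untouched, the user can widen the radius or narrow the model in any order.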

Pattern matching applies if the discovery tools also recommend links that other people think are useful.

E-discovery. This is a different concept from the two terms we just reviewed. It can involve search tools, but e-discovery usually refers to a discovery process related to court cases, in which someone searches for information stored electronically. Information that might be relevant as evidence in a lawsuit includes e-mail; instant messages; logs from online chat rooms; stored electronic documents of all types, including older versions of files; databases, including research, product information, and accounting or finance databases; Web sites; and even raw data files. Because litigators might need to review e-discovery materials in a number of ways, it's not unusual for discovered information to be saved in multiple formats.

E-discovery tools often exist as specific applications, and they are popular with people who manage large archives of government information. In late 2009, EMC acquired e-discovery vendor Kazeon. With this addition, EMC offers a set of e-discovery and litigation readiness applications.

Understanding the differences among searching, information discovery and e-discovery can help government employees choose and use the right tool for the task. That goes a long way toward helping people find the right information at the right time to do their jobs.

Reader Comments

Mon, May 10, 2010 Johannes Scholtes McLean

I fully agree with Shawn. I would add that it is also very relevant to be able to explain in court the search tools you used (either yourself or through your legal counsel). More on the different faces of search can be found here: http://zylab.wordpress.com/2010/04/28/how-to-find-more/ and here: http://zylab.wordpress.com/2010/03/26/understand-the-two-different-faces-of-search-exploratory-search-and-focalized-search/ which basically share the same message as this post.

Sun, May 2, 2010

Great article outlining the differences. One other critical point is that e-discovery search must provide capabilities that ensure transparency and defensibility in court.

Thu, Apr 29, 2010

One emerging field of research related to all three is the field of multi-document summarization systems. These systems produce annotated documents and reports summarizing the topic entered by the searcher, vs. SERP lists produced by today's search technology. Examples currently on the web include Ultimate-Research-Assistant.com, Shablast, and Newsblaster.
