Search and enjoy

Search companies try new techniques for understanding subtle distinctions within gigantic piles of data

Search strategies

How does a search engine pick the items it chooses to present to you when you type in a search? Most search engines are built with a number of basic techniques and trade-offs in mind.

Term-based ranking

Most search engines use a mixture of term frequency and inverse document frequency. Both are simple mathematical operations. Term frequency assumes that the documents containing the most instances of a query term are the most relevant to the user. Inverse document frequency works from the opposite assumption: the terms that appear least often across the collection are the best indicators of a document's relevance.

Page rank/citation analysis

Pioneered by Sergey Brin and Lawrence Page, who later built Google from this technique, PageRank judges the relevance of Web pages by how many other Web pages link to that page. Although useful on the Web, the approach has limited effectiveness in most enterprise search, where linking is not common. The Google founders cribbed the idea from citation analysis, a technique for evaluating which academic papers are most important by counting the number of citations they garnered in subsequent papers.

Recall vs. precision

Most search engines strike a balance between recall and precision. Recall is the percentage of all relevant documents in the collection that a search returns. Precision is the percentage of returned documents that are relevant to the user. In most cases, the two pull against each other: a 100 percent recall rate could overwhelm users with results, reducing the precision rate, while an overly high precision rate could leave potentially useful documents hidden.
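The two measures can be made concrete with a little arithmetic. The following is a minimal sketch in Python; the document IDs and the ten-document collection are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: IDs of the documents the engine returned
    relevant: IDs of the documents that are actually relevant
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 1-10, of which 1-4 are relevant.
relevant_docs = {1, 2, 3, 4}

# A narrow search returns only two documents, both relevant.
narrow = precision_recall({1, 2}, relevant_docs)

# A broad search returns the whole collection.
broad = precision_recall(set(range(1, 11)), relevant_docs)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.4, 1.0)
```

The narrow search is perfectly precise but finds only half of the relevant documents. The broad search finds all of them but buries them among irrelevant ones.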

Mike Bentley

'The brain is trained on keywords, and the most difficult part of setting this up is to start getting people to think in entire questions.' (Constance Werner Ramirez, Federal Preservation Institute)

When the National Park Service set up the electronic clearinghouse for historic-preservation information, one of the toughest challenges was not technical but rather behavioral.

'The brain is trained on keywords, and the most difficult part of setting this up is to stop people from thinking in keywords and start getting them to think the way we used to think, in entire questions,' said Constance Werner Ramirez, director of the Federal Preservation Institute.

Using the search engine embedded in the Web portal (GCN.com/764), users really could type in a full sentence and get a better result. 'How to clean mold from books and photographs?' will lead to many results from different agencies, universities and other organizations. 'You want to give this system a lot to work with,' Ramirez said.

Search engines are getting better, slowly but surely. High-end software, such as the software from Autonomy that runs the Historic Preservation portal, is making headway in offering users results they can actually use.

'Basic search has remained unchanged since the mid-1960s,' noted search consultant Steve Arnold, who spoke at the Gilbane Conference on Content Technologies for Government held in Washington last year. And users are starting to notice the limits of this older technology.

'When you're doing search in the enterprise, the basic search tools make finding information pretty darn hard,' he said. Make your search too wide, and you get an ocean of results with little way of finding the good stuff. Yet cast your search too narrowly, and you won't get any hits at all. And even the best basic search approaches can miss 20 percent to 35 percent of the information out there, Arnold said.

Nonetheless, search engine companies are layering new techniques on top of the basic search algorithms to bring more coherence to search results. And some of that work looks promising.

The basics and beyond

The basics of search are rather simple. At the most rudimentary level, the search engine returns all the documents that contain the query phrase. This approach is called term matching.

Databases allow for more flexibility in this approach, because all the material in a database is structured. Each data element is mapped to a predefined field. As a result, a query against a set of structured data can be more elaborate, allowing the user to hone the query logically using a precise combination of fields. 'SELECT name, city FROM SampleTable ORDER BY name,' for example, asks the database system to return the name and city of every record in SampleTable, sorted alphabetically by name.
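Here is a minimal sketch of how such a structured query behaves, using Python's built-in sqlite3 module. Only the table name and the SELECT statement come from the example above; the rows and the second, filtered query are invented for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SampleTable (name TEXT, city TEXT)")

# Hypothetical rows, deliberately inserted out of alphabetical order.
conn.executemany(
    "INSERT INTO SampleTable VALUES (?, ?)",
    [("Clark", "Denver"), ("Adams", "Boston"), ("Baker", "Denver")],
)

# The query from the text: two fields from every row, sorted by name.
rows = conn.execute(
    "SELECT name, city FROM SampleTable ORDER BY name"
).fetchall()

# Because each value sits in a known field, the query can be narrowed by field.
denver = conn.execute(
    "SELECT name FROM SampleTable WHERE city = ? ORDER BY name", ("Denver",)
).fetchall()

print(rows)    # [('Adams', 'Boston'), ('Baker', 'Denver'), ('Clark', 'Denver')]
print(denver)  # [('Baker',), ('Clark',)]
conn.close()
```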

Making sense of unstructured documents (the fancy name for all the word processing documents, spreadsheets and Web pages that make up the vast majority of an organization's content) is a far more difficult task. The problem? A computer program won't know the relationships among all the words in a given document, or across a given set of documents. The search engine simply has to scan and index all the terms in all the documents under its purview and then offer pointers to those documents with the query term.
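The scan-and-index step described above is commonly implemented as an inverted index, a map from each term to the documents that contain it. The following Python sketch assumes a toy tokenizer and three invented documents; it is not any particular vendor's implementation.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents.
docs = {
    "memo.doc": "How to clean mold from books",
    "guide.html": "Cleaning photographs and books safely",
    "budget.xls": "Annual budget for the clean water program",
}
index = build_index(docs)

print(search(index, "clean books"))  # {'memo.doc'}
print(search(index, "cleaning"))     # {'guide.html'}
```

Note that the query 'clean books' misses guide.html, which is clearly on topic but uses the word 'cleaning.' That is the limitation of literal term matching that the techniques below try to address.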

Some headway has been made in the past few decades to make sense of these large indexes of words.

There are two different approaches to refining searches, said Michael Lynch, president and founder of Autonomy. One approach is linguistic, which means the software tries to look at the relationships among the words in documents to infer some meaning about the words themselves. The other approach is probabilistic, which just looks at any statistical trends that can be mined from the documents as indicators of importance.

Linguistic matching is the most ambitious but, so far, not the most successful.

'Vast amounts of research [on linguistic search] has been done on this over the years, and it generally hasn't worked,' Lynch said. 'You haven't seen many commercial uses of it yet.'

He said the downside to this approach is that the rules of language are not absolute, and there is a lot of meaning lost in semantic ambiguity. Take a sentence such as 'The dog walked into the room. It was furry.' Most people would assume that the dog was furry, but a computer, using strict rules of interpretation, would assume the room was furry.

The linguistic approach does work well in environments where the scope of the search is quite small, and people tend to ask similar questions. 'The linguistic effort can be very good when you know what the question will be, because you can put up the standard answers,' Lynch said.

Programs have had more success by disregarding semantics and focusing on mathematical techniques.

One of the oldest approaches in this latter category is looking at term frequency and inverse document frequency, noted James Melzer, an information architect at SRA International. Term frequency simply counts the number of times a term appears in a document. The more often it appears, the more likely the document is about that term.

But you can also derive significance from the opposite direction: the fewer documents a term appears in, the more likely it is that the term represents what is unique about the documents that do contain it. That's called inverse document frequency. Most search companies today use a mix of those two approaches, Melzer said.
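One common textbook way to mix the two signals is to multiply a term's count in a document by the logarithm of how rare the term is across the collection. The Python sketch below uses that formula, tf * log(N / df), as an assumption. Commercial engines use many variants, and the documents here are invented.

```python
import math
from collections import Counter


def tf_idf_scores(docs, term):
    """Score every document for one term as (term count) * log(N / df)."""
    term = term.lower()
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: how many documents contain the term at all.
    doc_freq = sum(1 for words in tokenized.values() if term in words)
    if doc_freq == 0:
        return {doc_id: 0.0 for doc_id in tokenized}
    idf = math.log(n_docs / doc_freq)
    return {
        doc_id: Counter(words)[term] * idf
        for doc_id, words in tokenized.items()
    }


# Hypothetical collection.
docs = {
    "a": "mold mold books",
    "b": "books photographs",
    "c": "books budget",
}

mold_scores = tf_idf_scores(docs, "mold")
books_scores = tf_idf_scores(docs, "books")
```

In this toy collection, 'books' appears in every document, so its rarity weight is log(3/3) = 0 and it cannot distinguish one document from another. 'Mold' appears twice in document "a" and nowhere else, so "a" is the only document with a positive score for that term.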

Next steps

A more advanced approach that builds on these basic techniques involves clustering documents that appear to be similar, based on the terms contained within the documents. Here, documents with many overlapping terms are grouped together.

This approach allows the search to break free of literal term matching, as it relates documents that are similar but do not have a complete overlap of terms. A search for the word 'computer' will return documents that may involve networking, the Internet or some other topic closely tied to computers, even if they don't mention the word 'computer.' A variety of algorithms, such as Vector Space Model and Latent Semantic Indexing, can perform this function, using differing approaches.
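As a rough sketch of the Vector Space Model idea, each document can be treated as a vector of word counts, and two documents compared by the cosine of the angle between their vectors. The Python below assumes raw counts and whitespace tokenization and uses invented documents. It does not attempt Latent Semantic Indexing, which requires a matrix decomposition.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents. Only the first one mentions "computer".
doc1 = "computer network router internet"
doc2 = "network router internet protocol"
doc3 = "mold books photographs"

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)

print(sim_12)  # 0.75
print(sim_13)  # 0.0
```

Because the second document shares most of its terms with the first, a clustering step built on this measure would group them together, and a search that matches the first could surface the second even though it never uses the query word.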

'Each approach builds a mathematical presentation of each document but also aggregates [documents] into clusters,' Melzer said. In some cases, the user is not aware this is going on; the clusters just serve to shape the list of results that end up on the screen. In other cases, the clusters are presented to the users.

For example, do a search on 'Bunker Hill' on the General Services Administration's USA.gov Web site, which runs on Vivisimo's clustering software via Microsoft's MSN search service. If you sort the documents by agency, you will get two major clusters, each with a different focus.

Because Bunker Hill is a national park, the National Park Service has information pertaining to the visitor and historical aspects of Bunker Hill. But Bunker Hill is also an Environmental Protection Agency Superfund site, because of toxic wastes caused by decades of mining. So links to the Centers for Disease Control and Prevention's Agency for Toxic Substances and Disease Registry are also presented, under a separate grouping. By clustering these two large sets of documents, USA.gov makes it easier for users to find what they are looking for.

Many search engines also use a related technique called Bayesian Inference, which looks at the mathematical distribution of words in a document and compares it with other documents, Lynch noted. As its name states, Bayesian Inference infers the major ideas behind the creation of a document, using the words as pointers to that idea.

For instance, people who write about the topic of dogs will tend to use the same set of words, even if some never use the word dog itself, Lynch noted. 'The big advantage to this technology is that it adapts to a changing world,' Lynch said, adding that, as new ideas make their way through our culture, the search engine should understand them at about the same time people do.
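To illustrate the general idea of inferring a topic from word distributions, here is a sketch of a naive Bayes classifier, one of the simplest applications of Bayes' rule to text. It is a swapped-in teaching example, not a description of Autonomy's software, and the topics and training snippets are invented.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count words and documents per topic from (topic, text) pairs."""
    word_counts, doc_counts = {}, Counter()
    for topic, text in labeled_docs:
        doc_counts[topic] += 1
        word_counts.setdefault(topic, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the topic with the highest posterior probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, counts in word_counts.items():
        # Log prior for the topic, then add a log likelihood for each word.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical training snippets. None of them contains the word "dog".
training = [
    ("dogs", "puppy leash bark kennel"),
    ("dogs", "leash walk bark vet"),
    ("search", "query index ranking results"),
    ("search", "index query relevance engine"),
]
word_counts, doc_counts = train(training)

predicted = classify("bark on a leash", word_counts, doc_counts)
print(predicted)  # dogs
```

The classifier assigns the text to the dogs topic purely from the company its words keep, which is the effect Lynch describes. Retraining on newer documents is what lets this kind of model pick up new vocabulary over time.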

The Google influence

A more recent step forward in search technology has been the PageRank algorithm from Google, Melzer said. PageRank is a method of weighing the importance of each page by the number of other pages that link to that page. Because a site such as NBC.com gets plenty of links from other sites, it may rank more highly in a search on, say, television, than the site for Bob's Television Repair Shop, which may have only a few inbound links.
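The link-counting idea can be sketched as an iterative calculation in which each page repeatedly passes a share of its score to the pages it links to. The Python below is a simplified version of the published algorithm. The damping value of 0.85, the fixed iteration count and the tiny link graph with made-up site names are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three sites link to a big network's site,
# but only one of them links to a small repair shop.
links = {
    "blog.example": ["network.example"],
    "news.example": ["network.example"],
    "fan.example": ["network.example", "repair.example"],
    "network.example": [],
    "repair.example": [],
}
ranks = pagerank(links)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The heavily linked site comes out on top, mirroring the NBC.com example. A link from a highly ranked page also carries more weight than one from an obscure page, which is what distinguishes this from a simple inbound-link count.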

One thing to keep in mind about Google, however, is that PageRank only works well on the Web, where pages routinely link to one another. In enterprise environments, which typically contain more stand-alone documents, linking is rare, so the effectiveness of this approach is minimal.

'Generally speaking, enterprise search is not a popularity contest,' Lynch said.

Nevertheless, Google and other Web search engines radically changed users' ideas of what should constitute a search term. Most people now think of searches as one or two words. 'You put in the word "Sears" and Google gives you "Sears.com" every time,' Melzer said. Many older search-and-retrieval engines did not handle such queries very well, relying instead on people to enter Boolean strings or other advanced querying methods.

'On the Web, you have people who are typing in very short, very general search terms. I think that was the way Google revolutionized search. They weren't doing information retrieval the way everyone else was,' Melzer said. Now the enterprise search companies must catch up with users' perception of what search is.

Out in the marketplace, other search companies look for other approaches to bring more relevant information to users.

Sometimes, information about a document can be garnered from its mere location. One of the ways search software from Isys Search Software categorizes its source material is to take into consideration the names of folders in which the documents reside.

'Usually, there is some sort of structure to the directories,' said Derek Murphy, president of U.S. operations at Isys. The hierarchy can help in the categorization of the documents.
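As a simple illustration of the idea (not Isys's actual method, which the article does not detail), the folder names above a file can be read straight off its path and treated as candidate category labels. The path and root folder below are hypothetical.

```python
from pathlib import PurePosixPath


def categories_from_path(path, root):
    """Return the folder names between a root folder and the file itself."""
    relative = PurePosixPath(path).relative_to(root)
    # Drop the last part, which is the file name.
    return list(relative.parts[:-1])


# Hypothetical shared-drive location.
labels = categories_from_path(
    "/shared/preservation/mold/cleaning-books.doc", "/shared"
)
print(labels)  # ['preservation', 'mold']
```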

Metadata can be useful in helping refine results. This is information that the user enters into the document about the document itself, such as who the author is and when it was written.

Isys software, for instance, can be configured to weigh metadata higher or lower than data within the documents themselves. If no one in your organization is filling in the metadata in documents, then the organization can de-emphasize that in the search configuration. But if the organization has a content management system that creates a lot of metadata automatically, that data can be put to use.

Isys has a built-in search function called 'espin' that focuses on the metadata fields. A search such as 'Einstein espin Author' would return all those hits that carried the word Einstein, but it would put documents that listed Einstein as the author near the top, Murphy noted.
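The effect of weighting metadata differently from body text can be sketched with a simple scoring function. This Python example is a generic illustration and does not reproduce the Isys query syntax or ranking. The weights, field names and documents are assumptions.

```python
def score(doc, term, field, body_weight=1.0, field_weight=5.0):
    """Score one document, counting a metadata match more than a body match."""
    term = term.lower()
    total = 0.0
    if term in doc["body"].lower().split():
        total += body_weight
    if term in doc["metadata"].get(field, "").lower().split():
        total += field_weight
    return total


def search(docs, term, field, **weights):
    """Return titles of matching documents, highest score first."""
    scored = [(score(doc, term, field, **weights), doc["title"]) for doc in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [title for points, title in ranked if points > 0]


# Hypothetical documents.
docs = [
    {"title": "Biography", "body": "A biography of Einstein",
     "metadata": {"author": "Smith"}},
    {"title": "Physics paper", "body": "Notes on moving bodies",
     "metadata": {"author": "Albert Einstein"}},
    {"title": "Budget", "body": "Annual budget",
     "metadata": {"author": "Jones"}},
]

results = search(docs, "Einstein", "author")
print(results)  # ['Physics paper', 'Biography']
```

An organization whose metadata is unreliable could lower field_weight (for example, search(docs, "Einstein", "author", field_weight=0.5)), which would move the body-text match back to the top.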

Isys also uses a technique called entity extraction. As the search engine indexes a document, it can identify things such as names, street addresses and e-mail addresses, following a list of rules built into the engine. Organizations also can add their own rules. When a user runs a search, the entities that turn up in the results are listed down the right-hand side of the page.
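Rule-based entity extraction of this kind is often built on pattern matching. The sketch below uses Python regular expressions as the rules. The patterns are deliberately simplified, the phone-number rule stands in for an organization's custom rule, and the sample text is invented. The article does not describe how Isys expresses its rules.

```python
import re

# Built-in rule (simplified): something that looks like an e-mail address.
DEFAULT_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_entities(text, rules):
    """Apply each named rule to the text and collect what it matches."""
    return {name: pattern.findall(text) for name, pattern in rules.items()}


# An organization adds its own rule, here a U.S.-style phone number.
custom_rules = dict(DEFAULT_RULES)
custom_rules["phone"] = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Hypothetical document text.
text = "Contact jane.doe@example.gov or call 202-555-0143 for details."
entities = extract_entities(text, custom_rules)

print(entities)
# {'email': ['jane.doe@example.gov'], 'phone': ['202-555-0143']}
```

Running the rules at indexing time, as the article describes, means the extracted entities can be stored alongside each document and listed next to the results without rescanning the text at query time.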

Search engine companies also are starting to draw on other areas of computational science in hopes of giving greater context to the words on the page. In one case, business intelligence software provider Cognos has been looking at ways to export the knowledge created in BI systems to aid in search, said Paul Hulford, product marketing manager at the company.

Hulford noted that the average BI customer will spend a lot of time mapping out templates for reports. So when these reports are run within the BI system, either on a scheduled interval or when requested by the user, they draw up-to-the-minute data from the organization's databases.

Last fall, Cognos extended its Cognos 8 Business Intelligence platform so that its dynamically updated reports can be inserted into what a search engine can index. The results might not even be compiled before a user requests a copy of the report. The search engine will offer the user the ability to compile a report, though. In this way, BI is extending search into documents that haven't been created yet, Hulford said.

Text mining is another field that is benefiting search. One text-mining company applying its efforts to search is Attensity, whose software analyzes documents and extracts nuggets of information that later can be offered to users as points of information.

The software looks for who did what to whom, said Michelle de Haaff, vice president of marketing at the company. More formally, what is being extracted are triples, she said, each of which combines a subject, an object and a predicate that defines the relationship between the two.
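A triple is easy to picture as a small data structure. The Python sketch below shows only how extracted triples might be stored and looked up. It does not attempt the extraction itself, and the facts and field names are invented for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Who (subject) did what (predicate) to whom (obj)."""
    subject: str
    predicate: str
    obj: str


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that was supplied."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical facts, as if pulled from a set of documents.
facts = [
    Triple("Agency A", "awarded", "Contract 1"),
    Triple("Agency A", "audited", "Vendor B"),
    Triple("Vendor B", "won", "Contract 1"),
]

about_agency = match(facts, subject="Agency A")
about_contract = match(facts, obj="Contract 1")

print(len(about_agency))                    # 2
print([t.subject for t in about_contract])  # ['Agency A', 'Vendor B']
```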

For instance, with a given set of literature about the Boeing 737, the software could extract when the plane was first designed, when it was built and how many are flying now, if that information is found in the source document. Search engine providers can then offer these results alongside or on top of a more standard list of relevant documents. The software can work with a wide variety of formats as well, including 'Word documents, databases, PowerPoint presentations, something that has been optically scanned,' de Haaff said. 'We have no restrictions.'
