A sharp eye for details

Enterprise search systems must scour more than ever before

Giving users near-instant access to precise information is a tall order, even within a specific application. But expand that requirement across all an agency's data, or even sources of related data, and the task could seem insurmountable. Yet that's the promise of enterprise search technology.

Say 'search' and most people think of consumer sites such as Google, Yahoo and Microsoft MSN. These sites index a significant portion of the public Web, including not just HTML pages but documents, images and now even video.

But the demands on enterprise search engines are different. They need to reach not just into Web sites, but also into network file systems, e-mail repositories, content and record management systems, and even databases and business application platforms. And while users can often tolerate Web search engines serving up piles of marginal results, enterprise users can't afford to spend the day clicking through such results.

'From our point of view, search is a means, part of an application or a business process,' said Lubor Ptacek, director of software product marketing for storage and document management vendor EMC Corp. of Hopkinton, Mass.

Search gets specialized

As you might expect, government applications and business processes can be unique. So agencies' search platforms should meet a variety of important criteria. For one thing, the kind of information some government users look for can put much greater demands on search systems. Those searches may be for specific citations in much larger documents, or for the recurrence of terms in multiple documents in different languages, or for patterns of information such as phone and Social Security numbers. Moreover, the information may have associated security levels, requiring integrated access controls. Fortunately, evolving search technology can support most agency missions.

Enterprise search is rapidly encompassing a number of other technologies, ranging from ad-hoc database query tools to advanced pattern recognition and relationship analysis software. At the same time, search is being incorporated directly into an array of other applications. For example, there's software called Splunk, a search tool that can find patterns in system and error logs to help administrators pinpoint the source of an IT problem.

Enterprise search platforms generally fall into four categories: departmental-level systems intended for smaller sets of documents within a network or intranet; general-purpose enterprise search systems such as Google's Search Appliance; high-end, customizable systems for specialized types of search; and document management systems that can be adapted.

The lines between those groups are becoming blurry. Depending on the application, any or all of them may end up being part of a proposal. While document management companies are providing connectors to external search engines for certain tasks, search-engine vendors offer integration with content repositories generally considered document management turf.

Consider video search, for example, which has been the exclusive domain of specialized digital-asset management systems in the past. Fast Search & Transfer, an enterprise search vendor, offers the ability to search video footage based on text in closed-caption content, as well as voice-to-text translation, shape comparison, sound recognition and other criteria. FAST's technology is used not only by organizations managing their own content, but also by law enforcement agencies to ferret out child pornography and other illegal content.

Part and parcel

Vendors of enterprise content management and records management systems, such as Waterloo, Ontario-based Open Text Corp. and EMC's Documentum division, approach enterprise search as a component of their applications.

'Really, enterprise search is about going across all those silos of information included in records management systems, as well as documents that fall outside of RM solutions today,' said David Schubmehl, vice president for discovery products at Open Text.

Although not considered search companies themselves, Open Text and Documentum both rely on technology called 'search federation': executing searches across multiple data and document repositories and providing those results along with search results from within the application or system users are accessing. They use software connectors to access data in repositories of other applications, or even external search engines and online data services.
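
The federation idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the connector functions, the repository names (`records_system`, `file_share`) and the sample documents are all hypothetical, and a real connector would call another application's search interface rather than scan an in-memory list.

```python
from typing import Callable, Dict, List

# A connector is any callable that takes a query and returns matching titles.
Connector = Callable[[str], List[str]]


def make_connector(documents: List[str]) -> Connector:
    """Build a toy connector over an in-memory list (stand-in for a real repository)."""
    def search(query: str) -> List[str]:
        q = query.lower()
        return [doc for doc in documents if q in doc.lower()]
    return search


def federated_search(query: str, connectors: Dict[str, Connector]) -> List[dict]:
    """Send one query to every connector and merge the hits, tagged by source."""
    results = []
    for source, search in connectors.items():
        for hit in search(query):
            results.append({"source": source, "title": hit})
    return results


# Hypothetical repositories, for illustration only.
connectors = {
    "records_system": make_connector(["Budget memo 2006", "Travel policy"]),
    "file_share": make_connector(["budget spreadsheet.xls", "Org chart"]),
}
hits = federated_search("budget", connectors)
```

Tagging each hit with its source is one simple way to let users see which repository a result came from; production systems would also have to merge rankings and respect each repository's access controls.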

'We have the ability to organize and classify the information in multiple repositories,' Schubmehl said, 'and provide some taxonomic way to browse it.'

In fact, many search vendors offer federation. Vivisimo's implementation of search for FirstGov.gov [GCN.com, GCN.com/578], for example, relies not just on its own crawl of .gov sites, but also federates searches across the MSN Web search engine.

Google has also moved toward search federation in its Google Search Appliance with the introduction of OneBox (as in, one search query box), which provides access to external data sources such as those from Cisco, Cognos, Employease, Netsuite, Oracle, Salesforce.com and SAS.

'You can type a purchase order number into the search box and get information from Oracle Financials,' said Matt Glotzbach, head of Google's enterprise products. 'Or you can type in [a query] and get information back from a customer relationship management application.'

Google can also integrate its search technology with an enterprise version of its Desktop Search tool. Or agencies could pull in data from Google Earth Enterprise in order to overlay search results on a map, three-dimensional model or satellite image.

Progress Software's EasyAsk, another enterprise search tool, takes a unique approach. It provides different types of query and navigation interfaces for different types of data. 'We give special emphasis to the universe of reports,' said Dr. Larry Harris, vice president and general manager of Progress' EasyAsk division. 'The conventional view of search, full-text indexing, is fine for unstructured data. But as you move toward more structured data, the methodology works less well: records in a database are treated as documents, which ignores the benefit of the structure they're stored in.'
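
A small Python sketch shows the limitation Harris describes. It is not EasyAsk's method; the records, field names and both search functions are assumptions made up for illustration. Treating each record as a flat document cannot answer a question such as "sales of at least 1,000," while a query that uses the fields can.

```python
from typing import Dict, List, Optional

# Hypothetical report records; the field names are assumptions for illustration.
records = [
    {"region": "East", "quarter": "Q1", "sales": 1200},
    {"region": "West", "quarter": "Q1", "sales": 800},
]


def full_text_search(query: str, rows: List[Dict]) -> List[Dict]:
    """Treat each record as a flat document and match on a substring."""
    q = query.lower()
    return [r for r in rows if q in " ".join(str(v) for v in r.values()).lower()]


def structured_search(rows: List[Dict], region: Optional[str] = None,
                      min_sales: Optional[int] = None) -> List[Dict]:
    """Use the record structure: filter on named fields and numeric ranges."""
    return [
        r for r in rows
        if (region is None or r["region"] == region)
        and (min_sales is None or r["sales"] >= min_sales)
    ]


# No record contains the literal text "1000", so the text search finds nothing.
text_hits = full_text_search("1000", records)
# The field-aware query finds the record whose sales meet the threshold.
field_hits = structured_search(records, min_sales=1000)
```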

Just how accurate search platforms are is determined largely by the content 'map' they have to work with. These maps are often generated around technologies such as taxonomies, metadata and entity extraction.

A map of content

A taxonomy is a defined structure of content classification. Document and records management systems usually come with at least one predefined taxonomy, as well as tools for organizations to import or build their own taxonomies. Enterprise search engines can use those same taxonomies to categorize the content in enterprise file systems and other information repositories.
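
As a rough illustration, a taxonomy can be modeled as a tree of categories. The Python sketch below assumes a made-up taxonomy and a hypothetical helper, `category_path`, that returns the branch leading to a given category; real document management systems store and manage taxonomies in their own formats.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: each key is a category, each value holds its subcategories.
TAXONOMY: Dict[str, dict] = {
    "Policy": {"Border Security": {}, "Emergency Response": {"Hurricanes": {}}},
    "Strategy": {"Cybersecurity": {}},
}


def category_path(target: str, tree: Dict[str, dict],
                  prefix: Optional[List[str]] = None) -> Optional[List[str]]:
    """Return the path from the root to the target category, or None if absent."""
    prefix = prefix or []
    for name, children in tree.items():
        path = prefix + [name]
        if name == target:
            return path
        found = category_path(target, children, path)
        if found:
            return found
    return None


path = category_path("Hurricanes", TAXONOMY)
```

A search engine that knows this path can file a document under 'Hurricanes' and still return it to someone browsing the broader 'Policy' branch.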

Getting information into a taxonomy automatically requires powerful content processing tools. Those tools depend on two sources of information: metadata, or data about the data, and actual information within a document itself, including identifiable names and concepts commonly referred to as 'entities.'

Entity extraction tools find blocks of information that match sets of defined entity types (the names of people or places, phone numbers, addresses, etc.) and create indexing information for the document or data. A similar technique relies on rules-based processing to determine the proximity of words to each other, thus discovering concepts within information.
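
For pattern-like entities, the simplest form of extraction is a set of regular expressions. The Python sketch below is an assumption-laden illustration, not how any product named here works: the two patterns cover only one U.S. phone format and the dashed Social Security number format, and the sample text uses well-known placeholder numbers. Names of people and places generally require dictionaries or statistical models rather than patterns.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real extractors handle far more formats.
ENTITY_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return every match for each entity type found in the text."""
    return {name: pattern.findall(text) for name, pattern in ENTITY_PATTERNS.items()}


# Placeholder values, not real personal data.
sample = "Call 301-555-0142 about the file for 078-05-1120."
entities = extract_entities(sample)
```

The resulting dictionary is the kind of indexing information an engine could store alongside a document so that users can search by entity type instead of exact text.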

At the Homeland Security Department, officials use a FAST search engine along with a tool called Teragram Categorizer to automatically categorize and extract concept information from policy and strategy documents in its Homeland Security Digital Library.

'We have 20 Boolean operators for classification,' said Dr. Yves Schabes, co-founder of Teragram Corp. of Cambridge, Mass. 'You can group things based on concepts. For example, if you want to define a rule that recognizes publicly traded companies, if more than one is found then the item is about 'business.' '
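
The rule Schabes describes can be approximated in a short Python sketch. This is not Teragram's rule language or API; the company list, `find_companies` and `categorize` are hypothetical names chosen for illustration, and a real system would draw on a maintained dictionary of companies.

```python
import re
from typing import List, Set

# Illustrative list only; a real system would use a maintained company dictionary.
PUBLIC_COMPANIES = {"Oracle", "Cisco", "Cognos"}


def find_companies(text: str) -> Set[str]:
    """Return the listed company names that appear as whole words in the text."""
    return {
        name for name in PUBLIC_COMPANIES
        if re.search(rf"\b{re.escape(name)}\b", text)
    }


def categorize(text: str) -> List[str]:
    """Apply the rule: more than one company mentioned means 'business'."""
    categories = []
    if len(find_companies(text)) > 1:
        categories.append("business")
    return categories


labels = categorize("Oracle and Cisco announced a partnership.")
```

Requiring more than one match is a simple guard against filing a document under 'business' because of a single passing mention.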

When you build an enterprise search platform, however, keep in mind that the concepts and entities within an information source can matter more to many searchers than how that content falls into a predefined taxonomy.

'Too often enterprise search deployments focus too much on what data should be indexed in the system and not on how people find information,' said Bob Tennant, CEO of Recommind Inc. of San Francisco. 'For example, people often look for information based on a concept, without even knowing the right key words. Once they have found information that fits the context of their search, they may need to dig deeper, exploring more than one possible angle. Finally, they often need to relate the specific information they find with other organizational information in order to act.'

And that's part of the search quandary. One size does not fit all. You must sit down with business process stakeholders to learn how people work. Looking for intelligence data in foreign-language content? You'll need specialized entity extraction and text analytics software. Just want out-of-the-box search for your departmental LAN? Google's appliance or Coveo's downloadable Windows-based search product might be enough. And there are specialized search tools for legal discovery, intellectual-property policing and nearly every other imaginable content analysis task.

S. Michael Gallagher is a freelance writer based in Maryland.