'The nice thing is that we don't have to store' all the information, says Helen Mitchell of FDA's Center for Drug Evaluation and Research.
David S. Spence
Advanced search engines link many data sources
NRC's Dan Graser and his team found a central search engine let them avoid building a costly data warehouse.
A commercial search engine ties together multiple pharmaceutical databases for the Food and Drug Administration's Center for Drug Evaluation and Research.
A different search engine is helping the Nuclear Regulatory Commission level the mounds of paperwork for upcoming hearings on radioactive waste disposal at Yucca Mountain, Nev.
At both agencies, the search engines must interface with widely varying formats and repositories.
FDA and NRC decided to give their employees and others access through a browser interface, rather than building data warehouses to aggregate the volumes of information.
'The nice thing is that we don't have to store it all,' said Helen Mitchell, enterprise search product manager at the Center for Drug Evaluation and Research, which built a search gateway to 15 data repositories. 'All we have to do is point to and index it.'
With so many government repositories available, employees often squander time doing the same searches with different search engines, said Dave Connor, federal vice president for search engine provider Convera Corp. of Vienna, Va.
Connor said he has seen analysts at intelligence agencies spend most of their days jumping from one search engine to another, trying to find information on a single topic.
Convera, like other search vendors, sells an alternative approach. Agencies buy a basic search package and add modules to access specific formats, Connor said.
'Most of our government customers collect information from multiple repositories,' said John Cronin, government sector vice president at search vendor Autonomy Corp. PLC of Cambridge, England. 'They have multiple file servers, databases and intranet Web pages,' Cronin said. 'In the old days, they'd do a brute-force integration and build a data warehouse. But you don't have to. You can leave things in their various formats and locations, and they can appear to be integrated.'
The Center for Drug Evaluation and Research set up a single search page with RetrievalWare software from Convera so its 2,500 scientific reviewers could collect data more efficiently for investigations. The workers 'basically can make a single integrated query of all the different libraries,' Mitchell said.
To search 15 electronic repositories on the center's intranet, they simply open a browser and type 'Enterprise Search' to view the internal collections.
The center, a consumer watchdog for U.S. health care systems, tracks reports about medications and other products after they have been released into the marketplace.
In 1993, the center started a master file of individual drugs. It worked aggressively to make everything available in electronic form at a time when submissions from pharmaceutical companies more often came on paper.
So the center scanned and converted paper documents into TIFF files. More than one researcher could view a file at the same time, and multiple copies didn't consume storage. A pharmaceutical company could refer in subsequent filings to a case number it had submitted previously (on, say, the packaging of a product) rather than resending the information each time.
The staff liked the electronic library, so the center decided to add more data resources. One was a collection of adverse event reports.
FDA gets about 130,000 reports each year of bad reactions to medications and other products.
Those in paper format were scanned and kept on microfiche, with the metadata stored in an Oracle database. Although the metadata didn't include the full narratives or supporting documents, it gave researchers enough information to search for selected attributes, such as the location of an incident or a victim's age or gender.
Once those files were linked to the search site, researchers could use the metadata to bring up scanned images of the paper files.
In October 2003, the center began indexing another data source: FDA's repository of drug applications in Adobe Portable Document Format, stored in a document management system from Documentum Inc. of Pleasanton, Calif.
Most popular
This last addition has been the most popular of all, Mitchell said. 'Researchers in a meeting can pull up RetrievalWare and search for all reviews of a new drug application. Or maybe they want to search for documents in a particular date range, or they want just the chemistry reviews. They can specify that type of information.'
The center set up one library of more than 65,000 documents in a collaboration area, so groups of researchers can maintain directories of items with shared interest. The documents can be in Microsoft Word or Excel, Corel WordPerfect or other formats.
There are other new libraries for pulmonary drugs, biopharmaceutics, terrorist concerns and help-desk documents.
Mitchell is always on the lookout for more libraries to add. A department with information it wants to share only has to notify her about which directories to index.
'It's a work in progress,' she said.
At NRC, the approach is slightly different. The commission set up a site called the Licensing Support Network, at www.lsnnet.gov, to corral all the electronic files for the Energy Department's application to house a radioactive waste repository at Yucca Mountain.
The documents reside across many servers of groups participating in the application hearing.
The distributed approach 'is a little bit unusual' for building a legal document discovery database, said Dan Graser, administrator of NRC's Licensing Support Network. But he felt it was the best approach.
'There are always questions about the pedigree of a document that somebody introduces into evidence,' Graser said. 'We figured it would be better for the parties to maintain custody of their own documents.'
Energy's proposal to bury tons of nuclear waste at Yucca Mountain is in hot dispute. The state of Nevada is expected to contest it, as are many nearby counties, the National Congress of American Indians, environmental groups and other parties.
NRC, which will hold the hearing on the application, must provide the parties with electronic access to all legal documents.
Graser and his team decided the best way to make everything available would be through a central search engine, so interested parties would not have to go to multiple sites. Yet there was no justification to build a costly data warehouse for one hearing.
'The parties maintain their own document collection servers,' Graser said. 'They have custody of their own material. What we do is spider those collections of material nightly.'
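The arrangement Graser describes, in which a central engine indexes documents that stay on the parties' own servers, can be illustrated with a minimal sketch. It is not NRC's or Autonomy's implementation. The party names, document IDs and texts below are hypothetical, and an in-memory dictionary stands in for the remote servers a real crawler would visit each night. The point is that the index stores only pointers to the documents, not the documents themselves.

```python
from collections import defaultdict

# Hypothetical stand-ins for the document servers each party maintains.
PARTY_COLLECTIONS = {
    "party-a": {
        "brief-001": "Contention on groundwater flow near the repository",
        "brief-002": "Transportation routes for waste casks",
    },
    "party-b": {
        "memo-17": "Groundwater sampling results and repository design",
    },
}


def crawl(collections):
    """Build an inverted index: term -> set of (party, doc_id) pointers."""
    index = defaultdict(set)
    for party, docs in collections.items():
        for doc_id, text in docs.items():
            for term in text.lower().split():
                # Only the pointer is stored; the text stays with its owner.
                index[term].add((party, doc_id))
    return index


def search(index, query):
    """Return sorted pointers to documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = crawl(PARTY_COLLECTIONS)
print(search(index, "groundwater repository"))
```

Rebuilding the index on a schedule, as a nightly spider would, picks up new filings without the central site ever taking custody of them.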
Currently 20,000 legal briefs, contentions and other papers are online. Eventually, there will be about 15 million pieces of information as legal teams generate more material.
The agency started the portal project in October 2001. It purchased the Intelligent Data Operating Layer Server, an Autonomy software suite with both portal and search features. The purchase was through reseller AT&T Government Solutions of Vienna, Va.
Multiple formats
Graser said Autonomy proved to be a good choice because its search software can handle many formats. The documents are in Extensible Markup Language, HTML, PDF, text files, TIFF image files and other formats, said Matt Schmit, the project manager.
NRC also looked at a portal and search suite from Vignette Corp. of Austin, Texas, but Graser said Autonomy's search engine met their needs better.
'These documents are very large, very dense and very rich,' Graser said, and also very similar to each other. With so many common terms, a conventional search system would return far too many hits to be useful.
'One of the things we were looking for was a lot of latitude in relevancy ranking and a natural-language user interface,' Graser said.
Relevancy ranking places documents that probably are the most useful at the top of the results. A natural-language interface accepts queries in simple English.
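To illustrate the idea of relevancy ranking, here is a minimal sketch using TF-IDF, a generic textbook scoring method. The article does not say how Autonomy's software ranks results, so this should not be read as its algorithm. The document IDs and texts are hypothetical. Terms that appear in many documents count for less, which is how such a scheme copes with collections that share a lot of vocabulary.

```python
import math
from collections import Counter

# Hypothetical sample documents for illustration only.
DOCS = {
    "d1": "waste repository license application",
    "d2": "waste transport waste casks",
    "d3": "seismic study of repository site seismic hazard",
}


def rank(query, docs):
    """Rank document ids by a simple TF-IDF score, best match first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # Rare terms get a high weight; a term found in every
                # document gets log(1) = 0 and contributes nothing.
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(rank("seismic repository", DOCS))
```

In this sample, the query "seismic repository" puts d3 ahead of d1 because "seismic" appears in only one document while "repository" appears in two, and d2 drops out because it matches neither term.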
AT&T hosts the search facility in Ashburn, Va., with 18 Hewlett-Packard Compaq servers running Microsoft Windows Server 2003. The site serves public users, and there is a priority version for the participating parties.