Deep Web

From left, Eleanor Frierson, Tom Lahr and Walter Warnick are lining up new Science.gov services.

Science.gov 2.0 plumbs depths of federal data

Scientists at federal agencies have filled billions of database fields and millions of Web pages with their research results. Until recently, much of that material was out of reach of commercial search engines.

The 2-year-old Science.gov portal now can reach 47 million agency pages as well as databases, drilling down into what researchers call 'the deep Web.'

Science.gov 2.0, which went live last month, takes a further step toward unifying the presentation of hits found by the portal's Distributed Explorit engine.

The original portal could group results only by the databases from which they came. Version 2.0 presents the hits in one set, ordered by their usefulness to the searcher.

The first incarnation 'overwhelmed you with too much information, especially if you had a broad query term,' said Walter Warnick, director of the Energy Department's Office of Scientific and Technical Information and a founder of the cross-agency Science.gov Alliance.

The alliance, with members from 11 agencies, wanted to 'bring isolated islands of information together and make them all searchable by one query,' Warnick said.

The advanced search technology itself grew out of Energy-funded research. In May 2003, the department's Small Business Innovation Research program awarded Deep Web Technologies LLC of Los Alamos, N.M., a $99,000 grant to develop advanced document sorting techniques.

Although the research wasn't originally intended for Science.gov, it quickly found a home there because the alliance wanted its software to make more and better judgments about hits, said Deep Web founder Abe Lederman.

The research abstract stressed 'machine-learning heuristics to minimize the processing required to find the best documents,' he said.

Lederman said scientific users have particular search needs that cannot be met by a commercial engine. They generally look for a set of documents, not a single page, and often they want all the material that exists about a topic.

A search term such as 'pesticides' spans agency research at the Agriculture Department, NASA, the Navy and other agencies not ordinarily associated with pesticides, said Eleanor Frierson, who co-chairs the Science.gov Alliance and is deputy director of the National Agricultural Library.

Lonely hunter

Furthermore, commercial search engines do not index databases, where most scientific papers are stored. So researchers until now have had to go from agency site to agency site, hunting down relevant information.

'You would first have to determine what agencies were likely to have your information, and then you would have to find the particular database that would hold that information,' Warnick said.

In 2000 and 2001, an interagency group called CENDI, short for Commerce, Energy, NASA and Defense Information Managers Group, sponsored several conferences. The organizers wanted to explore new technologies for easing dissemination of scientific research.

CENDI reported after the April 2001 conference that 'agencies can take full advantage of the Internet and other technologies to overcome arbitrary boundaries so that the government can provide the public with seamless, dependable online services.'

The Science.gov Alliance, formed after the CENDI workshops, received two grants of about $90,000 each to set up the first Science.gov portal. One grant came from the Office of Management and Budget, the other from Energy's Office of Scientific and Technical Information.

The portal's search engine architecture has two components. The first, managed by the Geological Survey, has a directory of about 1,800 static government research sites. The other part grabs information from about 30 government databases.

When a query is entered, the search software sends commands embedded in HTML to each individual database's search engine.
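The fan-out described above can be sketched roughly as follows. This is a minimal Python illustration, not the Distributed Explorit implementation: the source names, the `search_source` and `search_all` functions, and the in-memory lists standing in for each database's search engine are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two agency databases. A real portal would
# send the query over the network to each database's own search engine.
SOURCES = {
    "agriculture_db": ["Pesticides and crop yield", "Soil moisture survey"],
    "navy_db": ["Shipboard pesticides handling", "Sonar calibration"],
}

def search_source(name, query):
    """Return (source, title) pairs from one source whose title contains the query."""
    q = query.lower()
    return [(name, title) for title in SOURCES[name] if q in title.lower()]

def search_all(query):
    """Send the same query to every source in parallel and merge the hits."""
    names = sorted(SOURCES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map returns results in the same order as `names`.
        per_source = pool.map(lambda n: search_source(n, query), names)
        return [hit for hits in per_source for hit in hits]

hits = search_all("pesticides")
```

Here `hits` holds one merged list drawn from both stand-in sources, so the searcher issues a single query instead of visiting each database in turn.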

The Deep Web software ranks results in order of relevance, using several techniques. For example, it ranks a research paper with the search term in its title higher than one with the term only in the text.
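The title-versus-text preference can be illustrated with a simple weighted score. The sketch below is an assumption-laden example in Python, not the Deep Web ranking code: the weights, the `score` and `rank` functions, and the sample documents are all invented for illustration, and the article gives no actual numbers.

```python
# Assumed weights: the only grounded fact is that a title match outranks
# a match that appears only in the text.
TITLE_WEIGHT = 3
BODY_WEIGHT = 1

def score(doc, term):
    """Count occurrences of the term, weighting title matches more heavily."""
    t = term.lower()
    title_hits = doc["title"].lower().count(t)
    body_hits = doc["text"].lower().count(t)
    return TITLE_WEIGHT * title_hits + BODY_WEIGHT * body_hits

def rank(docs, term):
    """Return the documents that mention the term, highest score first."""
    scored = [(score(d, term), d) for d in docs]
    matching = [(s, d) for s, d in scored if s > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in matching]

docs = [
    {"title": "Groundwater monitoring", "text": "Traces of pesticides were found."},
    {"title": "Pesticides in orchards", "text": "A field study."},
    {"title": "Sonar calibration", "text": "No relevant terms."},
]
ranked = rank(docs, "pesticides")
```

With these sample documents, the paper with the term in its title comes first, the one that mentions it only in the text comes second, and the unrelated paper is dropped.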

The alliance continues to recruit agencies for Science.gov. The portal has pointers to an estimated 90 percent of government research, but some areas remain untouched.

An agency interested in joining the alliance has to pay $7,500 per year and ensure that its own content is ready for searching, Frierson said.

Science.gov 3.0, expected next May, will support Boolean searches, which combine multiple terms with qualifiers. For instance, the search phrase 'food safety not pesticides' could find documents about food safety that do not involve pesticides. The future site will also automatically notify researchers of newly posted documents in their fields of interest.
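A Boolean 'not' query of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `parse_not_query` and `boolean_search` functions, the plain substring matching and the sample documents are hypothetical, and the article does not describe how Science.gov 3.0 will parse or evaluate queries.

```python
def parse_not_query(query):
    """Split 'A not B' into (include, exclude); exclude is None if there is no 'not'."""
    include, sep, exclude = query.lower().partition(" not ")
    return include.strip(), (exclude.strip() if sep else None)

def boolean_search(docs, query):
    """Keep documents that contain the included phrase and lack the excluded term."""
    include, exclude = parse_not_query(query)
    results = []
    for doc in docs:
        text = doc.lower()
        if include in text and (exclude is None or exclude not in text):
            results.append(doc)
    return results

# Hypothetical sample documents, reduced to one-line descriptions.
docs = [
    "Food safety rules for dairy",
    "Food safety and pesticides residue",
    "Pesticides in orchards",
]
filtered = boolean_search(docs, "food safety not pesticides")
```

In this sample, only the dairy document survives the filter: the second document mentions pesticides and the third does not mention food safety.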

Science.gov, which runs on Office of Scientific and Technical Information servers at Oak Ridge National Laboratory in Tennessee, uses Apache Web server freeware and the Sun Solaris operating system.
