USGS creates a trainable bot to filter Web searches

By Patricia Daukantas

GCN Staff

The Geological Survey is spearheading a multiagency effort to glean biological information from the Web.



The National Biological Information Infrastructure (NBII) so far encompasses BioBot, a customizable search engine for biological information; a metadata effort to standardize descriptions of biological data sources; and FrogWeb, a site devoted to describing amphibians and their place in the environment.

The NBII site, at www.nbii.gov, also provides extensive links to government and nongovernment Web sites dealing with botany, biodiversity and other biological (not medical) topics.

"I always say NBII is the biological component of the Internet," said Michael T. Frame, deputy director of USGS' Center for Biological Informatics in Boulder, Colo. "We're trying to make it more manageable and make sure there's authoritative information out there."

NBII grew out of the former National Biological Service, which began in 1993 and became part of USGS in 1996. The USGS Biological Resources Division leads the NBII effort with partners that include nine other federal agencies and numerous international and academic organizations.

The BioBot tool lets users search NBII's own Web pages, the research site Biolinks.com and the biology-related sections of four large, general search engines.

Frame said he looked at various search engines and chose ProFusion from IntelliSeek Inc. of Cincinnati.

The ProFusion metasearch engine can look at multiple sources simultaneously, said Susan Gauch, associate professor of electrical engineering and computer science at the University of Kansas in Lawrence, Kan. Gauch developed ProFusion as a research project before she and a partner spun it off as a company.

Speedy searches

The metasearch engine acts as six hidden browsers that query other search engines. It then parses the results into a single Web page for the user, Gauch said. The original ProFusion was written in Perl, but a soon-to-debut C++ version will add a bit more speed.
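The fan-out-and-merge pattern Gauch describes can be sketched in a few lines. This is not ProFusion's code; the engine stubs below are hypothetical stand-ins for the real search back ends, which would be queried over HTTP.

```python
# Minimal sketch of the metasearch pattern: several "hidden browsers"
# query back-end engines in parallel, and the hit lists are merged
# into one result set. Engine stubs are illustrative, not real APIs.
from concurrent.futures import ThreadPoolExecutor

def make_engine(name, hits):
    """Return a stub search function for one back-end engine."""
    def search(query):
        # A real engine would issue an HTTP request and parse the HTML.
        return [f"{name}: result for '{query}' #{i}" for i in range(hits)]
    return search

ENGINES = [make_engine("EngineA", 2),
           make_engine("EngineB", 3),
           make_engine("EngineC", 1)]

def metasearch(query):
    """Query every engine concurrently and merge the hit lists."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = pool.map(lambda eng: eng(query), ENGINES)
    merged = []
    for hits in result_lists:
        merged.extend(hits)
    return merged

print(len(metasearch("butterfly migration")))  # 6 hits from 3 engines
```

A production metasearcher would also de-duplicate and re-rank the merged hits before rendering the single results page the user sees.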

ProFusion was already doing much of what Frame envisioned for the NBII search engine, and so last December USGS signed a memorandum of understanding to customize it. The BioBot site went live in February.

Under the agreement, ProFusion.com will host BioBot for one year. The site resides on four dual 500-MHz Pentium III Dell Computer Corp. servers running Linux, Gauch said.

The main BioBot page lets users choose from six search sites: www.nbii.gov, Biolinks.com, and the biology sections of AltaVista.com, GO.com, Snap.com and Yahoo.com. Searchers can ask BioBot to query the fastest three of the six options or the three with the best hits.

ProFusion keeps track of how quickly search engines are responding to select the fastest trio at any given moment, Gauch said. The engine's intelligent agent technology, dubbed Adaptive Search, monitors the results that users click on to determine which engines are the best.
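The two selection strategies reduce to simple rankings over engine statistics. In this sketch the response times and click counts are made-up illustrative numbers, not measurements from ProFusion.

```python
# Sketch of the two engine-selection strategies described above:
# pick the three engines with the lowest recent response times, or
# the three whose results users click on most often (Adaptive Search
# style). All timing and click data below is illustrative.
response_ms = {"NBII": 120, "Biolinks": 450, "AltaVista": 200,
               "GO": 800, "Snap": 300, "Yahoo": 150}
clicks = {"NBII": 40, "Biolinks": 5, "AltaVista": 25,
          "GO": 10, "Snap": 8, "Yahoo": 30}

def fastest_three(times):
    """Engines sorted by ascending response time; keep the top three."""
    return sorted(times, key=times.get)[:3]

def best_three(click_counts):
    """Engines sorted by descending click-through; keep the top three."""
    return sorted(click_counts, key=click_counts.get, reverse=True)[:3]

print(fastest_three(response_ms))  # ['NBII', 'Yahoo', 'AltaVista']
print(best_three(clicks))          # ['NBII', 'Yahoo', 'AltaVista']
```

The real system would refresh the timing table continuously and decay old click statistics so the rankings track current engine behavior.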

Users can ask BioBot to check the links it finds to make sure none is broken.
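The optional link check is a filtering pass over the result URLs. The status lookup here is a stub table; a real checker would probe each URL with an HTTP HEAD request and treat non-2xx responses as broken.

```python
# Sketch of the link-checking pass: probe each hit's URL and drop
# broken links. STATUS is a hypothetical stub for live HTTP checks.
STATUS = {"http://ok.example/a": 200,
          "http://gone.example/b": 404,
          "http://ok.example/c": 200}

def live_links(urls):
    """Keep only URLs that answered with HTTP 200."""
    return [u for u in urls if STATUS.get(u, 0) == 200]

print(live_links(list(STATUS)))  # the two 200-status links survive
```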

Users of the MyNBIIFilter section of BioBot start by creating their own password-protected accounts. Then they enter standing queries and specify how often they want searches conducted: daily, weekly, bimonthly or monthly.

Customized service

After each search, MyNBIIFilter asks whether each link it returned was relevant. By clicking one of the accompanying radio buttons, account holders can train BioBot to refine its searches.

For example, an entomologist studying the migration habits of butterflies would want to know about newly published scholarly papers but not about commerce sites selling butterfly-patterned wallpaper. Such profiling helps BioBot users deal with information overload, Frame said.
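One simple way to realize this kind of relevance feedback is to keep per-user term weights that the radio-button votes adjust. This is only a sketch of the general technique, not MyNBIIFilter's actual algorithm, and the butterfly data is illustrative.

```python
# Sketch of relevance-feedback profiling: terms from pages the user
# marks relevant gain weight, terms from rejected pages lose weight,
# and future hits are ranked by the learned weights.
from collections import defaultdict

weights = defaultdict(float)  # the user's learned term profile

def train(doc_terms, relevant):
    """Adjust term weights from one relevant/not-relevant vote."""
    delta = 1.0 if relevant else -1.0
    for term in doc_terms:
        weights[term] += delta

def score(doc_terms):
    """Rank a candidate hit by its accumulated term weights."""
    return sum(weights[t] for t in doc_terms)

# The entomologist accepts a migration paper, rejects a wallpaper ad.
train({"butterfly", "migration", "monarch"}, relevant=True)
train({"butterfly", "wallpaper", "decor"}, relevant=False)

paper_score = score({"monarch", "migration"})       # 2.0
wallpaper_score = score({"wallpaper", "butterfly"})  # -1.0
print(paper_score > wallpaper_score)  # True
```

After a few rounds of votes, scholarly hits outrank commerce sites for this user, which is the "information overload" relief Frame describes.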

"BioBot's not meant to be the only way people can get in and use the NBII," said NBII program manager Anne Frondorf, who works at USGS in Reston, Va. For people who want to browse around, the NBII home page links to sites with current topics of interest, such as amphibian deformities and invasive species.

Another part of NBII, the Metadata Clearinghouse, offers standards for describing biological data sets. NBII coordinates the standards effort with the Federal Geographic Data Committee and the National Spatial Data Infrastructure.

A related interagency effort, the Integrated Taxonomic Information System, seeks to standardize species names across scientific literature, databases and the Web. New species are being identified all the time, and it is important that they are not given conflicting scientific names, Frame said.

The Agriculture Department is the lead agency for ITIS, at www.itis.usda.gov. The ITIS database allows searching for species by scientific name, vernacular name or taxonomic serial number.

ITIS has been Web-enabled for some time, but USDA and NBII integrated it with BioBot only a few months ago, Frame said. Results of ITIS database searches include a link to BioBot, which subsequently can perform its own Web search if the user desires.

USGS is spending about $1 million per year on its contribution to NBII, which includes managing the Web site, Frondorf said.

Many of the participating agencies and organizations make in-kind contributions to the project.
