NSF funds database gymnastics
- By Joab Jackson
- Aug 13, 2004
The National Science Foundation has the urge to merge, at least when it comes to databases.
Commercial middleware products have not proved flexible enough for the full range of federal, state and local needs, said Sylvia Spengler, NSF program manager for science and engineering information integration and informatics.
Agencies need faster ways to 'hook into a database on the fly, grab the data and move on,' she said at a recent conference on digital government.
The agency's Middleware Initiative is currently funding 31 projects, collectively worth about $25 million, that aim to reach beyond commercial offerings. The agency's Directorate for Computer and Information Science and Engineering oversees the grants, and other directorates also have middleware projects under way.
At the conference, researchers from the University of Southern California showed how, using a Web services architecture, workers could query 11 databases through one interface.
Once an application is exposed as a Web service, other applications can easily draw from its resources, at least in theory.
To estimate how much freight traffic travels over Los Angeles freeways daily, for example, the researchers wanted to draw data from separate databases on everything from airport and port traffic to units of measurement, said Jose Luis Ambite, senior research scientist for the university's Information Sciences Institute.
Using open-source Protégé-2000 software, the team created an ontology, or list, of all the elements in each of the databases, and their relationships. Then they mapped out a complicated workflow for bringing in and manipulating the data. Once all the data sets were described in a uniform fashion, additional workflow models could be built on the fly.
Researchers from Virginia Polytechnic Institute and State University and Purdue University demonstrated how a state could offer multiple online social services through one application, even when the services involve different systems.
For a prototype, the developers borrowed historic data samples from Indiana's Family and Social Services Administration. Social services are fragmented across different Indiana departments, Virginia Tech's Athman Bouguettaya said, and case workers rely on phones and fax machines. Citizens applying for more than one service must travel from agency to agency.
The prototype bundles all the online social services through one applet, WebDG, which serves two sets of users.
Case workers see an expert menu, whereas citizens use Web portals, but both draw from the same databases.
The Virginia Tech-Purdue team built an ontology of multiple taxonomies, each describing a set of services. Because the Indiana data came from three dissimilar relational database management systems (Oracle, IBM Informix and ObjectStore from Progress Software Corp. of Bedford, Mass.), it had to be accessed through a Common Object Request Broker Architecture server, which encapsulates each database as a CORBA object.
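The idea of encapsulating dissimilar databases behind one common interface can be illustrated with a minimal sketch. This is not the WebDG code and does not use CORBA: plain Python classes stand in for the CORBA objects, in-memory SQLite databases stand in for the vendor systems, and every table name, column name and record below is made up for illustration.

```python
import sqlite3


class DatabaseWrapper:
    """Hides one database's table and column names behind a common method.

    A stand-in for the idea of wrapping each database as an object;
    the class and its fields are hypothetical.
    """

    def __init__(self, conn, table, id_column, benefit_column):
        self.conn = conn
        self.table = table
        self.id_column = id_column
        self.benefit_column = benefit_column

    def benefits_for(self, citizen_id):
        # Table and column names come from trusted configuration;
        # the citizen ID is passed as a bound parameter.
        sql = (
            f"SELECT {self.benefit_column} FROM {self.table} "
            f"WHERE {self.id_column} = ?"
        )
        return [row[0] for row in self.conn.execute(sql, (citizen_id,))]


def make_db(ddl, insert_sql, rows):
    """Create an in-memory database with one table and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two databases with different schemas (invented sample data).
food = make_db(
    "CREATE TABLE fs_case (cid TEXT, program TEXT)",
    "INSERT INTO fs_case VALUES (?, ?)",
    [("c1", "food assistance")],
)
health = make_db(
    "CREATE TABLE enrollment (person TEXT, coverage TEXT)",
    "INSERT INTO enrollment VALUES (?, ?)",
    [("c1", "health coverage"), ("c2", "health coverage")],
)

wrappers = [
    DatabaseWrapper(food, "fs_case", "cid", "program"),
    DatabaseWrapper(health, "enrollment", "person", "coverage"),
]


def all_benefits(citizen_id):
    """Query every wrapped database through the same method and merge results."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.benefits_for(citizen_id))
    return results


print(all_benefits("c1"))
```

A caseworker screen and a citizen portal could both call a function like `all_benefits` without knowing which system holds which record, which is the point of the wrapper layer.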
If agencies develop taxonomies for the information in their databases, they can use ontologies to merge the data sets across different databases, the researchers said. But mapping an individual taxonomy to a large, agencywide ontology is time-consuming work.
New mapping tool
So, with NSF funding, researchers at the University of Wisconsin have developed a tool called Agreement Maker, which automates the mapping of taxonomies to ontologies.
The researchers wanted to query geographic information from Wisconsin's Land Information System, a collection of local and county geographic information system servers.
Each jurisdiction has its own table and field names, or schema. That means users seeking data from multiple databases would have to download each data set separately, a lengthy job, said Nancy Wiegrand, a Wisconsin researcher.
The group fused the data sets by putting all of the elements within a single ontology using an automated tool. The tool searches for similarities among hierarchical lists of terms and presents the most probable matches.
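A rough sketch of what "presenting the most probable matches" between two term lists might look like follows. It is not the Agreement Maker algorithm, which the article does not describe. It simply scores term pairs with the string similarity measure in Python's standard `difflib` module, and the function names and sample terms are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lowercase a term and turn underscores and hyphens into spaces."""
    return term.lower().replace("_", " ").replace("-", " ").strip()


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity score for two normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_matches(local_terms, ontology_terms, top_n=2):
    """For each local term, list the top_n most similar ontology terms."""
    ranked = {}
    for term in local_terms:
        scored = sorted(
            ((similarity(term, candidate), candidate) for candidate in ontology_terms),
            reverse=True,
        )
        ranked[term] = [(candidate, round(score, 2)) for score, candidate in scored[:top_n]]
    return ranked


# Invented field names from one jurisdiction and invented ontology terms.
local = ["Road_Name", "parcel-id", "LandUse"]
ontology = ["road name", "parcel identifier", "land use", "owner"]

for term, candidates in rank_matches(local, ontology).items():
    print(term, "->", candidates)
```

Returning a short ranked list rather than a single answer leaves the final decision to a person, which fits the article's description of a tool that presents probable matches rather than asserting them.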
Researchers at the University of Southern California and Washington University of St. Louis also are developing algorithms to automatically match columns across different databases, using a technique borrowed from machine translation.
The work started as a way to aggregate Environmental Protection Agency air quality databases from 35 districts. The researchers wanted to match up database columns holding the same types of data, but under different names.
Over a period of four months, the group wrote a series of algorithms to compare columns in different databases. Data sets such as street addresses, phone numbers and ZIP codes are similar across databases, even if the column headers have dissimilar names.
'It works surprisingly well,' said Eduard Hovy, one of the California researchers.
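To illustrate how columns can be matched by their contents rather than their headers, here is a minimal sketch. It does not reproduce the researchers' algorithms or the machine translation technique the article mentions. Instead it uses a simpler, swapped-in idea: reduce each value to a "shape" (digits become 9, letters become A) and pair the columns whose shape distributions overlap most. All names and sample values are hypothetical.

```python
from collections import Counter


def shape(value: str) -> str:
    """Map digits to '9' and letters to 'A'; keep punctuation as is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Return the relative frequency of each value shape in a column."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {s: n / total for s, n in counts.items()}


def overlap(p, q):
    """Sum the shared probability mass of two shape profiles (0..1)."""
    return sum(min(p[s], q[s]) for s in p.keys() & q.keys())


def match_columns(table_a, table_b):
    """Pair each column in table_a with the best-overlapping column in table_b."""
    profiles_b = {name: profile(vals) for name, vals in table_b.items()}
    result = {}
    for name_a, vals_a in table_a.items():
        p = profile(vals_a)
        best = max(profiles_b, key=lambda name_b: overlap(p, profiles_b[name_b]))
        result[name_a] = (best, overlap(p, profiles_b[best]))
    return result


# Invented sample columns with dissimilar header names.
table_a = {
    "zip": ["90007", "63130"],
    "phone": ["213-555-0100", "314-555-0199"],
}
table_b = {
    "postal_cd": ["53706", "60607"],
    "tel_no": ["608-555-0123", "312-555-0142"],
}

for column, (match, score) in match_columns(table_a, table_b).items():
    print(column, "->", match, score)
```

Shape profiles work for rigidly formatted data such as ZIP codes and phone numbers. Free-text columns such as street addresses would need a richer comparison.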