This data warehouse creates a virtual Noah's Ark
TUCSON, Ariz. - Under the banner of the National Gap Analysis Project,
the Geological Survey has for the last 12 years been mobilizing an army of biological
detectives to discover plant and animal species that are endangered or are about to be,
that should be thriving but aren't.
Their detection tool is a data warehouse.
To work as USGS envisions, the warehouse needs detailed maps showing the distribution
of vertebrates and plants in the United States. Each map needs multiple layers showing not
only data about animals and vegetation but also about the land: latitude, longitude,
elevation, climate, rainfall and temperature range. Each state has its own Gap project.
The Arizona Gap warehouse draws from an unprecedented range of sources: century-old
handwritten cards about museum specimens; more than 100 databases from state fish and game
departments and from federal departments such as Defense and Agriculture; Global
Positioning System readings taken by University of Arizona graduate students, and even
information from elementary school students.
USGS' Biological Resources Division, which coordinates the project, is
piggybacking its research with the states.
The university's School of Renewable Natural Resources is home to Arizona Gap.
Graduate students in the Advanced Resources Technology (ART) program study geographic
information systems in cooperation with USGS ecologist Michael Kunzmann. Graduate students
generate some of the new data as well as normalize or transform data in legacy databases.
On the Web at http://nbii.srnr.arizona.edu/nbs/gap/gapdata.html,
citizens, researchers and policy-makers can ask questions, search the metadata repository
and retrieve information from more than 100 databases maintained by state, federal and
In the works is software to build a user profile database that will let individuals
subscribe to specific information. As new information is added or old information is
changed, the software will search across libraries and return to users only data that
affects the information in their profiles.
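The article does not describe how the planned profile software works. As a rough sketch of the idea only, the Python below matches a new or changed record against stored profiles and returns the users who should hear about it. The `Profile` fields, the record keys and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A hypothetical subscriber profile; an empty set means 'any'."""
    user: str
    species: set = field(default_factory=set)
    counties: set = field(default_factory=set)


def matching_users(record, profiles):
    """Return the users whose profile covers a new or changed record."""
    hits = []
    for profile in profiles:
        species_ok = not profile.species or record["species"] in profile.species
        county_ok = not profile.counties or record["county"] in profile.counties
        if species_ok and county_ok:
            hits.append(profile.user)
    return hits


profiles = [
    Profile("planner", counties={"Pima"}),
    Profile("birder", species={"canyon towhee"}),
    Profile("botanist", species={"saguaro"}),
]
update = {"species": "canyon towhee", "county": "Pima"}
print(matching_users(update, profiles))
```

Only the planner and the birder would be notified here; the botanist's profile does not overlap the updated record.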
One of the biggest problems in building such a vast warehouse, Kunzmann said, is
"there's not enough money and not enough scientists, and we cannot work fast
enough to monitor all the standards and keep up with changes in land cover."
Part of the solution, he said, is to get everyone from scientists to kindergartners to
contribute to the effort. "Even the most basic information about animal sightings is
useful," Kunzmann said. "It tells us where to look for the animals."
The resource suppliers contribute data via the same Web site that they visit to
retrieve data. A software harvesting agent presents them with a list of queries and
determines who they are, whether first-time contributors, students or peer-review
scientists. The agent keeps a user profile database. It asks what kind of data they want
to submit and which database the new data would augment.
Other harvesting tools maintain quality control by determining whether the data is in
range, compared with existing data.
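The article gives no detail on how the range check is done. One simple way to express it, assuming a numeric attribute such as elevation, is to accept a new value only if it falls within the span of existing values plus a small margin. The function name, the 10 percent margin and the sample figures are illustrative assumptions.

```python
def in_range(new_value, existing_values, tolerance=0.10):
    """Check a submitted value against the spread of existing data.

    The accepted window is the existing minimum to maximum, widened on
    each side by `tolerance` times that spread (an assumed rule).
    """
    low, high = min(existing_values), max(existing_values)
    margin = (high - low) * tolerance
    return low - margin <= new_value <= high + margin


# Hypothetical elevations, in meters, of earlier sightings of one species.
known_elevations = [900, 1200, 1500]
print(in_range(1540, known_elevations))  # slightly above the known maximum
print(in_range(3000, known_elevations))  # far outside the known spread
```

A value that fails the check is not necessarily wrong; it is a candidate for review by someone who knows the species.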
If the form is clear enough, contributors can even create the metadata themselves,
The new data goes to a database of harvested resources, maintained in the ART computer
laboratory. It also goes to a broker knowledge base on Gap's Sun Microsystems Ultra
Enterprise 4000 parallel server with 4G of RAM and 38G of storage, where Kunzmann and his
grad students fine-tune old models and create new ones.
Each database supplier maintains its database and can draw newly harvested data from
the Gap store. That part of the process is not automated, however. "You need a person
trained in library science and the disciplines, such as ornithology, to make
judgments," Kunzmann said.
Databases from some suppliers, such as DOD, are generalized to different levels,
depending on the user. A fairly fuzzy level of data is freely available to all. Particular
versions are available to a list of contractors or agencies supplied by DOD.
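The article does not say how the data is generalized. A common approach for location data, shown here purely as an assumption, is to round coordinates to fewer decimal places for less-trusted audiences. The tier names and precision values are hypothetical.

```python
# Decimal places of latitude/longitude released to each audience (assumed).
# One decimal place of latitude is roughly 11 km on the ground.
PRECISION_BY_TIER = {"public": 1, "contractor": 3, "owner": 5}


def generalize(latitude, longitude, tier):
    """Coarsen a coordinate pair according to the requester's access tier."""
    digits = PRECISION_BY_TIER.get(tier, PRECISION_BY_TIER["public"])
    return round(latitude, digits), round(longitude, digits)


site = (32.23167, -110.94861)  # a sample point in the Tucson area
print(generalize(*site, "public"))
print(generalize(*site, "contractor"))
```

Unknown tiers fall back to the public level, so a misconfigured account sees less detail rather than more.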
Maps showing archaeological sites and endangered species are similarly limited.
Problems have arisen, for example, from release of nesting information about peregrine
Database owners now must request the newly harvested information but eventually will be
able to subscribe.
Each of the supplier databases is replicated on Arizona Gap's Web and GIS server,
a Sun UltraSparc 60 with 256M of RAM and 37.2G of storage.
Getting the information out to users who have ordinary telephone lines can be
"We did tapes, but the formats keep changing," Kunzmann said. "We
don't know whether we're dealing with 4-mm or 8-mm or quarter-inch
cartridge, or what operating system is involved."
Two graduate students work half-time to transform legacy data sets and update the
metadata to comply with Federal Geographic Data Committee metadata standards set in 1994.
"Our biggest problem isn't hardware or software, and it isn't how easy
the data is to retrieve," Kunzmann said. "If you don't have good quality
control and good metadata, you cannot rely on the quality. The whole thing is
Metadata even covers such seemingly marginal information as what vehicle the data
"A researcher on horseback in a national park or forest would find more grassland
birds than one in an automobile because the horses would flush the birds out,"
Data also is subject to constant change. When ornithologists recently split one bird
species into two species, each instance of the brown towhee and its Latin name had to be
changed in each database to the canyon or California towhee.
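A split like this cannot be handled with a plain find-and-replace, because each old record has to be assigned to one of the two new species. The sketch below assumes a state field on each record and assigns the new name by geography. The state rule is a deliberate simplification, the record layout is hypothetical, and the Latin names are those commonly cited for the split and should be checked against a current checklist.

```python
# New names for records formerly filed as "brown towhee".
NEW_NAMES = {
    "california": ("California towhee", "Pipilo crissalis"),
    "canyon": ("canyon towhee", "Pipilo fuscus"),
}
# Simplified assumption: Pacific-coast records become California towhee,
# everything else becomes canyon towhee.
PACIFIC_STATES = {"CA", "OR"}


def reclassify(record):
    """Return a copy of the record with post-split names applied."""
    if record["common_name"].lower() != "brown towhee":
        return record
    key = "california" if record["state"] in PACIFIC_STATES else "canyon"
    common_name, latin_name = NEW_NAMES[key]
    return {**record, "common_name": common_name, "latin_name": latin_name}


old = {"common_name": "brown towhee", "latin_name": "Pipilo fuscus", "state": "AZ"}
print(reclassify(old))
```

Records without usable location data would still need a person to make the call.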
Some legacy data sets were produced in programs such as the Geographic Resources
Analysis Support System, a Unix GIS tool from the Army Corps of Engineers, and Idrisi, a
GIS created at Clark University in Massachusetts. Others were created in or translated
into formats compatible with Arc/Info for Unix or Microsoft Windows NT from Environmental
Systems Research Institute Inc. of Redlands, Calif.
"You can translate from one format to another," Kunzmann said, "but you
don't get the same look and feel, and there's not necessarily a one-to-one
relationship in what's presented and how."
When a user goes to the Web site and queries the warehouse, a computer tool applies a
set of rules to determine the best algorithms to retrieve the data.
The tool works like the nearest-neighbor algorithms used in library programs, which
take into account misspellings and other irregularities, he said.
For example, a user might ask for plant communities. The tool might respond that it
found no plant communities but did find vegetation communities. Or a user might ask for
imagery about a particular bird, when what he wanted was a photo of it. The software would
ask whether he wanted satellite imagery or photographs.
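The article does not name the matching technique beyond the comparison to library programs. As a stand-in, this sketch uses approximate string matching from Python's standard `difflib` module plus a small table of ambiguous terms. The vocabulary and the ambiguity table are invented for illustration.

```python
from difflib import get_close_matches

# Hypothetical catalog terms and ambiguous words that need a follow-up question.
VOCABULARY = ["vegetation communities", "land cover", "satellite imagery",
              "photographs", "elevation"]
AMBIGUOUS = {"imagery": ["satellite imagery", "photographs"]}


def suggest(term, vocabulary=VOCABULARY, cutoff=0.6):
    """Return catalog terms to offer the user for a query term."""
    key = term.strip().lower()
    if key in AMBIGUOUS:
        return AMBIGUOUS[key]      # ask the user which meaning was intended
    if key in vocabulary:
        return [key]               # exact hit, nothing to clarify
    # Otherwise offer the closest terms by character similarity.
    return get_close_matches(key, vocabulary, n=3, cutoff=cutoff)


print(suggest("plant communities"))
print(suggest("vegitation comunities"))
print(suggest("imagery"))
```

Character similarity catches misspellings and near-synonyms that share wording, but a true synonym list would be needed for terms with nothing in common on the page.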
A query triggers a Common Gateway Interface script that locates the data sets, starts
ESRI's Arc/View 3.0 for SunSoft Solaris with Spatial Analyst and other ESRI
extensions, and loads the data sets. Images then go back to the requester.
"We cannot count on everyone having the software," Kunzmann said, "and there
is no funding to buy licenses needed to let users manipulate data online."
When a user requests entire data sets, it triggers another CGI script, which uploads
from the correct database to a File Transfer Protocol site on the Web server. The user
then downloads the data from the FTP site.
"If our budget was based on how much information we give away, I'd have a
great budget," Kunzmann said. Data sets can be as large as several gigabytes and take
hours to download. The ART lab's 100Base-T LAN connects to the university's
fiber-optic 100Base-T WAN and uses its three to four available T1 lines.
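A quick back-of-the-envelope calculation shows why downloads take hours. It uses the nominal T1 rate of 1.544 megabits per second and assumes a 2-gigabyte data set, no protocol overhead, and that the user's own connection is not the bottleneck. Those assumptions make this a best case.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def download_hours(size_gigabytes, t1_lines=1):
    """Best-case transfer time in hours over one or more T1 lines."""
    bits = size_gigabytes * 1_000_000_000 * 8
    seconds = bits / (t1_lines * T1_BITS_PER_SECOND)
    return seconds / 3600


for lines in (1, 3, 4):
    hours = download_hours(2, lines)
    print(f"2 GB over {lines} T1 line(s): {hours:.1f} hours")
```

Even with all four lines free and perfectly shared, a 2-gigabyte transfer takes the better part of an hour; over a single line it takes a bit under three hours.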
Early this month, the Gap Web server hard drive crashed irretrievably. The data was
safely backed up on tape, but not the operating system or the CGI scripts or the
subdirectories, Kunzmann said.
Because there is no money for a mirrored server, Kunzmann spent the two weeks after he
got a new drive restoring the operating system, scripts, subdirectories, software and
Lack of hardware is a problem, he said. The warehouse has grown huge and slow, and the
demand is fast outgrowing the ART lab's resources. Lack of staff is pinching even
Average service time for a grad student is nine months, Kunzmann said. He can pay $180
per week, but industry will pay $50,000 per year.
The project is fulfilling its mission, helping preserve plants and animals, but
it's also fast becoming a crucial tool in land planning.
"A yellow area might call for informal consultation. A red area would mean,
'You're going to have to have a Section 7 [document], a consultation with the
Environmental Protection Agency, and there'll be public hearings,'" Kunzmann