In Good Order

High-level federal taxonomies and other data organizing resources


Congressional Research Service's legislative indexing vocabulary, at thomas.loc.gov/liv/livtoc.html

Genomics taxonomies, at www.genomicglossaries.com, maintained by Cambridge Healthtech Institute of Newton Upper Falls, Mass.

Library of Congress subject headings and classification schemes, at www.loc.gov/catdir/cpso/lcco/lcco.html

NASA's thesaurus of terms for indexing and retrieving documents, at www.sti.nasa.gov/thesfrm1.htm

Patent Classification System, a major taxonomy developed by the Patent and Trademark Office

Superintendent of Documents' classification system, known as SuDoC, at www.gl.iit.edu/govdocs/sudoc.html

Universal Data Element Framework home page, at www.udef.com

XML Working Group, at xml.gov

GCN Photo by Ricky Carioti

Jan Herd, a business reference librarian in the Science, Technology and Business Division at the Library of Congress, gained experience with modern forms of taxonomy from working on e-commerce projects.

Taxonomy puts electronic content in its place

Taxonomy can matter as much in an e-government project as roof trusses do in a building. The word is familiar to biologists and library scientists, and it means the same thing in IT: a hierarchical framework for organizing data.
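To make "hierarchical framework" concrete, here is a minimal sketch in Python. It assumes a taxonomy small enough to hold as nested dictionaries; the category names and the find_path helper are hypothetical, invented for illustration, and not drawn from any agency's actual scheme.

```python
# Hypothetical three-level taxonomy stored as nested dictionaries.
# The category names are illustrative only, not an official scheme.
TAXONOMY = {
    "Government": {
        "Natural Resources": {
            "Wildlife": {},
            "Forestry": {},
        },
        "Commerce": {
            "Patents": {},
            "Statistics": {},
        },
    },
}


def find_path(tree, target, trail=()):
    """Return the categories from the root down to target, or None if absent."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == target:
            return list(here)
        # Depth-first search through the narrower categories.
        found = find_path(children, target, here)
        if found is not None:
            return found
    return None


print(" > ".join(find_path(TAXONOMY, "Patents")))
```

Run as written, the sketch prints Government > Commerce > Patents, the kind of breadcrumb trail a portal might display above a page.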

Paperless government has brought the need for consistent, interoperable taxonomies to the forefront of IT. People use taxonomies on portals and search engines every day without realizing it, said Jan Herd, a business reference librarian in the Science, Technology and Business Division of the Library of Congress.

Ranking data by categories makes it easier to find, she said, and workers in fields ranging from medicine to patent law rely on categories originally developed by federal agencies.

Thousands of taxonomies exist, Herd said. They date back to the third century B.C., when the Greek scholar Kallimachos developed the first hierarchical subject ranking for the lost library at Alexandria, Egypt.

Herd said taxonomy is most often associated with 18th-century Swedish botanist Carolus Linnaeus, who developed a biological system for naming, ranking and classifying organisms.

To programmers, taxonomy has come to mean 'a high-level information search device constructed to understand and navigate intellectual capital,' Herd said.

Classification systems can be made up of words, letters, numbers or some combination. For example, the numerical codes of the North American Industry Classification System (NAICS) are replacing the older Standard Industrial Classification (SIC) system.

The Commerce Department and other statistical agencies developed NAICS partly because of U.S. trade agreements with Canada and Mexico and partly because many new types of businesses have sprung up since SIC codes were invented in the 1930s.

'There are new industries that couldn't be described by the old codes because they didn't exist' then, Herd said.

XML schema

In a database or search engine, users can perform queries without knowing anything about the underlying taxonomy. But taxonomy becomes more obvious and useful when it is implemented through an Extensible Markup Language schema, said Owen Ambur, a systems analyst at the Fish and Wildlife Service.

People as well as software can use the XML schema to associate metatags with specific content. The best taxonomy for e-government is one that makes sense to ordinary citizens, said Ambur, co-chairman of the CIO Council's XML Working Group.
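As a rough illustration of associating metatags with content, the sketch below uses Python's standard xml.etree.ElementTree module. The element names (document, metadata, category), the path attribute and the sample text are all assumptions made for this example, not part of any government schema. The standard library does not validate against an XML schema, so this shows only the tagging step.

```python
import xml.etree.ElementTree as ET


def tag_document(title, body, categories):
    """Wrap content in XML and attach one <category> metatag per taxonomy path."""
    doc = ET.Element("document")          # hypothetical element names throughout
    meta = ET.SubElement(doc, "metadata")
    ET.SubElement(meta, "title").text = title
    for path in categories:
        ET.SubElement(meta, "category", {"path": path})
    ET.SubElement(doc, "body").text = body
    return doc


def categories_of(doc):
    """Read the taxonomy paths back out of a tagged document."""
    return [c.get("path") for c in doc.findall("./metadata/category")]


doc = tag_document(
    "Refuge visitor hours",               # illustrative content only
    "The refuge is open daily.",
    ["Recreation/Wildlife Refuges"],
)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
print(categories_of(ET.fromstring(xml_text)))
```

Because the category travels with the content as plain markup, both a person reading the file and a program parsing it can see where the item sits in the taxonomy.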

The government's experience in replacing the SIC system with NAICS shows how taxonomies require frequent updating. They are 'a labor of love,' Herd said.

Information specialists are also finding taxonomies a useful structure for mapping or transforming data from one system to another.

The Census Bureau, which is organizing its 2002 Economic Census around NAICS codes, has published a bridge between the SIC and NAICS taxonomies at www.census.gov/epcd/ec97brdg.
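A bridge of this kind can be pictured as a lookup table from old codes to new ones. The Python sketch below assumes a one-to-many table; the codes are placeholders invented for illustration, not real SIC or NAICS values, and the helper names are hypothetical.

```python
# Hypothetical bridge table. "OLD-" and "NEW-" codes are placeholders,
# not actual SIC or NAICS codes. One old code may split into several new ones.
BRIDGE = {
    "OLD-100": ["NEW-1001"],
    "OLD-200": ["NEW-2001", "NEW-2002"],
}


def translate(old_code):
    """Return the new codes for an old code; raise KeyError if unmapped."""
    try:
        return list(BRIDGE[old_code])
    except KeyError:
        raise KeyError(f"no bridge entry for {old_code!r}") from None


def needs_review(old_code):
    """An old code that splits into several new codes needs a human decision."""
    return len(translate(old_code)) > 1


for code in BRIDGE:
    print(code, "->", translate(code), "review" if needs_review(code) else "direct")
```

Flagging the one-to-many cases separately reflects the fact that a record filed under a split category cannot be reassigned by lookup alone.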

Herd called the site a good example of data interchange. Mapping from one taxonomic system to another is easier with numbers or a combination of letters and numbers, she said. Mapping categories of words is a more subtle task.

For example, does the word stock refer to livestock, soup stock, warehouse stock or company stock? The Agriculture and Defense departments and the Securities and Exchange Commission might apply the same word to vastly different objects.
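One way to see why words are harder than codes is that the lookup needs a second key, the community using the term. The sketch below is a minimal Python illustration under that assumption; the context labels and the sense assigned to each are hypothetical pairings chosen for the example, not definitions used by any agency.

```python
# Hypothetical sense table keyed by (user community, term).
# The pairings are assumptions made for illustration.
SENSES = {
    ("agriculture", "stock"): "livestock",
    ("defense", "stock"): "warehouse inventory",
    ("securities", "stock"): "company shares",
}


def resolve(term, context):
    """Return the sense of a term within a community, or None if unknown."""
    return SENSES.get((context.lower(), term.lower()))


for community in ("agriculture", "defense", "securities", "cooking"):
    print(community, "->", resolve("stock", community))
```

A term with no entry for a given community comes back as None, which signals that a person has to decide what the word means there.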

As agencies develop different flavors of XML for different user communities, such as law enforcement, e-commerce or human resources, mapping gets thornier.

Aerospace companies are rallying round the Universal Data Element Framework (UDEF) of unique alphanumeric identifiers for data elements.

Ron Schuldt, a senior systems architect with Lockheed Martin Corp.'s Enterprise Information Systems who spearheaded the development of UDEF, said the framework eliminates conflicts between legacy electronic data interchange systems and successors. The successors include ebXML, an XML variant designed for e-commerce, and a host of proliferating data standards.

The Aerospace Industries Association has adopted UDEF and is working with DOD officials on a metadata harmonization project, Schuldt said.

Several companies have developed software that automates some taxonomic indexing and categorization.

For example, QKS Classifier, a hybrid taxonomy platform from Quiver Inc. of San Mateo, Calif., combines automated categorization with human judgment. The platform's components are a categorization engine, a directory manager and an output interface that produces an XML feed for a portal or another application.

QKS Classifier is written in Java and runs under Microsoft Windows 2000, said Roz Chapman, Quiver's senior director of corporate marketing. Its administration tools let users import existing industry-specific taxonomies or create their own.

Other vendors of categorization software include Autonomy Inc. of San Francisco, Semio Corp. of San Mateo, Calif., and Vivisimo Inc. of Pittsburgh.

2-D mapping

Recently, some agencies and companies have experimented with visual mapping of categories in a 2-D picture instead of a linear hierarchy.

Herd said the National Library of Medicine used the 2-D technique for a visual representation of its PubMed database, at pubmed.antarcti.ca/start. The Web site uses VisualNet 2.0 software from Antarcti.ca Systems Inc. of Vancouver, British Columbia.

Despite the growth of automated categorization, Herd cautioned that artificial intelligence is too immature to make final taxonomic judgments without human input. She urged portal and Web site developers not to start classification efforts from scratch but instead to try to find an existing taxonomy that suits agency needs.

As a reminder that taxonomies cannot be set in stone, she said, look at the government organization manual, itself a rigorous taxonomy, and see how departments and agencies have changed over time.
