To tag or not to tag
- By Joab Jackson
- Jan 05, 2006
'The issue is what do you want to get out of the search?'
Brand Niemann, Federal Semantic Interoperability Working Group
'Metadata is associated to some piece of data or information and tells us about its characteristics.'
Eliot Christian, former chairman, Federal Interagency Committee on Government Information
Last fall, federal content managers in Washington buzzed over the issue of metadata. Should agencies take the trouble to include metadata with the content they produce? Or is it unnecessary, a quaint practice to be rendered obsolete by the exponentially growing power of computer processors? It's a question that will resonate throughout 2006.
The source of all the excitement was a request for information issued jointly by the Office of Management and Budget and the General Services Administration. The RFI queried industry about the current state of search technology. In particular, it raised the question of whether metadata tags for electronic documents were really necessary. Specifically, the RFI asked, 'Does current search technology perform to a sufficiently high level to make an added investment in metadata tagging unnecessary in terms of cost and benefit?'
The responses were mixed, to say the least. Approximately 56 percent of respondents saw no need for preparing material for search engines, according to a GSA summary of the RFI results issued in December. The commercial search engine community is confident that its products can answer queries without hours of manually cataloging the material ahead of time. Call it the brute-force approach; let big machines ingest everything and then match it to your needs on the fly. And with Moore's Law upping processor power at a prodigious rate, search may well exponentially improve as time goes on.
Forty-four percent of the RFI's respondents, however, felt differently (respondents to the RFI included industry experts, agency officials and academics). They insisted that documents need to be annotated before they are posted for public consumption, whether by adding tags identifying their context, by creating controlled vocabularies for their classification or by manually cataloging the items.
'Metadata can improve search, if it's done using a structured vocabulary by trained experts,' said Chris Sherman, associate editor for the Search Engine Watch Web site. 'We're seeing computers increasingly be able to do categorization automatically, rather than relying on human taxonomies.
'There are compelling arguments on both sides, and I'm sure we haven't heard the last of them,' Sherman said.

Data about metadata
'Metadata is associated to some piece of data or information and tells us about its characteristics,' said Eliot Christian, former chairman of the Federal Interagency Committee on Government Information.
In the precomputer days, knowledge seekers used the metadata in card catalogues, which provided details about what sorts of information a volume might contain.
Today, metadata can be as simple as the statistics automatically appended to a file by an application, such as the time the file was created. Or it can be manually generated by employees describing documents they create. All of this metadata can help you, or your computer, find things faster.
The downside to metadata is that it can be expensive to create, at least the kind that has to be created manually. Each element must be classified, described and cataloged, which requires extra time and effort on the part of government employees.
With powerful search engines such as Google, do agency officials need to spend extra time marking up every document they produce?
'The issue is what do you want to get out of the search?' said Brand Niemann, co-chairman of the Federal Semantic Interoperability Working Group. 'No one argues that the more metadata you have, the better the search results you will get.'
But there's an inherent trade-off between the cost of tagging and the value of results. The RFI included a number of hypothetical search scenarios, most of them quite complex. They involved searches consulting a wide range of sources: geospatial data, images, audio, Web sites, e-mail and plain documents.
One scenario, for instance, involved an individual trying to track the flow of information as it goes from one agency to the next, noting how accurate and timely that information was.
Niemann said that he doubts that the scenarios GSA posed could be carried out without some form of metadata tagging. His group is in the process of submitting an informational paper to GSA. 'What they are looking for is not search, but knowledge computing,' Niemann said.
Others are not so sure metadata is entirely necessary for good data searches. Except in special cases that involve something like images or videos, metadata tagging is 'nice-to-have,' but not essential, said Jerome Pesenti, chief scientist for Vivisimo Inc. of Pittsburgh.
GSA recently awarded Vivisimo a contract to run the FirstGov search service. Vivisimo will use the MSN search service from Microsoft Corp. to scan government sites. Vivisimo's own clustering software will then take the results and bundle them into subject groups on the fly, making it easier for users to find the needed results.
'Taxonomy building and metadata generation is expensive and laborious,' said Raul Valdes-Perez, Vivisimo chief executive officer. Valdes-Perez said he has seen many metadata projects get bogged down and never be completed.
Vivisimo looks at ways to extract metadata-type information inherent in the documents themselves. For instance, titles of documents are a key piece of information that describes what a document is about.
Also, the document's location can provide context. Is it on a Web site of an academic institution? What is the name of the directory the file resides in? All can provide useful context.
The Internet search engine firm Google Inc. of Mountain View, Calif., does not require metadata tagging, though the company encourages Web site owners to add contextual keywords to their Web pages for more accuracy. Google evaluates the relevance of each page to a given search based on the number and type of other pages that link to that page.
Officials from at least one government search service agree with this non-interventionist approach.
The Energy Department's Office of Scientific and Technical Information, which hosts the Science.gov search service, said that 'search technology has progressed far enough so that manual categorization and metadata tagging of textual documents is no longer necessary and any perceived gain in accessibility does not justify the cost of categorization.'

Beyond searching
While Google's approach works well with the reference-heavy World Wide Web, it has not worked so well for government information in the past, Christian said.
Government agencies have a lot of internal links (one agency page linking to another page within that agency's site), which Google's algorithms tend not to handle effectively.
Mike Frame, deputy center director for the Geological Survey's Center for Biological Informatics, demonstrated the shortcomings of Web-styled searches at the most recent meeting of the federal Web Managers Advisory Council.
He compared the results of the query term 'alien species' when entered in the Google search engine to those returned by a government metadata-driven search engine, the BioBot Search tool, which indexes the National Biological Information Infrastructure, a collection of U.S. biological-related information. The term refers to animals that are not native to a geographic area but are placed there by human intervention.
In this comparison, Google did indeed return a cornucopia of pages about alien species. BioBot did as well, but also returned pages about 'invasive species,' a term referring to a similar, though not identical biological phenomenon.
Because BioBot consults a thesaurus of biological terms, it was able to return results of related concepts, not just those results that matched the terms exactly.
Frame concluded that while commercial search tools are a useful means of finding general information, they lose viability as searches come to rely more on pre-existing contextual frameworks.
A search engine relying on unstructured information cannot interpret the context of the user's search. It knows what a document contains, not what it is about.
In response to the RFI, the National Archives noted that metadata could also provide essential pointers to the quality of the data, something the data itself cannot convey.
'For either information or records to be trustworthy, they must have additional information either embedded within the content itself or information associated with the content that can provide some degree of assurance of authenticity, reliability and integrity, now and in the future,' the National Archives wrote.
Metadata can also play a role in helping computers themselves sort out data, a task some of the more advanced scenarios in the OMB/GSA RFI alluded to.
Today there's considerable development going on in the field of inference engines and other forms of semantic-based software. Web services companies are looking into how an application can infer new knowledge, in effect refining its task according to the metadata it finds.
To accomplish this kind of advanced thinking, 'the metadata has to be machine-processable,' Niemann said. In other words, good metadata could help the government better use these new tools when they hit the marketplace.
'We're talking about something way beyond conventional tagging on HTML pages,' Niemann said. 'We're talking about repurposing content so that it is semantically computable.' In other words, it appears metatagging won't go away just yet.