The promise and perils of Data.gov lie in the metadata
If metadata collection takes place after the fact, it will not succeed
Virtual-machine technology was tried many times before it took off. Most notably, it was tried in 1966 with O-code for the Basic Combined Programming Language (BCPL), then again in 1978 with p-code for the UCSD P-system. It finally gained widespread popularity with the Java virtual machine in 1995.
Similarly, Web applications have been around since 1993 with the Common Gateway Interface. However, only since the advent of Google Maps in 2005 have Web applications become the dominant programming platform.
The point is that some technologies take time to mature. That is the case with metadata catalogs (also known as metadata registries) such as the Environmental Protection Agency’s System of Registries and, potentially, the Obama administration’s Data.gov initiative. Like those other technologies, metadata catalogs have been tried in many forms, but only now are we beginning to see a convergence of tools, processes and principles to make them successful. I will illustrate a few of those principles by comparing EPA’s approach with that of Data.gov.
I recently attended EPA’s System of Registries Conference with some National Oceanic and Atmospheric Administration employees with the hope of capturing and transferring lessons learned. EPA is using an approach to metadata collection in which multiple registries cooperate to create a cohesive whole. For EPA, the registries cover systems, services, substances, components, terminology and facilities. They work together and reinforce one another as a single, interrelated enterprise resource.
Data.gov also cooperates with existing agency Web sites and catalogs instead of trying to host everything. In that capacity, it acts as a registry of registries or a metadata traffic cop for the federal government. The first principle, then, is that metadata collection can mature and evolve over multiple iterations as organizational processes, systems and terminology change. That notion reflects the nature of metadata and leads us to our next topic: metadata design.
As I detailed in a previous commentary, titled “Metadata’s new name is TED,” metadata design involves understanding how metadata represents Targeted Extrinsic Description. The process involves capturing descriptive characteristics about something, such as an information technology system or process, that assist in its use or management. A system of registries is a good example of metadata design in that each registry has a specific target or focus. By not trying to develop a single, generic metadata registry, you avoid bloating the number of characteristics you attempt to collect.
The second key principle involves targeting your metadata design. One of the best ways to target a registry is to focus on consumer-oriented outcomes such as discovery. Discovery typically means answering a question such as, “What authoritative data source (or IT system) stores data on Chemical Substance X?” We see the Obama administration tentatively testing that principle by actively seeking feedback from users about what they want from Data.gov. That is a Web 2.0 method for determining which consumer-oriented outcomes to focus on, and it will certainly put the wisdom of crowds to the test.
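To make the discovery question concrete, here is a minimal Python sketch of a targeted registry lookup. It is only an illustration: the `SourceRecord` fields, the `find_authoritative_sources` function and the sample entries are my own hypothetical names and made-up data, not the actual schema or contents of EPA's registries or Data.gov.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One hypothetical registry entry describing a data source or IT system."""
    name: str
    subjects: frozenset  # topics the source stores data on
    authoritative: bool  # whether the source is the designated system of record


def find_authoritative_sources(records, subject):
    """Return the sorted names of authoritative sources that cover `subject`."""
    wanted = subject.strip().lower()
    return sorted(
        r.name
        for r in records
        if r.authoritative and wanted in {s.lower() for s in r.subjects}
    )


# Made-up sample entries, for illustration only.
REGISTRY = [
    SourceRecord("System A", frozenset({"Chemical Substance X", "Chemical Substance Y"}), True),
    SourceRecord("System B", frozenset({"Chemical Substance X"}), False),
    SourceRecord("System C", frozenset({"Facility permits"}), True),
]

print(find_authoritative_sources(REGISTRY, "chemical substance x"))
```

The record carries only the few characteristics needed to answer the one question, which is the point of targeting: the design starts from the consumer's query rather than from everything that could be collected.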
Besides the promising indicators of a holistic, system-of-registries approach and targeted metadata design, there is also convergence around the best practices for avoiding the pitfalls of weak governance and integration. Metadata registries must be integrated into everyday business processes, and IT-centric metadata must be integrated into the system development life cycle.
I have seen metadata registries fail because that principle was ignored. If metadata collection takes place after the fact, it will not succeed because it will not accurately reflect the current state of the organization. That is probably the most serious peril facing Data.gov.
Additionally, a system of registries will require a single overarching metamodel to ensure that the registries actually cooperate and act as a single system. What we don’t need is to replace data stovepipes with metadata stovepipes.
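As a rough sketch of what a shared metamodel buys you, the Python below has every registry describe its items with the same hypothetical `Entry` shape and a common identifier scheme, so cross-registry links can be checked in one place. The class names, fields and the `registry:local_id` convention are assumptions made for illustration, not a description of how EPA's System of Registries is built.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical shared metamodel: every registry describes items this way."""
    registry: str   # which registry owns the item, e.g. "systems"
    local_id: str   # identifier within that registry
    label: str
    related: list = field(default_factory=list)  # global ids in any registry

    @property
    def global_id(self):
        return f"{self.registry}:{self.local_id}"


class SystemOfRegistries:
    """Holds entries from all registries under one identifier scheme."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.global_id] = entry

    def resolve(self, global_id):
        """Return the entry for a global id, or None if it is not registered."""
        return self._entries.get(global_id)

    def dangling_links(self):
        """List (source, target) pairs whose target is not registered anywhere."""
        return [
            (e.global_id, target)
            for e in self._entries.values()
            for target in e.related
            if target not in self._entries
        ]


sor = SystemOfRegistries()
sor.register(Entry("substances", "x", "Chemical Substance X"))
sor.register(Entry("systems", "a", "System A",
                   related=["substances:x", "facilities:42"]))

print(sor.dangling_links())
```

A link that resolves shows the registries cooperating; a dangling link is an early sign of a metadata stovepipe, because one registry refers to something the others know nothing about.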