Another View: Can government integrate all its databases?
- By Katherine Hammer
- May 30, 2002
Interagency data integration is a top priority now, particularly among intelligence agencies. But the data is distributed across a staggering number of databases on heterogeneous platforms.
One example of such a massive governmental data integration project is Australia's Centrelink. In 1997, 40 data systems in 14 agencies were integrated into a data warehouse that is currently growing at a rate of 2.5 terabytes a week. One source consisted of a file with 1.25 billion records, each containing 3,000 fields.
But Australia has only 19.4 million citizens; the United States has 285 million. A similar data warehouse for benefits information from, say, the Social Security Administration, Medicare and the Veterans Affairs Department would exceed the capacity of even the most powerful commercially available database program.
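A back-of-envelope sketch of that scale gap, using only the figures quoted above. Two assumptions are purely illustrative: that "2.5T" means terabytes, and that warehouse growth scales in proportion to population. Neither is a claim made by the Centrelink project.

```python
# Figures quoted in the article.
AUSTRALIA_POPULATION_M = 19.4   # millions of citizens
US_POPULATION_M = 285.0         # millions of citizens
CENTRELINK_GROWTH_TB_PER_WEEK = 2.5  # assumes "2.5T" means terabytes

# Assumption: warehouse growth scales linearly with population.
population_ratio = US_POPULATION_M / AUSTRALIA_POPULATION_M
us_growth_tb_per_week = CENTRELINK_GROWTH_TB_PER_WEEK * population_ratio

print(f"Population ratio: {population_ratio:.1f}x")
print(f"Implied U.S. growth: {us_growth_tb_per_week:.1f} TB per week")
```

Under those assumptions, a comparable U.S. warehouse would grow roughly 15 times as fast as Centrelink's.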
Human services information consists of fairly straightforward transactional data, such as births, deaths and applications for benefits. In this domain, it is relatively easy to define a meaningful event. Intelligence agencies capture far more complicated data. They gather vast amounts of random, proprietary and unstructured information, hoping some pattern will emerge. Necessarily, these agencies must access the data through proprietary interfaces and reporting systems.
For years, industry has been developing technology to obtain and analyze such data from e-mail, satellite images and speech. But the integration of this technology with commercial database management systems is still in its infancy.
New Extensible Markup Language products use rules to index and store XML documents for rapid retrieval using search engines. Some large database vendors are adding this capability to their core products.
But these products are far from complete solutions. The effectiveness of the search capability depends in part on the rules used to create the indices. Specifying these rules will require agencies' programmers to have expertise in the product and the domain itself.
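A minimal sketch of why the indexing rules matter, using Python's standard XML parser rather than any particular vendor product. The rule names, element names and sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical indexing rules: index name -> ElementTree path expression.
INDEX_RULES = {
    "author": ".//author",
    "subject": ".//subject",
}

def build_index(documents, rules):
    """Return {index_name: {term: set(doc_ids)}} for the given rules."""
    index = {name: defaultdict(set) for name in rules}
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for name, path in rules.items():
            for element in root.findall(path):
                text = (element.text or "").strip().lower()
                if text:
                    index[name][text].add(doc_id)
    return index

# Hypothetical sample documents.
DOCUMENTS = {
    "doc1": "<report><author>Lee</author><subject>Shipping</subject></report>",
    "doc2": "<report><author>Lee</author><body>Mentions shipping only here</body></report>",
}

index = build_index(DOCUMENTS, INDEX_RULES)
print(sorted(index["author"]["lee"]))
print(sorted(index["subject"]["shipping"]))
```

Both documents are found by author, but a search on the subject index misses the second one because no rule covers its body element. Choosing rules that avoid such gaps takes knowledge of both the product and the domain.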
Finally, unstructured data captured in intelligence systems is multimedia, including audio, graphic and text files. It could take several years for multimedia databases to mature sufficiently to provide both the performance and sensitivity required to execute complex queries.
Legal and security issues also impede integration. For privacy or security reasons, some agencies are prohibited by law from sharing information. Yet, as investigations into pre-Sept. 11 events have shown, it is critical to develop systems for the sharing of information across agencies, regardless of bureaucratic agendas.
Given the challenges, the best means of interagency data integration is likely a virtual warehouse. It would need to let agencies characterize the information they seek without making them privy to the underlying structure or data gathering process.
The users of this virtual warehouse would have a query tool to pose questions against an abstract data model (that is, a logical view of the data) rather than probing the actual database.
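One way to picture such a logical view is the sketch below, which uses in-memory SQLite databases as stand-ins for two agency systems. The agencies, table names, column names and records are all hypothetical, and a real virtual warehouse would also need the authorization and security layers this sketch omits.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one agency system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Two hypothetical agency systems with different physical schemas.
agency_a = make_source(
    "CREATE TABLE persons (full_nm TEXT, dob TEXT)",
    "INSERT INTO persons VALUES (?, ?)",
    [("Ann Example", "1970-01-01")],
)
agency_b = make_source(
    "CREATE TABLE subj (name TEXT, birth_date TEXT)",
    "INSERT INTO subj VALUES (?, ?)",
    [("Bob Sample", "1980-02-02"), ("Ann Example", "1970-01-01")],
)

# Logical view: each source maps the abstract fields to its own columns.
SOURCES = [
    {"conn": agency_a, "table": "persons",
     "fields": {"name": "full_nm", "birth_date": "dob"}},
    {"conn": agency_b, "table": "subj",
     "fields": {"name": "name", "birth_date": "birth_date"}},
]

def query_logical(field, value):
    """Answer a question posed against the abstract model.

    The caller names only abstract fields; the mapping above decides
    which physical table and column each source is asked about.
    """
    results = set()  # a set consolidates duplicate records across sources
    for source in SOURCES:
        cols = source["fields"]
        sql = (
            f"SELECT {cols['name']}, {cols['birth_date']} "
            f"FROM {source['table']} WHERE {cols[field]} = ?"
        )
        results.update(source["conn"].execute(sql, (value,)).fetchall())
    return sorted(results)

print(query_logical("name", "Ann Example"))
```

The caller asks about `name` and `birth_date` without ever seeing `full_nm`, `dob` or the table names, which is the separation between the logical view and the underlying structure that the approach depends on.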
The virtual warehouse would translate the query, triggering actions to extract, consolidate and present data from government systems. Before this architecture can successfully address the technical issues, the developer community must establish a number of standards for database content, authorization and security.
Katherine Hammer is co-founder and chairwoman of Evolutionary Technologies International of Austin, Texas.