Records management takes a few lessons from supercomputing
- By Joab Jackson
- Sep 22, 2004
Storing and categorizing data has parallels elsewhere in government computing.
A collection of multilayered geospatial files, for instance, bears a striking similarity to a large collection of DNA strands, researchers say. Managing such a collection is also not unlike managing the huge volumes of data processed by supercomputing programs.
Such unlikely similarities have spurred the National Archives and Records Administration to partner with the National Center for Supercomputing Applications to investigate how scientific data management tools could be used in records management.
The archivists are hoping to pick up some techniques in large-scale data handling from the managers of big iron.
'Scientists have to organize complex data. So we are asking ourselves what are some of the techniques and methods in organizing scientific data applicable to this domain,' said Michael Folk, who was project director for the first phase of the program.
Robert Chadduck, NARA's director of research, initiated the project, titled Applications of Scientific Data Management Tools and Automated Classification Techniques for the Management of Electronic Records. For the first phase, NARA awarded the supercomputing center a $285,000 cooperative agreement for six studies.
The studies looked at distributed records management, which formats were suitable for long-term archiving and holding large amounts of data, ways of accelerating data intake and output, and ways of automating the process of categorization.
'The National Archives culture is not a high-performance computing culture,' Folk said. He described the process for accepting material into the archive as human-based. 'Somebody makes sure [a document is] accurate, marks it up and puts it in.'
The difficulty is that the sheer volume of material threatens to overwhelm the archivists. Fortunately, managing vast amounts of data is 'an area where perhaps we have some experience that we can bring to bear,' Folk said.
'We're always looking for ways of making reading and writing data faster,' Folk said.
Open format
Formats are another issue under scrutiny. In one study, Folk and Bruce Barkstrom of NASA's Atmospheric Sciences Data Center studied applying data formats used by the scientific community to long-term records management. At present, they are looking at standards such as NCSA's Hierarchical Data Format, an open format widely used for managing scientific data.
One principle gleaned from their work is that formats must be very well defined to hold up over time. In long-term data storage, commercial software might not be available to view files in a particular format. So it's important, the researchers concluded, that formats be defined in 'a sufficiently rigorous way,' so that software can be recreated to read the files.
'Perhaps the most useful way to improve data formats for long-term, persistent access is to place their structure within a rigorous mathematical structure,' they wrote.
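To make the principle concrete, here is a minimal sketch in Python of a format specified tightly enough that a reader could be rebuilt from the specification alone. It is an illustration only, not the researchers' method and not the Hierarchical Data Format. The 'REC1' signature, the field layout and the function names are all hypothetical.

```python
import struct

# Hypothetical format, fully specified here so a reader can be rebuilt later:
#   header : 4-byte signature b"REC1", uint16 version, uint32 record count
#   record : uint32 byte length, followed by that many bytes of UTF-8 text
# All integers are big-endian, with no padding between fields.
MAGIC = b"REC1"
HEADER = struct.Struct(">4sHI")
LENGTH = struct.Struct(">I")


def encode(records, version=1):
    """Serialize a list of strings into the hypothetical REC1 layout."""
    parts = [HEADER.pack(MAGIC, version, len(records))]
    for text in records:
        payload = text.encode("utf-8")
        parts.append(LENGTH.pack(len(payload)))
        parts.append(payload)
    return b"".join(parts)


def decode(blob):
    """Rebuild the records using nothing but the layout described above."""
    magic, version, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a REC1 file")
    offset = HEADER.size
    records = []
    for _ in range(count):
        (length,) = LENGTH.unpack_from(blob, offset)
        offset += LENGTH.size
        records.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return version, records


blob = encode(["memo", "map layer"])
version, records = decode(blob)
```

Because byte order, field widths and text encoding are all stated explicitly, the decoder depends on no vendor software, which is the property the researchers describe.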
The second phase of the project, now under way, is overseen by Michael Welge and Bill Bell, both of NCSA. These studies will look at aspects such as visualization and analysis, data management and performance measurements.
Joab Jackson is the senior technology editor for Government Computer News.