Archiving prototype is promising, NARA says

By Christopher J. Dorobek
GCN Staff

Just a year ago, the National Archives and Records Administration worried that it would not have the computing power to process the millions of electronic files coming its way at the end of the Clinton administration.

But work by the National Partnership for Advanced Computational Infrastructure and the San Diego Supercomputer Center at the University of California, San Diego, seems likely to help NARA develop a system for handling the electronic onslaught.

The problem is relatively simple yet ominous: When President Clinton's term expires, NARA will receive more than 20 million files from the executive office. Using the tools available to the agency even as recently as a year ago, NARA predicted it would take 11 years to process the files.

Once NARA finished that task, it would have to turn around and start preserving them [GCN, March 23, 1998, Page 88].

'The government is increasingly generating large numbers of electronic records, such as e-mail messages, word processing documents and spreadsheets, which are treated electronically as individual files,' Archivist John W. Carlin said during a recent hearing of the House Government Reform Subcommittee on Government Management, Information and Technology.

'NARA has had no method of preserving and making these millions of files available,' he said.

But research has produced a prototype system that would let NARA preserve 1 million e-mail messages in two days, said Ken Thibodeau, director of NARA's Center for Electronic Records. Previously, the most modern systems could handle fewer than 2 million files a year.

The university researchers' findings suggest that an electronic records archive could be built to preserve any kind of electronic record in a format that frees it from the system in which it was created, Carlin said.

'All this could be a major breakthrough in our search for an affordable system to access, preserve and provide electronic access to electronic records for the federal government,' Carlin said. NARA plans to continue reviewing the prototype, he said.

The system would let the archives preserve records across several types of technology. 'The only thing we have to change over time is the application programming interface,' Thibodeau said. 'We have to be able to interpret the Standard Generalized Markup Language tags, which are all in plain ASCII, to whatever we're sending it to. We don't have to bother going back in and reformatting the stuff.'

Such a process would not create access barriers and would let the archives use the best technology available, he said.

The system would use the Extensible Markup Language to capture the contents of documents and to create a file structure made up of document groups.

The prototype system is based on the concept that it is easier to manage groups of documents than to manage millions of individual documents.

'It's very difficult to manage millions of individual objects. So what you really want to do is manage containers that have millions of objects inside of them,' said Reagan Moore, associate director for data-intensive computing environments at the San Diego Supercomputer Center and a professor of computer science at UCSD.
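The container idea Moore describes can be illustrated with a minimal sketch. It is not the prototype's actual code, which the article does not show. The element names (`container`, `record`, `sender`, `subject`, `body`) and the function names are assumptions chosen for illustration; only the use of XML and the grouping of many documents into one managed unit come from the description above.

```python
import xml.etree.ElementTree as ET


def build_container(group_name, messages):
    """Wrap a group of e-mail messages in a single XML container element.

    The element and attribute names here are hypothetical, not NARA's schema.
    """
    container = ET.Element("container", {"name": group_name})
    for number, message in enumerate(messages, start=1):
        # Each message stays an individual record inside the container.
        record = ET.SubElement(
            container, "record", {"id": str(number), "type": "email"}
        )
        ET.SubElement(record, "sender").text = message["sender"]
        ET.SubElement(record, "subject").text = message["subject"]
        ET.SubElement(record, "body").text = message["body"]
    # Record the size once so the group can be managed as one unit.
    container.set("count", str(len(messages)))
    return container


def container_to_text(container):
    """Serialize the container to a plain-text XML string."""
    return ET.tostring(container, encoding="unicode")
```

An archive built this way would track one container per document group rather than one entry per message, while each message remains addressable by its `id` inside the container.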

Marks up parts

The system would analyze each document to identify its important parts and learn its structure. Based on that information, the system would then mark up the document so that it could be tracked and assign it a document type from a group of predefined categories.
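As a rough sketch of that analyze-then-mark-up step, the example below guesses a document's type from a small predefined list and wraps the parts it recognizes in tags. The categories, the detection rules, and the tag names are all assumptions made for illustration; the article does not describe how the prototype actually performs its analysis.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical predefined categories; the real list is not given in the article.
DOCUMENT_TYPES = ("email", "spreadsheet", "memo")


def detect_type(text):
    """Guess a document type with simple, illustrative rules."""
    has_from = re.search(r"^From:", text, re.MULTILINE)
    has_subject = re.search(r"^Subject:", text, re.MULTILINE)
    if has_from and has_subject:
        return "email"
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and all("," in line for line in lines):
        return "spreadsheet"
    return "memo"


def mark_up(doc_id, text):
    """Tag the recognizable parts of a document and record its type."""
    doc_type = detect_type(text)
    root = ET.Element("document", {"id": doc_id, "type": doc_type})
    if doc_type == "email":
        # Headers come before the first blank line, the body after it.
        headers, _, body = text.partition("\n\n")
        for line in headers.splitlines():
            name, separator, value = line.partition(":")
            if separator:
                header = ET.SubElement(root, "header", {"name": name.strip()})
                header.text = value.strip()
        ET.SubElement(root, "body").text = body.strip()
    else:
        ET.SubElement(root, "body").text = text.strip()
    return root
```

The point of the sketch is the ordering: structure is inferred first, and the markup then carries both the document's parts and its assigned type, so later software needs only to read the tags.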

'You must be able to preserve every object in the collection in itself and as a member of the collection,' Thibodeau said.

NARA had researchers test the prototype on many types of documents, including databases, e-mail messages and digital images.