Formatting the future

Agencies are OK with these formats

The Electronic Records Management E-Gov Initiative, overseen by the National Archives and Records Administration, has started specifying formats that are acceptable for submitting records to NARA for long-term archiving. NARA plans to designate more acceptable transfer formats in the near future.

  • ASCII (for text): NARA accepts text documents, Web pages and e-mails in the American Standard Code for Information Interchange. The ASCII standard, which has been around for more than 40 years, was created to represent the English alphabet, numerals and selected special characters. ASCII has no way to specify how characters are displayed, such as which fonts or typefaces to use, and can be read by virtually all word processors and browsers, as well as many other applications.

  • Geography Markup Language (for GIS records): GML is an Extensible Markup Language-based format for geospatial data records used in geographic information systems, overseen by the Open Geospatial Consortium. NARA accepts records in versions 2 and 3.

  • JPEG (for images): Joint Photographic Experts Group's File Interchange Format is used for capturing images. The International Organization for Standardization recognizes JPEG as a still-image standard. Not yet recognized by NARA but generating interest in the archiving community is the JPEG successor, JPEG2000, which reportedly offers better compression.

  • Portable Document Format (for documents and forms): Adobe PDF is used to capture paper documents in an electronic format, although electronic-only documents can be created with PDF as well. PDF maintains the original look-and-feel of a document, regardless of what computer platform it is opened on. NARA accepts PDF documents in versions 1.0 - 1.4, and asks agencies to turn off all security settings before submitting documents.

  • TIFF (for images): The Tagged Image File Format encodes bit-mapped images, although extensions exist for character recognition as well. Adobe holds the copyright to the TIFF specification. NARA accepts images in TIFF formats 4 through 6.

  • Spatial Data Transfer Standard (for GIS data): A format for representing Earth-referenced data, SDTS is recognized by both the Federal Information Processing Standards and the Federal Geographic Data Committee, a 19-member interagency committee developing policies for geographic data use. The Geological Survey makes heavy use of SDTS.

    Rachael Golden

    To keep documents accessible, agencies face critical choices on software and hardware

    Five years ago, U.S. Courts started putting in place an electronic docket-filing system. It would contain records to be kept, and accessed, for decades, if not indefinitely, and that forced project managers to make some tough decisions on electronic formats.

    The courts decided on the Adobe Portable Document Format, for two reasons, according to John Brinkema, a senior research computer scientist at the Administrative Office of the U.S. Courts.

    First, PDFs preserve the look and feel of the original paper document, an important quality because legal documents frequently make references to other pages within that document or to pages in other documents, even if those records are in electronic form only.

    Second, and more important, Adobe Systems Inc. has published the specifications for reading a PDF document. Should the company ever go out of business, the records could be accessed using other software written to Adobe's specifications.

    These days, U.S. Courts has more than 2 billion electronic records in PDF format, spread across almost 200 locations around the country. The federal judiciary is ahead of many agencies in establishing an electronic-records management process.

    'Rather than waiting around for the rest of the government to do things, we just did it,' Brinkema said.

    When it comes to saving electronic information for the ages, the challenge of choosing the appropriate format is formidable.

    A format specifies how to encode a set of data so that it can be accessed by people or other machines. Every software vendor uses formats to encapsulate the data that is generated in its programs.

    But given the volatility and constant changes in the IT industry, the formats an agency chooses today might not be around in 10 or 100 years. Horror stories abound of suddenly vital documents locked away on some early, now unreadable, version of WordStar.

    'It is hard to preserve digital information without a clear guide to how the information is encoded within a format,' said William LeFurgy, a project manager for the Library of Congress' Digital Initiatives program.

    Compounding this problem is the fact that hardware used to read these formats may also disappear. Who still has equipment to read 5¼-inch floppy disks or punch cards?

    LeFurgy said the library's Digital Initiatives program weighs a number of factors when deciding whether to hold on to a format for long-term use.

    One is the proprietary nature of the format. 'This is a major problem for many commercial software products: the specification is hidden as a business secret, which results in a format whose information content cannot be decoded without using the original proprietary software,' LeFurgy said. The library will not rule out proprietary formats altogether. But any used must have publicly disclosed specifications, such as Adobe's.

    Still, even supposedly open file formats can contain traps. The PDF format, for instance, can be extended to include JavaScript, audio files, images, special fonts and even videos, all of which may or may not be encoded in an open format. For this reason, in 2002 the U.S. Courts started work on a subset of the PDF specification, called PDF archiving, or PDF/A.

    The idea, according to Melonie Warfel, director of worldwide standards for Adobe, is to have a subset of the PDF specifications that is restricted only to completely open standards.

    Aside from functionality, agencies should also consider a format's popularity, LeFurgy said. There's a good chance that documents written in Microsoft Word will be accessible for quite some time, simply because it is so widely used today, and readers will be in demand for some time to come.

    One piece of good news for agencies is that the National Archives and Records Administration has started specifying which formats other agencies should use to submit their records to NARA.
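    An agency preparing a transfer could pre-screen files against two of the accepted formats listed with this article: plain ASCII text and PDF versions 1.0 through 1.4. The sketch below is a minimal illustration in Python, not a NARA tool. The function names are hypothetical, and the PDF check reads only the `%PDF-x.y` header at the start of the file, so it says nothing about security settings or embedded content.

```python
import re

# Assumption: the accepted PDF range (1.0 through 1.4) comes from the format
# list in this article. All names below are hypothetical, for illustration.
ACCEPTED_PDF_VERSIONS = {"1.0", "1.1", "1.2", "1.3", "1.4"}


def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte falls in the 7-bit ASCII range (0-127)."""
    return all(byte < 128 for byte in data)


def pdf_header_version(data: bytes):
    """Return the version from a leading '%PDF-x.y' header, or None."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


def pdf_version_accepted(data: bytes) -> bool:
    """Return True if the header version is in the accepted set."""
    return pdf_header_version(data) in ACCEPTED_PDF_VERSIONS
```

    In practice the bytes would come from reading each file in binary mode. A header check is only a first filter; a file that passes could still carry encryption or embedded media that an archive would reject.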

    Yet another factor to consider is the complexity of a format's encoding process, LeFurgy said. Compression schemes used to reduce the size of a record, or encryption schemes to secure a document, could be particularly problematic for future archivists, who might not have access to those algorithms.

    Though it requires greater amounts of storage space, keeping records uncompressed is also a smart move in preserving the fidelity of images and audio files, said Charles Fenimore, Motion Image Quality project leader for the Digital Media Group of the National Institute of Standards and Technology.

    Fenimore's research team is finding that converting imagery or sound from one compressed format to another always results in additional loss of quality, which can be problematic as older data gets moved to newer formats, he said.
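    The effect Fenimore describes can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not the NIST team's method: it treats each "lossy format" as nothing more than rounding to a fixed step size, and compares a copy encoded straight from the original with one converted from an already lossy copy.

```python
def quantize(value: int, step: int) -> int:
    """Round a non-negative integer to the nearest multiple of step."""
    return ((value + step // 2) // step) * step


def mean_abs_error(original, decoded):
    """Average absolute difference between original and decoded samples."""
    return sum(abs(o - d) for o, d in zip(original, decoded)) / len(original)


# Toy "signal": the integers 0..34 (hypothetical data, chosen for simplicity).
samples = list(range(35))

format_a = [quantize(s, 7) for s in samples]       # first lossy format (step 7)
direct_b = [quantize(s, 5) for s in samples]       # format B made from the original
transcoded_b = [quantize(a, 5) for a in format_a]  # format B made from the A copy

error_a = mean_abs_error(samples, format_a)
error_direct_b = mean_abs_error(samples, direct_b)
error_transcoded_b = mean_abs_error(samples, transcoded_b)

print(f"format A: {error_a:.2f}")
print(f"format B from original: {error_direct_b:.2f}")
print(f"format B from A: {error_transcoded_b:.2f}")
```

    In this toy case the converted copy has a mean error of 2.0, worse than both the format A copy it came from (about 1.71) and a format B copy made directly from the original (1.2), even though format B uses the finer step. Real codecs are far more complex, but the rounding errors compound in a similar way.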

    In addition to file formats, agencies must also worry about the formats of the physical media themselves: the tapes, disk drives and optical disks that contain records. These, too, are vulnerable to rapid obsolescence.

    While tape is considered the electronic medium that lasts the longest, it is not immune to failure. Kenneth Thibodeau, director of NARA's Electronic Records Archive program, has heard of rare cases where an entire library of aging tapes suddenly started failing en masse.

    'The chemical processes of the manufacturing processes are such that a batch of tapes could self-destruct in a matter of months,' Thibodeau said.

    NARA keeps its permanent records on two copies of tapes, each in a different location. To guard against failures, the agency tests a sample of tapes each year to ensure they are still stable.

    'Tape is a devil, but it is a devil we know. We know the vulnerabilities, and most of them can be managed,' Thibodeau said.

    Optical questions

    Agencies are increasingly using optical disks for archiving, though the jury is out on how long the media can last, given that optical disks have only been in use for the past 25 years or so. It's an area that Fred Byers, an IT specialist at NIST, is investigating.

    Byers said he thinks that disks could last for over a century if kept in favorable environmental conditions. What concerns him, though, are the fluctuating levels of quality control in manufacturing, which lead to variances in how long disks can last. He has started working with the Optical Storage Technology Association to develop an industry archiving standard.

    If manufacturers adhere to quality control specifications that the OSTA working group is developing, they will be able to put a seal of approval on their products, indicating that the disks should last for a set number of years.

    In the end, though, agencies must assume that whatever media they use will be obsolete sooner or later. So they should develop a long-term strategy of periodically updating their files to whatever media is current, officials said. In other words: think of archiving not as a process of putting records on storage media, but rather as a process to preserve records independent of whatever physical media is used. This is the strategy both NARA and the Library of Congress are taking.

    'It is a given that we will be moving digital content to and from many kinds of media as part of our ongoing management and preservation function,' LeFurgy said.

    NARA has had a storage migration plan in place since 1971, Thibodeau said. The agency will pick a storage medium that it can trust to last a specific stretch of time, and develop a process of moving the records off that medium when that time period ends.

    Thibodeau likens the archiving process to a funnel, one that takes in many formats and converts them all into a standard output format.

    'The first thing we do when a record comes in is that we copy it from whatever media it is onto the standard media, and to a standard physical format,' Thibodeau said.

    At the end of that lifecycle, the agency can easily automate the transfer of those records to the new media.

    'It becomes a production process,' Thibodeau said.

    The Archival Preservation System now handles those duties. But it will be replaced by NARA's Electronic Records Archive, which will be more suited for handling submissions through the Internet.

    An important aspect of migration is that agencies must maintain a record's authenticity, because electronic records can be modified less conspicuously than paper records.

    'What is important is the ability to preserve those records in an authentic manner, so that it is incontestable if they were to go to court,' said Tom Kelley, a customer engagement manager for Lockheed Martin Corp.

    Lockheed Martin is one of two companies NARA chose (the other is Harris Corp. of Melbourne, Fla.) to build ERA prototypes. In any system, agencies must be able to establish a chain of custody leading back to the original to prove the record in question remains authentic, despite any number of transformations.

    'We would keep traceability back to the original submittal, and any chain of transformations that would happen,' Kelley said.
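    One common way to keep that kind of traceability is a log in which every entry records a hash of the record's content and a hash of the previous entry, so that a later alteration breaks the chain. The Python sketch below is a hypothetical illustration of the idea, not a description of the ERA prototypes; the field names and helper functions are assumptions.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def append_event(chain: list, action: str, content: bytes) -> None:
    """Append a custody event that is linked to the previous entry's hash."""
    entry = {
        "action": action,
        "content_hash": sha256_hex(content),
        "previous_entry_hash": chain[-1]["entry_hash"] if chain else "",
    }
    # The entry's own hash covers the three fields above (hypothetical layout).
    entry["entry_hash"] = sha256_hex(
        json.dumps(entry, sort_keys=True).encode("utf-8")
    )
    chain.append(entry)


def chain_is_intact(chain: list) -> bool:
    """Recompute every hash and check each link back to the first entry."""
    previous = ""
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        expected = sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))
        if entry["previous_entry_hash"] != previous:
            return False
        if entry["entry_hash"] != expected:
            return False
        previous = entry["entry_hash"]
    return True


chain = []
append_event(chain, "original submittal", b"record as received")
append_event(chain, "converted to standard format", b"record after conversion")
print(chain_is_intact(chain))
```

    A chain like this only detects changes made after the fact. Anyone able to rewrite the whole log could recompute the hashes, so a production system would likely also sign entries or store copies of the log independently.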

