IG: NARA e-records archive missing vital Google-like search capability
Editor's note: This story was modified after its original publication to clarify information.
People trying to search the text of documents through the National Archives and Records Administration’s $430 million Electronic Records Archive are going to be disappointed, according to the agency’s inspector general.
Under the currently deployed system, users can search only by metadata. That typically includes tags for information such as the name of the original publication, the date of publication, the agency that originated the document, and a small number of keywords. Users who hope to locate a document by a word or phrase that isn't part of the metadata will be unable to do so.
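To make the distinction concrete, here is a minimal Python sketch contrasting a metadata-only search with a full-text search. It is an illustration under stated assumptions, not a description of how the ERA is built: the `Record` fields, the sample records and both search functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    # Hypothetical metadata tags, loosely following those named in the article.
    title: str
    agency: str
    date: str
    keywords: List[str] = field(default_factory=list)
    # The document text itself, which a metadata-only search never reads.
    body: str = ""


def metadata_search(records: List[Record], term: str) -> List[Record]:
    """Return records whose metadata tags contain the term (case-insensitive)."""
    needle = term.lower()
    hits = []
    for record in records:
        tags = [record.title, record.agency, record.date, *record.keywords]
        if any(needle in tag.lower() for tag in tags):
            hits.append(record)
    return hits


def full_text_search(records: List[Record], term: str) -> List[Record]:
    """Return records whose metadata tags or body text contain the term."""
    needle = term.lower()
    hits = []
    for record in records:
        searchable = [record.title, record.agency, record.date,
                      *record.keywords, record.body]
        if any(needle in text.lower() for text in searchable):
            hits.append(record)
    return hits


# Invented sample data for illustration only.
SAMPLE_RECORDS = [
    Record(
        title="Quarterly Inspection Summary",
        agency="Example Agency",
        date="2009-03-31",
        keywords=["inspection", "infrastructure"],
        body="Inspectors found seepage along the northern levee wall.",
    ),
    Record(
        title="Budget Memo",
        agency="Example Agency",
        date="2010-01-15",
        keywords=["budget"],
        body="Funding for the coming fiscal year is unchanged.",
    ),
]

if __name__ == "__main__":
    # "levee" appears only in a document body, never in a metadata tag.
    print(len(metadata_search(SAMPLE_RECORDS, "levee")))
    print(len(full_text_search(SAMPLE_RECORDS, "levee")))
```

In this sketch a search for "levee" finds nothing by metadata but finds the inspection summary by full text, which is the gap the inspector general describes. A production archive would normally rely on a prebuilt index rather than scanning every document on each query.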
The public’s ability to use the ERA is likely to be hampered by the lack of a full-text search capability similar to what is available on Google.com or other commercial search engines, NARA Inspector General Paul Brachfeld said in an interview Oct. 26.
Lack of full text search “is one of the profound problems with the ERA at this point,” Brachfeld said. “Metadata alone does not tell the story of what is in the documents.”
Brachfeld recently released online copies of two management letters he previously wrote in January and May of 2011 to Archivist of the United States David Ferriero on the inadequacy of search tools in the ERA.
The agency has acknowledged some of the limitations with the ERA. It completed its $430 million system development contract with Lockheed Martin in September and did not extend it for an optional year. It has hired IBM to maintain and operate the ERA on an annual contract valued at $243 million over 10 years if all options are exercised.
Under the new contract, the agency will encourage IBM to try to enhance the system to add text search capability, but it is not clear whether the current system architecture would permit that capability or whether its cost would be prohibitive, Brachfeld said. Adding full-text search at this time may also interfere with protections for personally identifiable data, he said.
“It is built into the contract to try to address the full text search capability,” Brachfeld said. “I am not sure what they can do.”
Lack of text searchability “is an important weakness, and I am not sure it can be corrected,” he said.
Brachfeld said the flawed system was poorly designed by a succession of managers, many of whom have left government. "The program has had problems since its inception under then-Archivist of the United States John Carlin," he said. There have been three successive U.S. archivists in charge since Carlin's tenure, he added.
Throughout the multi-year program, the inspector general’s office continued to ask about search capability, he said.
The office asked “fundamental questions of ERA program managers, employees, contractors and senior NARA officials. The most basic being, ‘At full operational capability, will the common citizen be able to effectively access and research the electronic records they are entitled access to over the Internet?’” Brachfeld wrote in the May 4 management letter. “We believe the answer, with limited caveats, is no.”
Brachfeld also warned that limited search capabilities are likely to create severe bottlenecks in screening documents for entry into the ERA, because classified information and personally identifiable information must be identified and removed.
While agencies are not supposed to send classified information to the ERA, it is likely that screening will be needed to ensure that classified information does not appear, and that may cause slowdowns of the system, he suggested.
“If one imagines ERA as a busy six-lane highway moving an immense amount of traffic, this part of the ingest procedure is akin to closing five lanes for a stretch. While the rest of the highway remains capable of transporting all the traffic, the back-up or bottleneck caused by that one stretch makes it impractical to use the road,” Brachfeld wrote.
Alice Lipowicz is a staff writer covering government 2.0, homeland security and other IT policies for Federal Computer Week.