Machine learning digs into states’ archives
Amid growing backlogs of archival data, states are turning to machine learning to streamline records management.
Before digitization became the norm, one paper memo may have been copied just a handful of times. Now, as electronic records are replacing their paper counterparts, one copy can easily become hundreds, creating a records management challenge, said Kristopher Stenson, state records manager at the Oregon State Archives.
Besides the multiplying number of digital-born records, the software and hardware meant to store and maintain records grow obsolete over time, making it difficult to ensure that a document’s associated provenance and other metadata are properly conveyed in files, Michelle Gallinger, the state electronic records initiative coordinator for the Council of State Archives, said in an email. Another challenge lies in the lack of understanding among state agencies about digital preservation techniques and best practices, along with unclear roles and responsibilities around records management, she said.
“Exacerbating these issues is the reality that effective electronic records management remains a low priority and largely underfunded mandate in state government, and many existing digital preservation tools, primarily open-source tools, cannot be supported in existing state/territorial IT infrastructures,” she said.
“Some folks might think, ‘Well once it’s digital, you can search it [or] it’s more organized,’” Stenson said. But technology has enabled mass production and distribution of digital files that “even with a super robust staff” would “take more time than you could possibly allocate to actually go through and review.”
The sheer volume of records is not the only issue. Add to the mix all the ancillary items such as drafts of documentation, internal communications and notes that must be archived as well, Stenson said. Unlike payroll or personnel files, which are “fairly homogeneous” in how they are labeled and organized, there is no clear way to sort these supplemental records.
The Oregon State Archives Division hopes machine learning tools can help staff members with the identification, classification and categorization of records received for preservation and retention. In a recently released request for proposals, the division said it wanted a tool that could “drastically reduce the amount of time and resources spent manually processing and identifying information” and flag potentially sensitive information with an internal marker to alert staff to an item’s sensitive nature when that data is reviewed or requested.
While Oregon already uses an electronic records management system that organizes, secures, schedules and preserves records over time, it does not “whittle down the incoming records before ingest,” Stenson said in an email. For example, if the State Archives received 10GB of emails, many of those records might be duplicates or spam that the division would otherwise waste resources trying to sort, he said.
A processing tool could help significantly reduce data volume. “Costs add up if we don’t, and system efficiency drops when we cram more and more data into the system,” Stenson said.
Oregon’s Archives is not the only one to dabble in machine learning. While the idea is still new to many government agencies, in 2020 the Illinois State Archives applied machine learning tools to 5.3 million email messages from senior officials in the Governor’s Office.
The three-year project, conducted in partnership with the University of Illinois System, used the e-discovery tool Ringtail to review about 30,000 documents and, from that review, confidently predict that nearly 60% of messages did not need to be archived and that less than 2% contained sensitive information that needed to be redacted, according to Brent West, assistant director for records and information management services at the University of Illinois.
Automation has also benefited the Vermont State Archives and Records Administration, which recently conducted a six-month pilot program with data privacy and governance software provider ActiveNav to analyze five terabytes of unstructured data from the Agency of Human Services.
At a July conference hosted by the National Association of Government Archives and Records Administrators, Vermont’s State Archivist and State Chief Records Officer Tanya Marshall said the tool could categorize content by type, for example identifying a document as a draft, or classify data by how long it had to be retained. The tool helped agencies stay on track with retention and recordkeeping requirements and prepare documents for migration or information requests, she said.
“There is strong interest in AI and machine learning but there are several steps that still have to happen before it becomes a fully integrated part of the state/territorial government archival practice,” Gallinger said. “Clear quality standards will be key for archives to develop the necessary benchmark standards needed to properly evaluate how effective AI and machine learning processing is on the variety of record structures they preserve.”