Machine learning too expensive for state archives – for now
The Oregon State Archives will pivot from machine learning to advanced analytics for streamlining management of several terabytes of new data.
Having initially planned to use machine learning to help archivists sort through scads of data, the Oregon State Archives will instead deploy advanced data analytics amid concerns over the cost and maturity of ML technology.
Archives officials late last year released a request for proposals calling for ML technology to help process data by removing duplicates and any items that are not pertinent, while indexing it to facilitate future access. The RFP was prompted by former Gov. Kate Brown’s departure from office and the expected transmission to the state archives of up to 10 terabytes of data from her eight years as governor.
But the responses to the RFP forced officials to rethink, Kristofer Stenson, state records manager at the Oregon State Archives, said during Nextgov and GCN’s Emerging Tech and Modernization Summit. At least one bid for the ML contract was priced at three times more than the archives’ entire biennial budget, which Stenson described as “eye opening.”
“I wouldn't call it a failed procurement, in that we did learn a lot from it,” Stenson said. Given the need for the technology to further mature and prices to come down, state leaders decided to “pivot” away from ML.
Instead, Stenson said Oregon will use advanced data analytics to process the records from Brown’s gubernatorial archives. That technology will help identify and remove any duplicates as well as sensitive information like Social Security numbers and phone numbers. It also offers advanced search to scour the archives.
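For a rough sense of what that kind of processing involves, the sketch below removes exact duplicates by content hash and masks Social Security and phone numbers with regular expressions. It is an illustration only, not the archives' actual tooling: the function names `dedupe` and `redact` and both patterns are assumptions made for this example, and real records would need far broader rules, such as near-duplicate detection and more number formats.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; they cover common U.S. formats.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")


def dedupe(docs):
    """Keep the first copy of each document whose text is byte-for-byte identical."""
    seen = set()
    unique = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def redact(text):
    """Mask Social Security numbers first, then phone numbers."""
    text = SSN_RE.sub("[SSN REDACTED]", text)
    return PHONE_RE.sub("[PHONE REDACTED]", text)


if __name__ == "__main__":
    records = ["Memo A", "Memo B", "Memo A"]
    print(dedupe(records))
    print(redact("SSN 123-45-6789, call 503-555-0100."))
```

Hashing catches only exact copies; an email forwarded with a new signature would count as a separate record, which is one reason commercial analytics tools go further than this sketch.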
That effort, Stenson said, is “much more doable in the short term” and within the current budget. Other Oregon agencies use similar analytics tools, so there is precedent for the archives using them for now as a “stepping stone” to full ML in the future.
“It still represents a big step forward for us and hopefully will also allow us to more directly provide access to these collections soon,” Stenson said.
The archives is also in the midst of a conversation about how to store the terabytes of documents, communications and other data from Brown’s tenure, which included leading the state government’s response to the COVID-19 pandemic. Stenson said that solution likely will be a hybrid approach with both cloud and on-prem storage to provide redundancy.
Grappling with the sheer amount of electronic data generated by elected officials is an issue that all state archives must face. Stenson said ML will soon be a “critical tool” to manage that information.
“This is real, this isn't a pie-in-the-sky dream anymore,” Stenson said. “This is the world we're going to be living in. While we might not be quite there yet, we are getting there pretty quickly.”
For procurement officials, the episode has shown how mature ML technology currently is and what it costs governments to use it. Stenson said while Oregon “may have just slightly jumped the gun” in asking for ML solutions now, it is better to be forward-thinking on emerging technologies than behind the curve. “I'd rather be looking ahead than be five years too late,” he said.