2020 Government Innovation Awards
Extracting tax data from scanned images
- By Stephanie Kanowitz
- Nov 16, 2020
A computer vision, artificial intelligence and natural language processing platform that standardizes scanned document images and extracts select data is streamlining how IRS manages its data and interacts with taxpayers.
Built with open source technologies, the Appeals Case Memorandum (ACM) project locates key areas of a document – right now, tables – to normalize dimensions, remove whitespace, correct rotation, identify revisions and isolate relevant text fields for optical character recognition. The data in the documents is applicable to many use cases, including documenting IRS’s findings on a particular case.
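One of those normalization steps, removing whitespace, can be illustrated with a minimal sketch. It is not the ACM project's code: it assumes the scanned page has already been binarized into a grid where 1 marks ink and 0 marks blank paper, and the function name `crop_whitespace` is hypothetical.

```python
def crop_whitespace(image):
    """Return the smallest sub-image that still contains every ink pixel.

    `image` is assumed to be a list of equal-length rows of 0 (blank) and
    1 (ink). This is an illustrative simplification, not the IRS pipeline.
    """
    # Rows that contain at least one ink pixel.
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return []  # A completely blank page has nothing to keep.

    # Columns that contain at least one ink pixel.
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# A tiny 4x5 "page" with a blank margin around the ink.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = crop_whitespace(page)
```

Trimming margins this way gives later steps, such as optical character recognition, a tighter region to work on.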
“The division within IRS that deals with our large corporate taxpayers, they were interested in understanding how some of their cases resolve in the appeals process, and for them to manually go through some of those ACM documents, it was just too much of a manual task,” said Ron Hodge II, senior manager at IRS’ Research, Applied Analytics and Statistics Data Management Division. “What this would allow folks to do is have a more comprehensive understanding of what happened to their case from end to end vs. losing visibility once it went into the appeal function.”
For context, IRS estimates that it received more than 120 million pages of correspondence from 2010 to 2015, requiring up to 31,000 full-time employees to process and about 8 billion hours of taxpayer time.
“What we’ve been able to [do] is out of thousands of cases and tens of thousands of pages within those cases, we’ve been able to extract seven years of these tables and actually put it in a centralized location,” Hodge said. This enables users to efficiently interact with the data.
He likens the process to the way a smartphone superimposes a square around faces when taking a photo. “What finds that person’s face is actually an object-detection algorithm,” Hodge said. “For our particular use case, we were interested in [finding] a table ... embedded within text.” Although a table isn’t a specific object, it has a specific structure: rows and columns.
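The idea that a table is recognizable by its rows and columns can be shown with a much simpler heuristic than the object-detection algorithm Hodge describes. The sketch below swaps in a projection-profile check: it looks for pixel rows and pixel columns that are almost entirely ink, which is what ruled table borders look like in a binarized scan. The function names, the `min_fill` threshold and the assumption of one-pixel-thick rules are all hypothetical choices made for illustration.

```python
def ruling_lines(image, min_fill=0.9):
    """Find pixel rows and columns that are almost entirely ink.

    `image` is assumed to be a list of equal-length rows of 0 (blank) and
    1 (ink). Returns (horizontal_rows, vertical_columns) as index lists.
    """
    if not image or not image[0]:
        return [], []
    height, width = len(image), len(image[0])
    horizontal = [r for r in range(height)
                  if sum(image[r]) / width >= min_fill]
    vertical = [c for c in range(width)
                if sum(row[c] for row in image) / height >= min_fill]
    return horizontal, vertical


def looks_like_table(image, min_fill=0.9):
    """Treat a region as a table if its rules enclose at least one cell."""
    horizontal, vertical = ruling_lines(image, min_fill)
    return len(horizontal) >= 2 and len(vertical) >= 2


# A 5x5 grid with rules on every other row and column (a 2x2 table).
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
# A diagonal stroke, standing in for ordinary text with no rules.
stroke = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
grid_lines = ruling_lines(grid)
```

A heuristic like this only works for tables with visible borders on clean, deskewed scans, which is one reason a trained detector is a more robust choice for real documents.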
Most recently, the technology has helped IRS with its Coronavirus Aid, Relief, and Economic Security (CARES) Act response. The CARES Act introduced new machine-readable tax forms related to business credits, so the agency needed a way to efficiently extract data from those. To do that, IRS applied the work it did for the ACM project.
“We don’t have to rebuild it from scratch because we have enough of the capability built out,” Hodge said. “Now it’s just adding some additional tweaks at the margin to help users in different use cases.”
Three business units use the technology now, but it’s applicable agencywide, Hodge said, adding that his team is working to add features.
“Because this document also contains narrative information, we’re also going to be extending into using natural-language processing techniques to help identify the background and support and rationale for a particular decision,” he said. “It’s all to make tax administration more efficient and to make it so we’re able to serve the taxpayers much more efficiently.”
Stephanie Kanowitz is a freelance writer based in northern Virginia.