Automated document conversion helps USPTO reinvent patent processing

For many years “government” was synonymous with “paperwork,” and few agencies were more inundated with paper than the U.S. Patent and Trademark Office.

“The Patent Office in 2015 has been averaging 3.7 million pages a month of filing documents supporting claims on inventions, so it’s a huge volume of material that comes into the office,” said Mark Gross, CEO of Data Conversion Laboratory (DCL). “A patent document is typically 30 to 50 pages, but it could be hundreds of pages, and each claim needs to be verified and cross-checked for proper support within the filing document, and that takes time,” he said. “Until recently the USPTO has been three or four years behind in processing patents.”

Part of the problem was sheer volume. With all that documentation coming in – in a variety of formats – automated scanning was mandatory. But many scanning solutions end up requiring humans to interpret the drawings, chemical formulas and images.

Another issue was search. According to Gross, the patent office switched to scanning its documents more than a decade ago, but the scanned documents weren’t easily searchable. “With images you can’t find phrases, search keywords and other things that you’re normally used to doing with a Microsoft Word document,” he said.

That all changed three years ago, when DCL started working with USPTO to automate the conversion of the paperwork it receives into Extensible Markup Language (XML) files – documents that are both human- and machine-readable.

DCL built software that cleans up a document – by temporarily taking out the images, the chemistry formulas and the math and leaving the white space – before the optical character recognition (OCR) process. “When you do that, you get a character accuracy of better than 99 percent, which is much better than anything the Patent Office had,” Gross said.

After a year of working with DCL’s XML conversion program, USPTO was scanning 250,000 pages a month, but it has increased its volume and now scans more than 1 million pages monthly, according to Gross.

“Of course we have to keep track of all those things we took out,” he explained. “We call them artifacts – we have to keep track of where they came from on the page, how big they were and we have to store them. Once the clean page goes through the OCR and is read, it’s formatted into an XML file.”
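The bookkeeping Gross describes can be illustrated with a minimal Python sketch. It is not DCL's software: the page is modeled as a grid of grayscale pixel values, the artifact bounding boxes are assumed to have been found already, the OCR step is left out (its text is simply passed in), and the names `Artifact`, `lift_artifacts`, `to_xml` and the XML element names are all hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

WHITE = 255  # assumed pixel value for blank paper


@dataclass
class Artifact:
    kind: str      # e.g. "image", "formula", "math"
    x: int         # left edge on the page
    y: int         # top edge on the page
    width: int
    height: int
    pixels: list   # the removed region, stored so it is not lost


def lift_artifacts(page, boxes):
    """Return (clean_page, artifacts).

    Each box is (kind, x, y, width, height) and is assumed to lie
    entirely inside the page.
    """
    clean = [row[:] for row in page]  # copy so the original scan is untouched
    artifacts = []
    for kind, x, y, w, h in boxes:
        region = [row[x:x + w] for row in page[y:y + h]]
        artifacts.append(Artifact(kind, x, y, w, h, region))
        for r in range(y, y + h):
            clean[r][x:x + w] = [WHITE] * w  # leave white space behind
    return clean, artifacts


def to_xml(ocr_text, artifacts):
    """Combine OCR text and artifact records into one XML element."""
    root = ET.Element("page")
    ET.SubElement(root, "text").text = ocr_text
    for i, a in enumerate(artifacts):
        ET.SubElement(root, "artifact", id=str(i), kind=a.kind,
                      x=str(a.x), y=str(a.y),
                      width=str(a.width), height=str(a.height))
    return root


# Usage: a 6x4 all-dark page with one 3x2 "image" region at (1, 1).
page = [[0] * 6 for _ in range(4)]
clean, artifacts = lift_artifacts(page, [("image", 1, 1, 3, 2)])
root = to_xml("1. A pump comprising a sealed housing.", artifacts)
```

The clean page is what would be handed to an OCR engine, while each artifact keeps its origin, size and pixels so it can be referenced from the resulting XML.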

Unlike previous systems the USPTO has used, the current XML conversion system not only turns a document into text but also inserts tags, keywords and other information that searches can use to determine whether one patent is closely related to another.

“This allows you to do finer searches – you can look for certain keywords in a heading or a sub-heading,” Gross said. “It makes everything easier and more accessible instead of going through millions of documents. You can also reformat it, reorganize it and produce automated indexes.”
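As a rough illustration of the heading-level search Gross mentions, the sketch below looks for a keyword only inside heading elements of a tagged document. The sample markup and the `section`, `heading` and `p` element names are assumptions made for this example, not the USPTO's actual schema, and `headings_matching` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document; real patent XML will look different.
SAMPLE = """<patent>
  <section><heading>Background of the Invention</heading><p>Prior pumps leak.</p></section>
  <section><heading>Claims</heading><p>1. A pump comprising a sealed housing.</p></section>
</patent>"""


def headings_matching(xml_text, keyword):
    """Return the text of every heading that contains the keyword."""
    root = ET.fromstring(xml_text)
    needle = keyword.lower()
    return [h.text for h in root.iter("heading")
            if h.text and needle in h.text.lower()]


claims_hits = headings_matching(SAMPLE, "claims")  # matches a heading
pump_hits = headings_matching(SAMPLE, "pump")      # appears only in body text
```

Because the search is limited to tagged headings, the word "pump" in the body paragraphs is ignored, which a plain full-text search over OCR output could not do.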

“While digitizing content into XML is not new, doing so in a fully automated process changes the economic dynamics,” Gross said, “making it feasible to digitize large content collections and large information flows, like that coming into the USPTO.”

The USPTO system automates many technical aspects of the patent review process, which allows patent examiners to focus more of their time on examining patents.

“The USPTO is in the process of modernizing all of its patent examination tools, and the data created in this project allows us to leverage business intelligence to improve the quality of our work,” explained Terrel Morris, the supervisory program manager for the USPTO.

The USPTO and DCL currently are working on upgrading the system so that other federal agencies that deal with large document repositories can use the program as well.

“This level of automation has never been tried anywhere else, but now we’re talking with other agencies to do this same thing,” Gross said. “Every agency, with budgets the way they are, [is] trying to squeeze everything they can out of every dollar and this way they can do it.”

About the Author

Derek Major is a former reporter for GCN.

