PTO produces its first XML documents

Office moves to widely supported and user-friendly XML for an easier way to convert and search its data

BY WILLIAM JACKSON | GCN STAFF

For 30 years, the Patent and Trademark Office has been looking for a way to make patent documents easier to convert and search. It will begin a transition to Extensible Markup Language publishing this year.

PTO already accepts electronic patent applications in XML and will begin issuing most documents, including patent grant copies, in Standard Generalized Markup Language this year. Early next year, SGML will give way to the more widely supported XML.

PTO in 1999 considered going directly to XML, but the standard was not then mature enough, said Dave Abbott, vice president in charge of technology and development for Reed Technology and Information Services Inc., the contractor that converts electronic documents for PTO.

With the older SGML as an interim step, 'transition to XML will be fairly painless,' Abbott said.

Reed Technology will begin producing patent applications in XML next month as the result of a treaty that reconciles U.S. patent practices with those of other nations. Rather than remaining confidential until patents are granted, the applications will be available for public comment 18 months after filing, as they are in many countries.

Reed Technology will produce the new documents using XMetaL 2.0, an XML conversion tool from SoftQuad Software Ltd. of Toronto. Reed has been converting PTO documents since 1970. The company, a member of the Reed Elsevier PLC group, scans and converts about 20,000 patent files per week at its main facility in Horsham, Pa., and satellite facilities in Alexandria, Va.

Wrong read

Reed at first used a proprietary typesetting code called Blue Book. Even then, PTO officials were looking for a searchable electronic format.

'That was pretty forward-thinking in 1970,' Abbott said.

Blue Book was not up to the job, however.

'It served its purpose and made pretty pages, but it doesn't tag the data to the extent we'd like,' said Bruce Cox, manager of PTO's Information Products Division.

Blue Book handles only text, so drawings, formulas and equations had to be added by the Government Printing Office. A scheme to capture formulas was introduced in 1974, and in 1990 drawings went into electronic form. Since 1995, Reed has been producing files in Adobe PostScript format.

By the mid-1980s PTO had enough electronic files to make a searchable database. The Blue Book files were converted to what the agency calls Green Book, a text format that uses the BRS/Search text storage and retrieval system from LeadingSide Inc. of Cambridge, Mass.

The PTO Web site now posts 800G of searchable text files, but Green Book cannot display Blue Book's subscript, superscript and formula formats, Cox said.

So Green Book is being converted to an SGML format, which PTO calls Red Book, to get back the content richness, but the agency has not been pleased with the results, Cox said.

PTO plans eventually to go back and convert the original Blue Book files to XML, Cox said.

Although XML is less complex than SGML, it has tools such as MathML and Chemical Markup Language (CML) to deal with formulas and equations in documents. The XML standard has matured since PTO's 1999 decision to go with SGML as an interim format.

'There is broad support for XML outside the publishing industry,' Abbott said. 'XML is so much more widely supported and the tools are so much more user-friendly, it makes sense.'
