Solving the search problem for large-scale repositories
- By Mark Gross
- Apr 22, 2015
Government is drowning in paper and, more frequently, in scanned images of paper. With agencies receiving hundreds of thousands of new documents each year, some of them running to thousands of pages, the content easily gets lost in unsearchable repositories, and staffers struggle to find the right document in massive document databases.
Solutions exist to help these agencies organize and structure the documents, often by converting images and scanned documents to Extensible Markup Language (XML), which provides for sophisticated search, verification, document review and “printing” to a variety of devices. However, traditional conversion methods require significant manual effort and are economically unfeasible, especially when agencies are often precluded from using offshore labor. Additionally, government conversion efforts can be restricted by document security and the number of people that require access.
Recent advances in the technology, however, allow for fully automated, secure and scalable document conversion processes, making economically feasible what was considered impractical just a few years ago. In one particular case, the cost of the automated process was less than one-tenth that of the traditional process.
Making content searchable, allowing for content to be reformatted and reorganized as needed, gives agencies tremendous opportunities to automate and improve processes, while at the same time improving workflow and providing previously unavailable metrics. To achieve the benefits of conversion, though, these new processes need to be designed to mitigate a number of complicating factors:
Lack of control on input: Source data is often represented by numerous distinct document types with few formatting requirements.
Need for accuracy: Typical optical character recognition (OCR) engines average about 97 percent accuracy, while new techniques have driven accuracy to over 99.5 percent.
Non-textual elements: Scientific and technical documents contain extensive non-text elements, such as mathematical formulas, complex tables, charts and illustrations, all of which hamper OCR accuracy.
Specialized delivery needs: Agencies may have particular requirements, XML and others, to suit specific needs.
Security: Systems need to comply with strict government privacy and security regulations.
Scalability: Although the initial intent may be to start small, with success the requirement can quickly ramp up to millions of pages processed per month. Systems need to be designed to accommodate growth.
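To put the accuracy figures in the list above in perspective, the short sketch below estimates how many characters would be misread on a single page at each accuracy level. It assumes the percentages are per-character accuracy and that a page holds roughly 2,000 characters; both are illustrative assumptions, not figures from any specific project.

```python
def expected_errors(chars: int, accuracy: float) -> float:
    """Expected number of misrecognized characters at a given accuracy."""
    return chars * (1.0 - accuracy)


# Assumed page size, for illustration only.
CHARS_PER_PAGE = 2000

# Accuracy levels taken from the list above, read as per-character accuracy.
baseline = expected_errors(CHARS_PER_PAGE, 0.97)
improved = expected_errors(CHARS_PER_PAGE, 0.995)

print(f"97 percent accuracy:   about {baseline:.0f} errors per page")
print(f"99.5 percent accuracy: about {improved:.0f} errors per page")
print(f"Ratio of error counts: about {baseline / improved:.0f} to 1")
```

Under these assumptions, the gap between the two accuracy levels is the difference between dozens of errors on every page and a handful, which is what determines whether human review can realistically be skipped.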
The key to the conversion challenge is developing a process that allows sufficient data to be extracted from a document without human review. For example, the solution might need to focus on automatically removing “noise” from the images so that the OCR engine operates at peak accuracy. Such a solution requires software to automate the image detection and extraction processes. With that completed, the solution converts the accurately scanned text to well-formed XML, applying the metadata required to deliver fully searchable and retrievable text.
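The final step described above, turning recognized text into well-formed XML with metadata, can be sketched with Python's standard library. The element names (document, metadata, title, source, body, p) and the page_to_xml function are hypothetical, chosen only for illustration; a real project would follow the tagging rules of whatever schema the agency selects.

```python
import xml.etree.ElementTree as ET


def page_to_xml(doc_id: str, title: str, paragraphs: list[str]) -> str:
    """Wrap OCR output in a small XML document with metadata.

    The element names here are hypothetical, not a real agency schema.
    """
    root = ET.Element("document", attrib={"id": doc_id})

    # Metadata that a search index or content management system could use.
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "source").text = "ocr"

    # One <p> element per recognized paragraph.
    body = ET.SubElement(root, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text

    # ElementTree escapes characters such as & and < when serializing.
    return ET.tostring(root, encoding="unicode")


xml_text = page_to_xml(
    "doc-0001",
    "Sample memo",
    ["Costs < $5 & rising.", "Second paragraph."],
)

# Parsing the result raises ParseError if the output is not well-formed.
ET.fromstring(xml_text)
print(xml_text)
```

Building the tree through a library rather than by concatenating strings is one simple way to guarantee well-formed output, since special characters in the OCR text are escaped automatically.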
Expertise is needed to tune the implemented solution to the XML requirements of a specific agency. That includes tailoring conversion systems to match the tagging requirements of the selected XML schemas and developing the stylesheets that can recreate original documents for viewing purposes. Additionally, the solution should provide appropriate meta tags for storage and access via content management systems or front-end applications. Processes can be designed to run 24/7, with the agency and vendor both monitoring production flow and performing quality checks.
End-to-end solutions should allow for continuous improvement, and the implementation should focus on using increasing sample sizes to continually refine the software to recognize and accept reasonable variations requiring post-processing to XML.
In designing solutions for federal agencies with vast document repositories, IT staff should look for flexibility in design, workflows and compliance with changing requirements. Converting documents into searchable text with relevant and accurate metadata, and rendering output that matches the original or can be consumed in multiple formats, requires more than even the best OCR engine can accomplish. Here are some high-level criteria to consider when searching for a provider:
- Variety and complexity of the source data.
- Variety and complexity of outputs (e.g., formats, styling requirements, component reuse).
- Anticipated volumes and level of automation required.
- Metadata requirements for post-processed content.
- Standards and requirements for schemas and outputs.
- Needs and requirements for identifying and processing exceptions manually.
- Integration requirements, particularly with OCR or scan systems.
The key takeaway is that technology has developed significantly in the past few years, and automated processes that were infeasible just a few years ago now can provide a solution to dealing with the terabytes of data that all government agencies are collecting.
Mark Gross is president of Data Conversion Laboratory (DCL).