OCR still doesn't beat a good pair of eyes

It's not that desktop scanners haven't improved; even a decade ago, they
captured images pretty well. Today's scanners have far better non-flatbed feed
mechanisms, higher resolution, lower prices and better color processing.


If you want to implement a large document image scanning and storage project today, you
don't have to worry about whether scanners are up to the task, or even whether mass
storage is cheap enough.


The big question is whether you can find good enough indexing software and whether you
can manage the images in an object-oriented database.


Planning for such a transition from tried-and-true older systems takes a lot of time,
because agencies have learned from bitter experience that they must test in advance and
perhaps even operate parallel systems for a time.


Only recently has CD-recordable technology brought capacious, inexpensive optical
storage into scanning environments. Many agencies have experimented with magneto-optical
drives for document imaging, but these tend to be expensive and unsuited for mass
replication and distribution.


CD-ROM remains the most cost-effective medium for publishing and storing images
electronically. Until recently, mass CD replication had to be contracted out to service
bureaus. Few agencies so far have installed their own CD publishing operations.


Perhaps the most visible and successful government CD-ROM imaging project has been at
the Patent and Trademark Office. PTO stores and distributes tens of thousands of patent
document images on computer-readable disks that cost less than $2 each to produce.


More than a decade ago, when I began creating benchmark tests for office scanners and
OCR software, most documents were output by dot-matrix or other impact printers in simple
fonts with fixed spacing between letters and words. Just as now, manufacturers claimed OCR
accuracy rates around 98 percent.


My actual results did come close to 98 percent accuracy on simple documents from
daisy-wheel printers with single-use ribbons.
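

To put that 98 percent figure in perspective, consider a quick back-of-the-envelope
calculation; the page length below is my own assumption, not a benchmark result:

    # Rough illustration: errors remaining at a given character accuracy.
    # The 2,500-character page is an assumed typical typed page, not a
    # measured figure.
    chars_per_page = 2500
    accuracy = 0.98

    errors_per_page = chars_per_page * (1 - accuracy)
    print(f"About {errors_per_page:.0f} wrong characters per page")  # About 50

Fifty wrong characters on every page is a long way from hands-off conversion.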


However, accuracy plummeted when characters from the top row of the keyboard,
such as !, @, # and $, appeared in a document.


Scanning photocopies produced many more errors. OCR on dot-matrix output seldom reached
even 90 percent accuracy, and typeset pages from books or newspapers gave unacceptable
results. Small and highly stylized fonts such as those on business cards couldn't even be
recognized.


OCR software has improved in the past decade, but the complexity of the task has
increased, too. Laser printing is the rule now. Although laser output is crisp and
easy to read, it is much harder to process optically than typed pages, because
today's users tend to mix a large collection of fonts into even a simple memo.


That means advances in OCR software and processor speed have barely kept pace with
changes in printing. Proportionally spaced and tiny fonts pose the greatest OCR
challenges, and they pop up all over the place.


If you doubt that OCR is still immature, consider that many companies refuse to convert
printed material by OCR if they need it in an electronic format. Instead, they routinely
ship it off somewhere to be rekeyed by hand.


Software has made great strides at recognizing italic, typeset and exotic fonts, and it
has gotten better at helping users correct the inevitable scanning errors.


But the maximum accuracy it achieves on what it can decipher has not improved
significantly.


Further rises in OCR accuracy will be incremental, but we will see significant
improvements in the software interfaces. At the least, today's rather unfriendly OCR
programs will begin to match the best of their breed.


One potential advance: Neural networking might significantly extend OCR's
accuracy. Neural network software learns as it works, provided the content and
format of the documents follow consistent, learnable patterns.


But a random collection of scanned documents offers little that is useful for the
software to learn.
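

For the technically curious, here is a minimal sketch of the neural idea, written
in Python with NumPy. It is strictly my own toy illustration, not anyone's OCR
engine: a one-layer network that learns to tell two made-up 3-by-3 glyphs apart
and still recognizes a slightly smudged sample.

    import numpy as np

    # Toy illustration of a neural classifier "learning as it works."
    # Two invented 3x3 glyph bitmaps stand in for scanned characters.
    glyph_I = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # an "I" shape
    glyph_L = np.array([1, 0, 0, 1, 0, 0, 1, 1, 1], dtype=float)  # an "L" shape

    X = np.array([glyph_I, glyph_L])
    y = np.array([0.0, 1.0])  # 0 means "I", 1 means "L"

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=9)  # one weight per pixel
    b = 0.0

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(500):  # repeated exposure is the "learning as it works"
        p = sigmoid(X @ w + b)            # predicted probability of "L"
        grad = p - y                      # cross-entropy gradient
        w -= 0.5 * (X.T @ grad) / len(y)  # nudge weights toward fewer errors
        b -= 0.5 * grad.mean()

    noisy_I = glyph_I.copy()
    noisy_I[0] = 1.0  # flip one pixel, as a smudge might
    print("P(glyph is L):", sigmoid(noisy_I @ w + b))  # near 0: still an "I"

A real OCR network sees thousands of pixels and characters, but the loop is the
same in spirit, and it converges only when the samples share consistent shapes.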


In contrast, document image scanning has advanced by leaps and bounds with better
software, desktop scanners and low-cost storage. Even image transmission is easier now
with dial-up Internet access and satellite channels.


The bottleneck is database management software, which is still struggling to catch up
to the vast quantities of images. Once this bottleneck loosens, most offices probably will
begin to work with still images and even video.


Will OCR ever gain the same degree of acceptance? Scanning images of documents is easy;
converting them into text is much more time-consuming and difficult. Just consider how
hard it is to verify scanning accuracy.


As you capture an image, a mere glance can tell you whether a particular page is
missing some text, has proper margins or is blurred. Without reading a single word, you
can identify problems in the imaging operation.


But when you turn to OCR to process important document images (would you waste
time on unimportant ones?), someone must visually compare every word and number
on every page against the originals and make corrections.
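

In miniature, that verification is a word-by-word comparison against a trusted
original. Here is a sketch in Python using the standard difflib module; the sample
strings are invented:

    import difflib

    # Hypothetical sample: compare OCR output against the proofread original,
    # word by word, the way a human verifier must.
    original = "Invoice 4821 totals $1,250.00 due 6/30".split()
    ocr_text = "Invoice 4B21 totals $1,250.O0 due 6/30".split()

    matcher = difflib.SequenceMatcher(a=original, b=ocr_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            print(tag, original[i1:i2], "->", ocr_text[j1:j2])

    print(f"Word-level similarity: {matcher.ratio():.0%}")

Both misreads in the sample would sail right past a spelling checker; only
comparison against the original catches them.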


Some documents merit such care and expense, but the majority of office documents don't.


When OCR first became prominent, most PC and network applications were either
character-based or had very rudimentary graphical features.


Now graphical elements are embedded in the operating environments, and images often are
shared across networks. The future of OCR hangs in large part on these new business
practices.


Just consider how many documents bear logos or graphics of some sort simply because
printers and word processors have made it easy. But OCR captures only the text parts. To
have a full electronic version, it's necessary to store an image of the whole thing.


What does this mean for the paperless office? How will information flow in the typical
office of 2020?


We can feel pretty certain that images and electronic communication will play an
increasing role.


Video and audio annotations are becoming more common. Indexing and storage for complex
multimedia documents will improve. It is likely that most documents will be created,
displayed and disseminated electronically in the future.


Hard copy will appear only at the end user's desk, and then only sometimes. As display
technology improves, fewer people will demand paper printouts, particularly when documents
contain vital audio and video elements.


OCR will continue to play a role, but it will have to be backed up by human
proofreaders until machine intelligence advances enough to proof scanned documents
reliably.


Today's OCR software automatically spell-checks documents. However, software grammar
checking so poorly reflects good writing practice that automated grammar correction is a
long way off.


Also, proofreading requires a great deal of context checking that computers just can't
do and may not be able to do even for several more decades. Considering the move to
multimedia, OCR is more likely to find its star role in indexing and cataloging.


Image scanning can take place virtually without human intervention, but indexing
documents and abstracting their information is highly labor-intensive.


Even if OCR never achieves higher accuracy, it can become a valuable tool for
picking out key words and terms from documents and summarizing them. An
occasional error in recognizing a letter or number won't matter much, because the
image of the original document will still be available for reference.
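

A crude sketch of that kind of keyword picking, in Python; the sample text and
its deliberate misread are invented, and production indexing software is far more
elaborate:

    import re
    from collections import Counter

    # Hypothetical OCR output, errors included ("Goverment" is a misread).
    ocr_text = """The Goverment Printing Office will accept electronic
    submissions of procurement documents beginning next fiscal year.
    Procurement offices should submit documents electronically."""

    STOPWORDS = {"the", "of", "will", "should", "next"}

    words = re.findall(r"[a-z]+", ocr_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)

    # The lone misread lands in the index once; frequent, correctly read
    # terms still dominate, and the page image remains available to check.
    print(counts.most_common(5))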


A knowledgeable worker who scanned a 100-page document for image storage could review
and edit an electronically prepared index and summary in a fraction of the time it would
take to edit the document as OCR text.


That means the truly paperless office will depend less on scanning and OCR than on the
new paradigm of how documents are created, stored, transmitted and viewed electronically.
Hard copy will play only a temporary role.


If that scenario takes hold, there still will be a small flow of physical documents
that need converting to electronic form.


Many scanners bundle a version of OCR software. If this meets your needs, it's
essentially free.




Probably the largest OCR project ever attempted in or out of the federal government is
the Postal Service's scanning of envelope addresses.


This project involves 875 multiline optical character readers, each able to scan
envelopes at a rate of 12 per second, or more than half a million per minute when
every machine runs full blast.
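

The arithmetic behind that figure checks out; the machine count and per-machine
rate come from the numbers above:

    # Throughput check for the USPS scanning figures cited above.
    machines = 875
    envelopes_per_second_each = 12

    per_minute = machines * envelopes_per_second_each * 60
    print(f"{per_minute:,} envelopes per minute")  # 630,000 -- over half a million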


The devices can image addresses, OCR them, look up ZIP codes in postal
directories and ink-jet bar codes onto the envelopes.


Although this is a highly successful OCR implementation, it's unique to USPS and not
very applicable to other agencies planning OCR projects.


There are other government OCR success stories, but most involve scanning a limited
amount of information from forms. The Massachusetts Revenue Department, for example, said
that it has doubled the number of forms each employee can process in a day with high-end
scanners.


John McCormick, a free-lance writer and computer consultant, has been working with
computers since the early 1960s.

