Mining Arabic documents

In the months after Sept. 11, 2001, Naquib Hatami could have returned to his native Afghanistan to support the new, emerging government. But he chose to stay put in the U.S. and build tools that could be used in the hunt for terrorists and beyond. That year, Hatami founded CiyaSoft Corp., which developed software for translating Arabic into English. Today, the company is working with the Army to perform optical character recognition of handwritten Farsi, Pashto and Arabic dialects. And those are just a couple of the balls Washington-based CiyaSoft has in the air.
In a GCN conference room, Hatami explained that during the Army project the company discovered, perhaps not surprisingly, that many of the Arabic-written documents making their way back from the field were in bad condition. Whether dirty, smudged, stained, or what have you, many important documents simply weren't ready to undergo optical character recognition. What's more, because of the nature of Arabic writing, even smaller flaws such as dust spots or underlines could confuse the OCR engine.

So on its own dime, CiyaSoft started its Degraded Image Enhancement Program. Hatami demonstrated how the software worked on a variety of marred electronic documents, removing coffee stains or dark backgrounds and making the Arabic more legible prior to OCR. Other products, such as those from Kofax Inc. of Irvine, Calif., perform this type of scanned-document correction for English-language government records.

Hatami said he expects DIEP technology to make its way into the Army's handwriting-recognition program as it becomes "smarter." CiyaSoft is working to have the system fix flaws automatically rather than wait for user input. The company will be demonstrating the system for government clients this month.
