
Watch lists: How do you find Abdul Rasheed when you're looking for Abdul Rashid?

“What’s in a name?” Shakespeare’s Juliet once asked.

Quite a lot, it turns out, especially for analysts trying to match non-Latin names against Western watch lists and financial databases. Name-matching can pose serious challenges for law enforcement and financial compliance officials accustomed to an English-language view of the world.

Analysts have to contend with multiple transliteration variants of foreign names, name order variations and misspellings. Moreover, traditional software approaches for name resolution can miss matches or retrieve information that is not pertinent to the subject matter. It is no longer just about matching Jimmy L. Smith with James Larry Smith but also Abdul Rasheed with Abdul Rashid or Zhang Jing-Quan with Jingquan Zhang, experts say.

“Language has always been a challenge when analyzing data originating from across the globe. There is tremendous variation among the spoken and written words of related but different socio-cultural groups, especially in regions that speak languages with Asian and Arabic origins,” said Bob Flores, president and CEO of Appicology, a consulting firm specializing in technology issues.

“The ability to accurately and efficiently translate, transliterate and analyze language, even as digital text, has been a complex issue that analysts have been trying to address for years,” Flores, a former CTO with the Central Intelligence Agency, said. “Proper names are an even greater challenge given the amount of variation in how each one is represented—nicknames, initials, titles, misspellings.”

With English naming conventions, a person's name generally consists of two words, each drawn from a standard pool of “last names” or “first names” such as Simon, Michael or John, with an optional third name inserted in the middle, Flores explained. But this is not the case with names from other countries and languages.

With Russian, Chinese or Arabic names, Flores said, two problems arise. “The name might appear in the original language and writing system or the language may subscribe to completely separate naming conventions making it difficult to transliterate to the English language or Latin symbols,” Flores said.

These spelling variations in the names of suspects can set up roadblocks for preventative intelligence, some experts say. For instance, the name of the late Tamerlan Tsarnaev, one of the suspects in the Boston Marathon bombing, appeared in English and Russian news reports and even passenger logs with different spellings or transliteration variants.

“The challenge we face with names is finding them, matching them and indexing them,” said Carl Hoffman, CEO of Basis Technology, a provider of multilingual text analytic software used by the U.S. defense and intelligence communities. Basis Technology’s Rosette linguistic platform provides multilingual text retrieval; extraction of names, places and other identifiers from unstructured documents; foreign language translation; and cross-script name-matching.

The Rosette platform supports 40 languages, including Arabic, Chinese, Dari, Farsi, Hebrew, Japanese, Korean, Pashto and Urdu, as well as languages with Latin alphabets such as English, French, Italian, German and Portuguese, Hoffman said.

The Rosette Name Indexer (RNI), a hybrid text analytics tool, uses four main methods for name-matching: common key, edit distance, list and statistical similarity. RNI’s hybrid approach allows the strength of one method to compensate for the weakness of another. For instance, the common key method reduces names to a key or code based on their English pronunciation, so similar-sounding names share the same key. Common key methods can quickly produce high-recall results, but they are not precise, especially for non-Latin script names, which must first be transliterated to Latin characters before the method can be applied.
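To make the common key idea concrete, here is a minimal sketch in Python using American Soundex, a long-established phonetic code. This is an illustration only: the article does not say which key algorithm RNI uses, and the function name `soundex` is our own. It shows both the high recall (Rasheed and Rashid share a key) and the low precision (so does the unrelated Rosado).

```python
# Illustrative common-key method: American Soundex.
# Assumption: this is a stand-in, not the algorithm used by any product
# named in the article.

# Consonants that sound alike in English share a digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Reduce a Latin-script name to a four-character phonetic key."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")  # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not break a run of codes
        code = CODES.get(ch, "")      # vowels and Y have no code
        if code and code != prev:
            key += code
        prev = code                   # a vowel resets the run
    return (key + "000")[:4]          # pad or truncate to four characters


if __name__ == "__main__":
    for name in ("Rasheed", "Rashid", "Rosado"):
        print(name, soundex(name))    # all three print the key R230
```

Because the key is built from English consonant classes, a name written in Arabic or Cyrillic script yields nothing useful until it has been transliterated, which is the limitation noted above.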

A statistical approach assigns a probability that two names match. This method produces high-precision results but may be slower than the common key method. RNI uses the common key method on its “first pass” at matching to take advantage of the high recall and speed. A second pass over the pool of candidates resulting from the first pass uses the slower but highly precise statistical method to make up for the common key’s low precision, company officials said.
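The two-pass idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the first pass groups names by a Soundex key computed per word, and the second pass scores candidates with `difflib.SequenceMatcher` from the standard library, a simple string-similarity ratio standing in for the statistical model, whose details the article does not describe. The helper names, the sample watch list and the 0.8 threshold are all hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Soundex digit classes (illustrative common key, not a vendor algorithm).
CODES = {ch: d for letters, d in (("BFPV", "1"), ("CGJKQSXZ", "2"),
                                  ("DT", "3"), ("L", "4"), ("MN", "5"),
                                  ("R", "6")) for ch in letters}


def soundex(word):
    """Four-character phonetic key for one upper-case ASCII word."""
    key, prev = word[0], CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


def tokens(name):
    """Upper-case words with punctuation dropped ('Jing-Quan' -> 'JINGQUAN')."""
    return re.sub(r"[^A-Za-z\s]", "", name).upper().split()


def block_key(name):
    """First-pass key: sorted per-word keys, so word order does not matter."""
    return tuple(sorted(soundex(t) for t in tokens(name)))


def build_index(watch_list):
    """Group watch-list entries by their first-pass key."""
    index = defaultdict(list)
    for entry in watch_list:
        index[block_key(entry)].append(entry)
    return index


def similarity(a, b):
    """Second-pass score in [0, 1]; a stand-in for a statistical model."""
    na, nb = " ".join(sorted(tokens(a))), " ".join(sorted(tokens(b)))
    return SequenceMatcher(None, na, nb).ratio()


def match(query, index, threshold=0.8):
    """Pass 1: fetch same-key candidates. Pass 2: keep the high scorers."""
    candidates = index.get(block_key(query), [])
    scored = [(similarity(query, c), c) for c in candidates]
    return sorted((s, c) for s, c in scored if s >= threshold)[::-1]


if __name__ == "__main__":
    watch_list = ["Abdul Rasheed", "Abdul Rosado", "Jingquan Zhang",
                  "James Larry Smith"]
    index = build_index(watch_list)
    for query in ("Abdul Rashid", "Zhang Jing-Quan"):
        print(query, "->", [name for _, name in match(query, index)])
```

In this toy watch list, "Abdul Rosado" lands in the same first-pass group as "Abdul Rasheed" but is dropped by the second pass, which is the division of labor the passage describes: a cheap key lookup narrows the pool, and the costlier comparison runs only on the survivors.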

Basis Technology has created a critical solution to the linguistics puzzle, Appicology’s Flores said. The company’s name-matching platform has been integrated into several government applications, increasing the effectiveness and accuracy of identity resolution systems and name search applications, he noted.

Other companies also offer government agencies name-matching solutions. For instance, NetOwl’s NameMatcher provides name-matching capabilities for names spanning many cultures, ethnicities and languages. NameMatcher uses machine learning to tackle the name variant challenge. It automatically learns and generates tens of thousands of name-matching rules, avoiding manual development and rule tuning, both of which are labor intensive, according to company officials. 

About the Author

Rutrell Yasin is a freelance technology writer for GCN.

Reader Comments

Tue, Sep 17, 2013 Sundar

Once Sreenivasa Sista and I attempted to "learn" the equivalence between names in Telugu, Tamil, and English using his custom machine learning engine while we worked at Yahoo!. That engine would map based on common transcriptions as opposed to literal transliterations, because when names are transcribed, spellings follow conventions rather than the literal sequence of phonemes. For example, in "Bala", a common Indian name, both the a's are long vowels, but the name is almost never written as "Baalaa" in English.

Wed, Sep 11, 2013

And none of it does any good if your name, age, state of residence, and profession are the same as a convicted felon. - Not posting my name for obvious reasons.
