Obsolete algorithm tangles terrorist/criminal watch lists
- By Wilson P. Dizard III
- May 20, 2004
Advanced name-matching software now used for some agencies' terrorist and criminal watch lists can handle linguistic differences and Arabic, Chinese, Russian and other non-Roman alphabets.
But the FBI's National Crime Information Center and other watch list databases continue to use variants of Soundex name-matching technology dating back to World War I.
Mark Tanner, director of the FBI's Foreign Terrorist Tracking Task Force, acknowledged that 'NCIC's name-matching is not as robust as some other tools. It's what's available to police officers in [patrol] cars. I think we should upgrade it, and I think funds have been requested for that.'
Zalmai Azmi, CIO of the FBI, said officials are evaluating the replacement of Soundex at NCIC.
Two other terrorist watch lists that use Soundex are the Homeland Security Department's National Automated Immigration Lookout System II and the Interagency Border Inspection System.
According to a report this month by the Congressional Research Service, NAILS II has 3.8 million files, about 58,000 of which concern suspected or known terrorists and their supporters.
NAILS II's Soundex technology was patented almost a century ago, said the CRS report, 'Terrorist Identification, Screening and Tracking under Homeland Security Presidential Directive 6.'
Although IBIS is 'superior to NAILS II in terms of systems performance and name recognition, it is not considered as robust as the [State Department's Consular Lookout and Support System] in terms of certain search functions,' the report said.
As for the FBI's criminal and terrorist database, CRS found that the Soundex technology in NCIC 2000, an upgraded version of the NCIC fielded in 1999, 'is not nearly as robust as the name-recognition technologies built into CLASS. For example, the length of the name field in NCIC is only 28 characters, while it is over 80 in CLASS.'
Homeland Security CIO Steve Cooper said he is concerned that Soundex technology could match names inaccurately. 'But the good news is that we have more sophisticated technologies that will replace it,' he said. 'We are applying some much more robust technologies' to avoid reliance on Soundex, 'which is still the most prevalent algorithm that is applied in a lot of the software for this kind of thing.
'Our science and technology unit, the intelligence community and the Office of the CIO are all looking at and have identified better, newer technologies and will continue to do so,' Cooper said.
Soundex relies on converting a name to a key code, according to specialists in name-matching technology.
In most versions, Soundex takes the first letter of the last name, drops all vowels and assigns a number to the next three consonants. The algorithm then drops the rest of the consonants. For example, the surname Dizard would become D262, the same as the code for Dokerovich or Dagorman.
Prone to mistakes
Soundex technology is inexpensive, but federal officials and vendors agree that it is prone to mistakes, especially with foreign names. Soundex systems do not capture the punctuation of names in non-Roman alphabets and sometimes cannot track alternate spellings of a name.
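As a rough illustration of the key-code scheme described above, the following Python sketch implements the widely documented American Soundex letter table. The table, the function name soundex and the sample names Robert, Rupert, Katherine and Catherine are assumptions chosen for illustration; the article does not say which Soundex variant NCIC, NAILS II or IBIS actually runs.

```python
# Letter-to-digit groups of the common American Soundex variant.
# Assumption: the watch list systems may use a different variant.
GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex key: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same digit
        digit = CODES.get(ch, "")  # vowels and unknown letters map to ""
        if digit and digit != previous:
            key += digit
        previous = digit  # a vowel resets this, so a repeated digit counts again
        if len(key) == 4:
            break  # remaining consonants are dropped
    return key.ljust(4, "0")  # pad short names with zeros


if __name__ == "__main__":
    for surname in ("Robert", "Rupert", "Katherine", "Catherine"):
        print(surname, soundex(surname))
```

Under this table, Robert and Rupert collapse to the same key, R163, which shows how unrelated names can collide. Katherine and Catherine get different keys, K365 and C365, only because the first letter is kept verbatim, which is one way an alternate spelling can slip past the match. The surnames Dizard, Dokerovich and Dagorman come out as D263, D261 and D265 here, sharing a three-character prefix; exact keys can differ between Soundex variants.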
'I consider Soundex yesterday's technology,' said one federal official, speaking on condition of anonymity. 'There is no excuse anymore for not having advanced systems when you can pull them right off the shelf.'
The intelligence community and DHS use name matching not only to detect terrorists crossing borders but also to find, in unstructured information such as intelligence messages, names of individuals who might associate with known terrorists. Ian Hersey, co-founder and senior vice president for corporate development and strategy at Inxight Software Inc. of Sunnyvale, Calif., said intelligence agencies use his company's ThingFinder software to sift names out of terabytes of data.
The CIA's In-Q-Tel venture capital arm has invested in Inxight, and ThingFinder is fielded at the Defense Intelligence Agency and other intelligence agencies, Hersey said.
ThingFinder can identify items such as telephone numbers and serial numbers in unstructured data, but 'the trickier ones are name-type entities,' Hersey said. 'There you have to use the linguistic context. It is computational linguistics.'
John C. Hermanson, chief executive officer of Language Analysis Systems Inc. of Herndon, Va., is a computational linguist whose company has built a database of more than 1 billion names from many countries.
The company's software is used in data mining, partly to apply parameters to names from each language, Hermanson said. The parameters identify what distinguishes names from one another, depending on language and country.
Stuart Patt, a foreign service officer and spokesman for the State Department's Consular Affairs Bureau, said the department has developed sophisticated algorithms for matching Arabic and Russian names, 'and we are building one in Chinese.'
Patt said he recently took a weeklong course in CLASS name-matching methods.
'The purpose of the training is to teach how our system works and why,' he said. State employees use CLASS to check records of people applying for visas.