Online Extra: How to avoid getting skewed results
Data sets with missing or inaccurate information obviously can skew data mining results, but how badly?
Data miners need to pay attention to such distortions, said Stephen Langdell, a researcher for Numerical Algorithms Group Ltd. of London.
NAG, a university spinoff, has compiled a library of algorithms that might help U.S. agencies mine data more accurately, he said.
The latest version of the Data Mining and Cleaning Components package uses results from the three-year, $4.5 million Euredit Project, funded by the European Union's statistics agency.
Langdell said data mining procedures should not omit fields that lack data.
'You can't just leave them out because you might be leaving out' the majority of the data, Langdell said.
A demographics expert would need to know if, for example, survey data is missing because some participants refused to answer particular questions.
'It may be a specific group of people, or an age group, so extrapolation would not work,' Langdell said.
Instead, the Euredit algorithms look for distribution patterns in a data set and try to determine the best ways to use the data that is present.
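Langdell's concern about a specific group declining to answer can be checked before any analysis begins. The following is a minimal Python sketch, not the Euredit method: the survey rows, the age groups and the function name missing_rate_by_group are all hypothetical, made up for illustration. It reports what share of answers is missing in each group, so that a lopsided pattern is visible before anyone extrapolates.

```python
from collections import defaultdict

# Hypothetical survey rows: (age_group, reported_income), None if unanswered.
ROWS = [
    ("18-29", 21000), ("18-29", None), ("18-29", None), ("18-29", None),
    ("30-49", 42000), ("30-49", 45000), ("30-49", None),
    ("50+", 38000), ("50+", 39000), ("50+", 41000),
]


def missing_rate_by_group(rows):
    """Return the share of unanswered values for each group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(ROWS)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} missing")
```

In this made-up sample, most of the missing answers come from one age group, which is the situation in which simply dropping incomplete records or extrapolating from the rest would mislead.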
Another target of Euredit is erroneous or incorrectly entered data, which can skew results wildly, Langdell said. Even a few values that lie far from the rest of the distribution can greatly influence the result, especially if subsequent calculations compound the original error.
'They are really quite dangerous things, and you need to identify them,' he said.
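How much damage one bad value can do is easy to demonstrate. The sketch below uses invented numbers, not Euredit data: a list of seven hypothetical ages, and a copy in which one age of 33 has been keyed in as 330. It compares the mean, which absorbs the error, with the median, which barely moves.

```python
from statistics import mean, median

# Hypothetical ages; the values are invented for illustration.
clean_ages = [31, 34, 29, 40, 36, 33, 38]
# The same list with one keying error: 33 entered as 330.
dirty_ages = [31, 34, 29, 40, 36, 330, 38]

clean_mean = mean(clean_ages)
dirty_mean = mean(dirty_ages)
clean_median = median(clean_ages)
dirty_median = median(dirty_ages)

print(f"mean:   {clean_mean:.1f} -> {dirty_mean:.1f}")
print(f"median: {clean_median} -> {dirty_median}")
```

A single mistyped entry more than doubles the mean here, and any later calculation built on that mean inherits the error.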
The Euredit algorithms detect incorrect data in two ways. One method involves checking a suspicious value against the values it could logically take. For example, a mother's age cannot be less than her child's, so if the child's age is correctly recorded, a smaller value in the mother's field must be an error.
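A logical check of this kind can be written as a simple rule over each record. The sketch below is an illustration under stated assumptions, not the library's interface: the field names mother_age and child_age, the sample records and the function find_age_conflicts are hypothetical. It also flags equal ages, which are just as impossible as a younger mother.

```python
def find_age_conflicts(records):
    """Return the records in which the mother is not older than her child."""
    return [r for r in records if r["mother_age"] <= r["child_age"]]


# Hypothetical records; record 2 looks like two fields were swapped.
RECORDS = [
    {"id": 1, "mother_age": 34, "child_age": 8},
    {"id": 2, "mother_age": 7, "child_age": 29},
    {"id": 3, "mother_age": 41, "child_age": 15},
]

conflicts = find_age_conflicts(RECORDS)
for record in conflicts:
    print("check record", record["id"])
```

The rule only says that the pair of values is inconsistent. Deciding which of the two fields is wrong still requires other evidence, such as the child's age being known to be correct.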
Another, less exact method disregards values that fall outside the normal range.
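The article does not say how Euredit defines the normal range. One common stand-in, shown here purely as an assumption, is Tukey's interquartile rule: treat anything more than 1.5 times the interquartile range beyond the first or third quartile as out of range. The function name split_by_iqr, the multiplier k and the sample ages are hypothetical.

```python
from statistics import quantiles


def split_by_iqr(values, k=1.5):
    """Split values into (kept, dropped) using Tukey's interquartile fences.

    This is a generic rule chosen for illustration, not the Euredit method.
    """
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartiles
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped


# Hypothetical ages with one keying error (330).
ages = [31, 34, 29, 40, 36, 330, 38]
kept, dropped = split_by_iqr(ages)
print("kept:", kept)
print("dropped:", dropped)
```

As the article notes, this approach is less exact than a logical check: a value can fall outside the fences and still be genuine, so dropped values are better treated as candidates for review than as proven errors.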
The Euredit algorithms are not yet part of any commercial data mining software, Langdell said, but government statistical offices in Britain, Finland and Italy use them.
To make the library's formulas easy to incorporate into existing or new applications, they are written in ANSI C, Langdell said. Users can write scripts to apply the algorithms to data sets, using programming or scripting languages such as C#, Java, Perl and Python.