Online Extra: How to avoid getting skewed results

Data sets with missing or inaccurate information obviously can skew data mining results, but how badly?

Data miners need to pay attention to such distortions, said Stephen Langdell, a researcher for Numerical Algorithms Group Ltd. of London.

NAG, a university spinoff, has compiled a library of algorithms that might help U.S. agencies mine data more accurately, he said.

The latest version of the Data Mining and Cleaning Components package uses results from the three-year, $4.5 million Euredit Project, funded by the European Union's statistics agency.

Langdell said data mining procedures should not omit fields that lack data.

'You can't just leave them out because you might be leaving out' the majority of the data, Langdell said.

A demographics expert would need to know if, for example, survey data is missing because some participants refused to answer particular questions.

'It may be a specific group of people, or an age group, so extrapolation would not work,' Langdell said.

Instead, the Euredit algorithms look for distribution patterns in a data set and try to determine the best ways to use the data that is present.
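To make Langdell's point concrete, here is a minimal Python sketch of why nonresponse concentrated in one group matters. It is not the Euredit method. The records, the age groups and the helper `missing_rate_by_group` are all hypothetical and serve only as illustration.

```python
from statistics import mean

# Hypothetical survey records: (age_group, income, or None if unanswered).
records = [
    ("18-34", 30000), ("18-34", 32000), ("18-34", None), ("18-34", None),
    ("35-64", 60000), ("35-64", 62000), ("35-64", 58000), ("35-64", 60000),
]

def missing_rate_by_group(rows):
    """Return the share of missing values within each group."""
    totals, missing = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + 1
        if value is None:
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}

# Half of the younger group did not answer; nobody in the older group refused.
rates = missing_rate_by_group(records)

# Averaging only the answered records silently over-weights the older group.
complete_case_mean = mean(v for _, v in records if v is not None)
```

In this made-up sample, the average of the answered records is pulled toward the older group's incomes, because the younger group supplies only two of the six usable values.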

Another target of Euredit is erroneous or incorrectly entered data, which can skew results wildly, Langdell said. A few values far from the normal distribution will greatly influence the result, especially if subsequent calculations compound the original error.
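A small Python sketch, using invented numbers, shows how much a single mistyped value can move a result. The data and the variable names are assumptions made for illustration and do not come from Euredit.

```python
from statistics import mean, median

# Hypothetical ages; the second list has one value keyed in with an extra digit.
clean = [41, 38, 45, 40, 43]
with_error = [41, 38, 45, 40, 430]

clean_mean = mean(clean)            # average of the correct values
skewed_mean = mean(with_error)      # average after one entry error
skewed_median = median(with_error)  # the median is far less sensitive
```

Here one bad entry nearly triples the mean, while the median stays at 41. Any later calculation built on the mean inherits the distortion.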

'They are really quite dangerous things, and you need to identify them,' he said.

The Euredit algorithms detect incorrect data in two ways. One method checks a suspicious value against the values that are logically possible. For example, if a child's age is correctly recorded, it is physically impossible for the mother's age to be less than the child's.
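The mother-and-child example amounts to a consistency rule between two fields. A minimal Python sketch of such a rule follows. The field names `mother_age` and `child_age` and the function `find_age_violations` are hypothetical, and this is not the Euredit implementation.

```python
def find_age_violations(records):
    """Return the positions of records where the mother is not older than the child."""
    return [
        i for i, r in enumerate(records)
        if r["mother_age"] <= r["child_age"]
    ]

# Hypothetical records; the second one looks like two fields swapped on entry.
records = [
    {"mother_age": 34, "child_age": 6},
    {"mother_age": 3, "child_age": 30},
    {"mother_age": 52, "child_age": 25},
]

violations = find_age_violations(records)
```

A rule like this flags a record for review. It cannot say which of the two fields is wrong.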

Another, less exact method disregards values that fall outside the normal range.
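The article does not say how the "normal range" is defined. One common convention, assumed here purely for illustration, is to flag values that lie more than 1.5 interquartile ranges beyond the quartiles. The function `split_by_iqr` and the sample data are hypothetical.

```python
from statistics import quantiles

def split_by_iqr(values, k=1.5):
    """Split values into (kept, flagged) using interquartile-range fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged

# Hypothetical ages with one implausible entry.
ages = [41, 38, 45, 40, 43, 430]
kept, flagged = split_by_iqr(ages)
```

This approach is less exact than a logical rule because a flagged value may be unusual but correct, so it is safer to review flagged values than to discard them automatically.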

The Euredit algorithms are not yet part of any commercial data mining software, Langdell said, but government statistical offices in Britain, Finland and Italy use them.

To make the library's formulas easy to incorporate into existing or new applications, they are written in ANSI C, Langdell said. Users can write scripts to apply the algorithms to data sets, using programming or scripting languages such as C#, Java, Perl and Python.

- Joab Jackson

