Machine learning tool cleans dirty data

Big data is a big deal, but problems within the data can skew results and lead to bad decisions. To help keep data -- and the decisions based on it -- clean, researchers at Columbia University and the University of California, Berkeley, have developed new software.

ActiveClean analyzes prediction models to determine which mistakes (e.g., typos, outliers and missing values) to edit first, updating the models in the process, according to Columbia.

“Big data sets are still mostly combined and edited manually, aided by data-cleaning software like Google Refine and Trifacta or custom scripts developed for specific data-cleaning tasks,” university officials said. “The process consumes up to 80 percent of analysts’ time as they hunt for dirty data, clean it, retrain their model and repeat the process. Cleaning is largely done by guesswork.”

To reduce data-cleaning mistakes, ActiveClean takes humans out of the two most error-prone steps of data cleaning: finding dirty data and updating the model. The tool uses machine learning to analyze a model’s structure, determine which errors are most likely to throw it off, and then clean just enough data to produce “reasonably accurate” models.
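The article does not publish ActiveClean's internals, but the idea it describes -- score records by how strongly they pull the current model, clean the highest-impact ones first, then retrain -- can be sketched roughly as follows. Everything here (the toy logistic-regression model, the gradient-magnitude scoring, all function names) is an illustrative assumption, not the tool's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=200):
    """Plain batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def gradient_magnitudes(X, y, w):
    """Per-record gradient norms: a rough proxy for how strongly
    each record pulls on the model's parameters."""
    residual = sigmoid(X @ w) - y          # shape (n,)
    return np.abs(residual) * np.linalg.norm(X, axis=1)

# Toy data: generate clean labels, then corrupt a slice of them
# (flipped labels stand in for typos / bad values in real data).
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y_clean = (sigmoid(X @ true_w) > 0.5).astype(float)
y = y_clean.copy()
dirty_idx = rng.choice(n, size=100, replace=False)
y[dirty_idx] = 1 - y[dirty_idx]

# Train on the dirty data, then "clean" only the top-k records
# whose gradients suggest they distort the model the most.
w = train_logreg(X, y)
scores = gradient_magnitudes(X, y, w)
candidates = dirty_idx[np.argsort(-scores[dirty_idx])]
top_k = candidates[:50]
y[top_k] = y_clean[top_k]                 # simulate human cleaning

w_after = train_logreg(X, y)
acc_before = np.mean((sigmoid(X @ w) > 0.5) == y_clean)
acc_after = np.mean((sigmoid(X @ w_after) > 0.5) == y_clean)
print(f"accuracy before cleaning: {acc_before:.2f}, after: {acc_after:.2f}")
```

The point of the sketch is the prioritization step: rather than cleaning all 100 corrupted records, it fixes only the 50 the model scores as most influential, which is the "clean enough data" behavior the researchers describe.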

To see how well it worked, the researchers compared the tool’s results against two baseline methods: retraining a model on an edited subset of the data, and an active-learning approach, which prioritizes labeling the data points the model is least certain about.

The approaches were applied to ProPublica’s Dollars for Docs database of 240,000 records on corporate donations to doctors, nearly a quarter of which had more than one name for a drug or company -- inconsistencies that could lead journalists at the nonprofit news organization to miscount donations, for example. Without data cleaning, a model of this dataset could predict an improper donation 66 percent of the time, while ActiveClean raised that rate to 90 percent after cleaning only 5,000 records, Columbia said. To reach the same rate, the active learning method had to be applied to 50,000 records.

“As datasets grow larger and more complex, it’s becoming more and more difficult to properly clean the data,” said study coauthor Sanjay Krishnan, a graduate student at UC Berkeley. “ActiveClean uses machine learning techniques to make data cleaning easier while guaranteeing you won’t shoot yourself in the foot.”

The open source tool has been available to download for free since August.

About the Author

Stephanie Kanowitz is a freelance writer based in northern Virginia.

