Online Extra: Data cleaning 101
- By Joab Jackson
- Jan 09, 2006
So your office has three databases, and each one has a different spelling of one individual's name. Here's how to clean that data up:Step 1: Identify the data in need of cleaning
Not all data needs to be repaired. Concentrate your resources on the record sets with such low-quality levels that the mistakes could hinder, or already hinder, the project's mission. Try to get a gauge on how bad the quality of the data is. For large record sets, software from companies such as Firstlogic Inc. of LaCrosse, Wis., samples the data and reports on its general usability. Also, locate all the sources of data, if there are more than one.Step 2: Create or identify the ultimate source of the data elements
Parse the data structures into individual data elements, which will give you a framework for building definitive values for these elements. If you have multiple versions of the same basic information, identify the best source for that information. Does one database supercede the others?
Identify external sources of master data. For instance, the Postal Service keeps the authoritative list on ZIP codes as well as a set of rules of what street addresses are found within each ZIP code. Other elements, such as names, may not have a definitive source, such as names ('Jon' and 'John' are both correct spellings).
Compile a list of alternate definitions, such as multiple words for the same entity. Also, define what the null and default values should be for each data element.Step 3: Clean the data
Procure data quality vendor software. This software compares your data with some master source, business rules or with algorithms that can suss out correct forms of different types of information, and then correct the data.
For larger sets of data, you may want to consider setting up a data warehouse, which can aggregate the cleansed data in one central location. Data warehouses require extraction, transformation, transport and load (ETL) software that can move the data from the original databases into the warehouse. This aggregation also cuts down the load on the original database.
Another option is to forgo data cleansing altogether and instead install software with more robust searching capabilities, which could be placed in between the databases and the end users. Companies that offer this software include Identity Systems of Old Greenwich, Conn. Such products look at phonetic matches, near misspellings, variant spellings and other anomalies, returning to users probable matches to their queries.Sources: Frank Dravis, Firstlogic; Ramesh Menon, Identity Systems
Joab Jackson is the senior technology editor for Government Computer News.