Take two of these and boost your data quality
- By Gian Di Loreto
- Jul 19, 2011
Data quality, once viewed as a luxury, is now widely regarded as a necessity as enterprises large and small struggle with ever-increasing data stores and new regulations about their mission-critical data. Yet the data quality techniques in use today are still usually simple record-level constraints. We introduce a pair of modern techniques for measuring and improving enterprise data quality.
Some 15 years ago, I sat in a neighborhood courtyard discussing what was then referred to as the Information Superhighway. As a physics graduate student (physicists were among the early adopters), I could think of nothing negative to say. At the table was a professor of Russian studies who said something I’ve never forgotten: "It’s the old question of quantity vs. quality." I disregarded his comments, coming as they did from a liberal arts, and decidedly non-technical, source. Looking back over the years, however, I have come to appreciate the prescience of his remark.
We are now officially overloaded with information. We need robust yet usable techniques to sort the good from the bad and make the most of the data that is fast becoming a valuable resource.
Data quality no longer needs to be justified. However, the term data quality is still used to describe record-level constraints or input-field masks. The modern approach recognizes data quality as a science -- it's a discipline all its own.
The data quality practitioner, a new breed of expert, is the reason that any data quality exercise will succeed or fail, regardless of the technology or the methodology. Implementing the modern data quality techniques we describe here will require resources with a wide variety of overlapping skills. Of course, strong IT skills are necessary, as is robust programming experience. However, data quality is not an IT problem, so it requires a decidedly non-IT solution. Also required are strong writing, presentation, and interpersonal skills. Without this softer skill set, a data quality initiative will fail.
Two elements are important to understand before we discuss the best practices that can help you measure and improve data quality.
1. Understand your subject
The subject is the single most important concept in the modern data quality approach. The subject is the entity which will be the target of the data quality investigation at the most granular level. Before we begin any data quality initiative we must discover what the subject of the study is. Like most concepts in our approach, the subject is a concept reflected in the data but not attached to any IT object.
As a simple example, if your database is an HR implementation with employee status, hours and earnings information, your subject would be the employee. If your data warehouse is based on manufacturing data, your subject could be the inventoried part. If your database contains insurance information, you could have multiple subjects including policy holders, claims, policies, etc.
If you know your data, or as you get to know it, the subject will become apparent to you. Once identified, the subject becomes more than a concept and will define the granularity with which you will measure data quality. “We identified data quality issues with 20 percent of the employees contained in our database” is a more useful statement than “Thirty-eight percent of the rows in the EMG_ODKJ_34E table have a field that fails one of our data criteria.” Furthermore, how do you combine all the record-level errors you measure to provide a complete picture of data quality? You can’t without the subject.
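To make the difference between row-level and subject-level reporting concrete, here is a minimal Python sketch. The records, the field names (`employee_id`, `hours`), and the single record-level check are hypothetical and invented purely for illustration; they are not taken from any real system.

```python
from collections import defaultdict

# Hypothetical HR records; the subject is the employee (employee_id).
records = [
    {"employee_id": "E1", "week": 1, "hours": 40},
    {"employee_id": "E1", "week": 2, "hours": -5},  # fails the check
    {"employee_id": "E1", "week": 3, "hours": -8},  # fails the check
    {"employee_id": "E2", "week": 1, "hours": 38},
    {"employee_id": "E3", "week": 1, "hours": 40},
    {"employee_id": "E4", "week": 1, "hours": 36},
]


def row_fails(record):
    """Illustrative record-level check: hours must not be negative."""
    return record["hours"] < 0


def row_level_failure_rate(rows):
    """Share of rows that fail the check."""
    return sum(row_fails(r) for r in rows) / len(rows)


def subject_level_failure_rate(rows, subject_key):
    """Share of subjects that have at least one failing row."""
    by_subject = defaultdict(list)
    for r in rows:
        by_subject[r[subject_key]].append(r)
    failing = [s for s, rs in by_subject.items() if any(row_fails(r) for r in rs)]
    return len(failing) / len(by_subject)


row_rate = row_level_failure_rate(records)
subject_rate = subject_level_failure_rate(records, "employee_id")
print(f"{row_rate:.0%} of rows fail; {subject_rate:.0%} of employees are affected")
```

In this made-up data the two failing rows belong to the same employee, so the row-level and subject-level figures differ. Only the second one answers the question "how many employees have a problem?"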
The subject also manifests itself in the data visualization. Software can show you the data one subject at a time and allow you to juxtapose data from different sources for the same subject. This allows your eye to pick up patterns and idiosyncrasies in the data you wouldn’t notice by looking at the record level.
Finally, when you process the data programmatically, use software that can process the data at the subject level, one subject at a time. This allows you to code more efficiently, examining, changing, and reporting the data, one subject at a time.
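As a rough sketch of what processing "one subject at a time" can look like, the following Python generator groups records by a subject key and hands them over one subject at a time. The inventory rows and the names `part_id`, `warehouse`, and `on_hand` are assumptions made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def iter_subjects(rows, subject_key):
    """Yield (subject_id, rows_for_subject), one subject at a time.

    groupby only groups adjacent items, so the rows are sorted by the
    subject key first. Sorting holds all rows in memory, which is fine
    for a sketch but not for very large data sets.
    """
    key = itemgetter(subject_key)
    for subject_id, group in groupby(sorted(rows, key=key), key=key):
        yield subject_id, list(group)


# Hypothetical manufacturing data; the subject is the inventoried part.
rows = [
    {"part_id": "P2", "warehouse": "east", "on_hand": 12},
    {"part_id": "P1", "warehouse": "east", "on_hand": 5},
    {"part_id": "P1", "warehouse": "west", "on_hand": 0},
]

# Examine and report on the data one subject at a time.
report = {
    part_id: sum(r["on_hand"] for r in part_rows)
    for part_id, part_rows in iter_subjects(rows, "part_id")
}
print(report)
```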
2. Define your business rules
Like the subject, the concept of the business rule in the modern approach goes beyond the common IT habit of jumping into programming, writing SQL statements that grab "bad data," and calling them business rules. Instead, before it is a piece of code, a business rule is a concept that can be shared with technical and non-technical staff alike.
Create business rules that can be expressed in a simple sentence, agree on them, then program them. The programmatic execution is but one property of the business rule. The rule itself should be independent of ties to a database, table, or field; these associations come later. Each business rule must be designed and understood by the entire team. Later, when you review results with non-technical team members, you can speak a language everybody understands.
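One possible way to keep the rule separate from its implementation is sketched below in Python. The rule names, the sentences, and the fields `status`, `hours`, and `earnings` are hypothetical; the point is only that the sentence exists first and the code is attached to it afterward.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass(frozen=True)
class BusinessRule:
    """A rule as the team agrees on it: a name and a plain sentence."""
    name: str
    statement: str


# Step 1: agree on the rules in plain language (hypothetical examples).
RULES = [
    BusinessRule("NO_HOURS_AFTER_TERMINATION",
                 "A terminated employee should not report any hours."),
    BusinessRule("EARNINGS_REQUIRE_HOURS",
                 "An employee with earnings should have reported hours."),
]


# Step 2: only later, tie each rule to fields and code.
# Each check receives all rows for one subject and returns True on a pass.
def no_hours_after_termination(rows: List[Record]) -> bool:
    return not any(r["status"] == "terminated" and r["hours"] > 0 for r in rows)


def earnings_require_hours(rows: List[Record]) -> bool:
    return not any(r["earnings"] > 0 and r["hours"] == 0 for r in rows)


IMPLEMENTATIONS: Dict[str, Callable[[List[Record]], bool]] = {
    "NO_HOURS_AFTER_TERMINATION": no_hours_after_termination,
    "EARNINGS_REQUIRE_HOURS": earnings_require_hours,
}


def evaluate(subject_rows: List[Record]) -> Dict[str, bool]:
    """Run every agreed rule against one subject's rows."""
    return {rule.name: IMPLEMENTATIONS[rule.name](subject_rows) for rule in RULES}


employee_rows = [
    {"status": "active", "hours": 40, "earnings": 800.0},
    {"status": "terminated", "hours": 8, "earnings": 160.0},
]
results = evaluate(employee_rows)
for rule in RULES:
    verdict = "pass" if results[rule.name] else "FAIL"
    print(f"{verdict}: {rule.statement}")
```

Because the report prints the agreed sentence rather than a table or column name, the results can be reviewed with non-technical team members as they stand.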
A key to successfully building those rules is to first build strong relationships with your subject matter experts (SMEs). The knowledge that will make or break the exercise resides in the heads of your SMEs -- the people most familiar with the data, who know its history, lineage, problems, and idiosyncrasies. All too often, a data quality exercise begins with an over-confident analyst swooping in and walking all over the careful, if low-tech, work done by the SME. It is critical that the data quality practitioner begin the exercise by establishing a relationship of mutual trust and respect with the SMEs.
In the initial phases of the engagement, interview the SME and feel their pain. Each time they explain an issue they see in the data, therein lurks a business rule. Document it, give it a name, and code it.
Once you’ve established a good rapport with your data quality expert, create and execute your business rules, share the results (which the SME will understand because they were involved in the creation of the logic), iterate, and refine. Your goal is to reduce false positives and improve accuracy.
If you feel your rules are stabilizing, prepare random samples of subjects that have passed all, some, and none of your business rules, and provide the samples to your SME to determine whether you’re catching the proper subjects.
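The sampling step can be sketched in a few lines of Python. The subject IDs, the rule names `R1` and `R2`, and the helper names below are hypothetical; the sketch assumes you already have a pass/fail result per rule for each subject.

```python
import random
from typing import Dict, List


def stratum(rule_results: Dict[str, bool]) -> str:
    """Classify a subject as passing 'all', 'some', or 'none' of the rules."""
    passed = sum(rule_results.values())
    if passed == len(rule_results):
        return "all"
    if passed == 0:
        return "none"
    return "some"


def sample_for_review(
    results_by_subject: Dict[str, Dict[str, bool]],
    per_stratum: int,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw a random sample of subject IDs from each stratum for SME review."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    groups: Dict[str, List[str]] = {"all": [], "some": [], "none": []}
    for subject_id, rule_results in results_by_subject.items():
        groups[stratum(rule_results)].append(subject_id)
    return {
        name: rng.sample(sorted(ids), min(per_stratum, len(ids)))
        for name, ids in groups.items()
    }


# Hypothetical rule results for six employees.
results_by_subject = {
    "E1": {"R1": True, "R2": True},
    "E2": {"R1": True, "R2": False},
    "E3": {"R1": False, "R2": False},
    "E4": {"R1": True, "R2": True},
    "E5": {"R1": False, "R2": True},
    "E6": {"R1": True, "R2": True},
}
samples = sample_for_review(results_by_subject, per_stratum=2)
print(samples)
```

Including subjects that passed every rule matters as much as including the failures: it is the only way the SME can spot problem subjects your rules are missing.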
Once you establish such a feedback loop, and the SME can see their ideas implemented across all the data, it becomes a very rewarding task to beat down, together, the data quality issues you see. Without such a team and a smoothly running feedback mechanism, the project will languish and interest will wane on all sides.
To be sure, technical skills are necessary, but that ends up being the easy part. Establish a trusting relationship with the SME and your project will succeed.
Your humble narrator was recently looking at a hybrid car. This particular model has a display that grows leaves (or lets the leaves wither and die) in response to the driver’s fuel efficiency. It seemed ridiculous to me at first, but I have read, and it makes sense to me, that people are programmed to respond to that sort of feedback.
The subject, business rule, and SME concepts, when implemented properly, will provide exactly that: a clean display that all parties can look at to quickly determine whether the quality of the data they all care about is moving in the right direction.
Beyond that, these concepts will provide the motivation and the feedback/reward cycle that we are predisposed to respond to.
You just have to set up the project and the path to data quality will reveal itself to you. It’s a solvable problem, no matter how large or complex your data situation. It’s just a matter of setting well-defined goals, choosing techniques to measure your progress, and maintaining a good, fluid feedback loop with the experts.
Good luck and have fun!
Editor’s Note: Gian Di Loreto is leading two classes at TDWI’s World Conference in San Diego Aug. 7-12, 2011: Hands-On Data Cleansing: A Laboratory Experience and Hands-On Data Quality Assessment: A Laboratory Experience.