Take two of these and boost your data quality


Data quality, once viewed as a luxury, is now widely regarded as a necessity as enterprises large and small struggle with ever-increasing data stores and new regulations about their mission-critical data. Yet the data quality techniques in use today are still usually simple record-level constraints. We introduce a pair of modern techniques for measuring and improving enterprise data quality.

Some 15 years ago, I sat in a neighborhood courtyard discussing what was then referred to as the Information Superhighway. As a physics graduate student (physicists were among the early adopters), I could think of nothing negative to say. At the table was a professor of Russian studies who said something I’ve never forgotten: "It’s the old question of quantity vs. quality." I disregarded his comment, coming as it did from a liberal arts, and decidedly non-technical, source. Looking back over the years, however, I have come to appreciate the prescience of his remark.

We are now officially overloaded with information. We need robust yet usable techniques to sort the good from the bad and make the most of the data that is fast becoming a valuable resource.

Data quality no longer needs to be justified. However, the term data quality is still used to describe record-level constraints or input-field masks. The modern approach recognizes data quality as a science -- it's a discipline all its own.

The data quality practitioner, a new breed of expert, is the reason that any data quality exercise will succeed or fail, regardless of the technology or the methodology. Implementing the modern data quality techniques we describe here will require resources with a wide variety of overlapping skills. Of course, strong IT skills are necessary, as is robust programming experience. However, data quality is not an IT problem, so it requires a decidedly non-IT solution. Also required are strong writing, presentation, and interpersonal skills. Without this softer skill set, a data quality initiative will fail.

Two elements are important to understand before we discuss the best practices that can help you measure and improve data quality.

1. Understand your subject

The subject is the single most important concept in the modern data quality approach. The subject is the entity which will be the target of the data quality investigation at the most granular level. Before we begin any data quality initiative we must discover what the subject of the study is. Like most concepts in our approach, the subject is a concept reflected in the data but not attached to any IT object.

As a simple example, if your database is an HR implementation with employee status, hours and earnings information, your subject would be the employee. If your data warehouse is based on manufacturing data, your subject could be the inventoried part. If your database contains insurance information, you could have multiple subjects including policy holders, claims, policies, etc.

If you know your data, or as you get to know it, the subject will become apparent to you. Once identified, the subject becomes more than a concept and will define the granularity with which you will measure data quality. “We identified data quality issues with 20 percent of the employees contained in our database” is a more useful statement than “Thirty-eight percent of the rows in the EMG_ODKJ_34E table have a field that fails one of our data criteria.” Furthermore, how do you combine all the record-level errors you measure to provide a complete picture of data quality? You can’t without the subject.
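As a rough illustration of the difference between those two statements, the following Python sketch rolls record-level check results up to the subject. The employee IDs, the pass/fail values, and both function names are invented for the example; they do not come from any real system.

```python
from collections import defaultdict

# Hypothetical record-level check results: (employee_id, record_passed).
# The employee is the subject; each tuple is one record belonging to it.
records = [
    ("E1", True),
    ("E1", False),
    ("E2", True),
    ("E2", True),
    ("E3", False),
    ("E4", True),
    ("E5", True),
]


def record_failure_rate(results):
    """Fraction of individual records that failed a check."""
    if not results:
        return 0.0
    return sum(1 for _, passed in results if not passed) / len(results)


def subject_failure_rate(results):
    """Fraction of subjects that have at least one failing record."""
    failed_by_subject = defaultdict(bool)  # subject_id -> has any failure
    for subject_id, passed in results:
        failed_by_subject[subject_id] = failed_by_subject[subject_id] or not passed
    if not failed_by_subject:
        return 0.0
    return sum(failed_by_subject.values()) / len(failed_by_subject)


print(f"records failing:  {record_failure_rate(records):.0%}")
print(f"subjects failing: {subject_failure_rate(records):.0%}")
```

In this made-up data, two of seven records fail, but they belong to two of five employees, so the subject-level figure is the one that tells the business how many employees are affected.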

The subject also manifests itself in the data visualization. Software can show you the data one subject at a time and allow you to juxtapose data from different sources for the same subject. This allows your eye to pick up patterns and idiosyncrasies in the data you wouldn’t notice by looking at the record level.

Finally, when you process the data programmatically, use software that can process the data at the subject level, one subject at a time. This allows you to code more efficiently, examining, changing, and reporting the data, one subject at a time.
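One possible way to arrange data for subject-at-a-time processing is sketched below in Python. It assumes two hypothetical sources (hours and earnings) that share an `employee_id` key; the field names and the `group_by_subject` helper are illustrative only.

```python
from collections import defaultdict

# Hypothetical rows from two separate sources, both keyed by employee_id.
hours_rows = [
    {"employee_id": "E1", "week": 1, "hours": 40},
    {"employee_id": "E2", "week": 1, "hours": 38},
    {"employee_id": "E1", "week": 2, "hours": 42},
]
earnings_rows = [
    {"employee_id": "E1", "week": 1, "gross": 1000.0},
    {"employee_id": "E2", "week": 1, "gross": 950.0},
]


def group_by_subject(sources, key="employee_id"):
    """Collect the rows of every named source under the subject they belong to."""
    subjects = defaultdict(lambda: {name: [] for name in sources})
    for name, rows in sources.items():
        for row in rows:
            subjects[row[key]][name].append(row)
    return dict(subjects)


subjects = group_by_subject({"hours": hours_rows, "earnings": earnings_rows})

# Examine the data one subject at a time, with both sources side by side.
for subject_id in sorted(subjects):
    data = subjects[subject_id]
    print(subject_id, "hours:", data["hours"], "earnings:", data["earnings"])
```

Once the rows are grouped this way, any check or report can be written against a single subject's data rather than against individual tables.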

2. Define your business rules

Like the subject, the concept of the business rule in the modern approach goes beyond the standard IT definition: jumping into programming, creating SQL statements that grab "bad data" and calling them business rules. Instead, before it is a piece of code, a business rule is a concept that can be shared with technical or non-technical staff alike.

Create business rules that can be expressed in a simple sentence, agree on them, then program them. The programmatic execution is but one property of the business rule. The rule itself should be independent of ties to a database, table, or field; these associations come later. Each business rule must be designed and understood by the entire team. Later, when you review results with non-technical team members, you can speak a language everybody understands.
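To make the idea concrete, here is a minimal Python sketch in which the plain-language sentence is stored alongside the code that executes it. The `BusinessRule` class, the example rule, and the sample employees are all hypothetical, and the rule operates on a subject's data rather than on any particular table or field.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BusinessRule:
    name: str                      # short identifier agreed with the team
    sentence: str                  # the rule as a simple sentence
    check: Callable[[dict], bool]  # returns True when the subject passes


def weeks(rows):
    """Set of week numbers present in a list of rows."""
    return {row["week"] for row in rows}


# Hypothetical rule: the sentence comes first, the code is one property of it.
no_pay_without_hours = BusinessRule(
    name="no_pay_without_hours",
    sentence="An employee paid for a week must have hours recorded for that week.",
    check=lambda subject: weeks(subject["earnings"]) <= weeks(subject["hours"]),
)

# Made-up subjects, each holding that employee's rows from two sources.
consistent_employee = {
    "hours": [{"week": 1}, {"week": 2}],
    "earnings": [{"week": 1}],
}
inconsistent_employee = {
    "hours": [{"week": 1}],
    "earnings": [{"week": 1}, {"week": 2}],
}

for label, subject in [("consistent", consistent_employee),
                       ("inconsistent", inconsistent_employee)]:
    outcome = "passes" if no_pay_without_hours.check(subject) else "fails"
    print(f"{label} employee {outcome}: {no_pay_without_hours.sentence}")
```

Because the sentence travels with the rule, results can be reported to non-technical team members in the same words they agreed to.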

A key to successfully building those rules is to first build strong relationships with your subject matter experts (SMEs). The knowledge that will make or break the exercise resides in the heads of your SMEs -- the people most familiar with the data, who know its history, lineage, problems, and idiosyncrasies. All too often, a data quality exercise begins with an over-confident analyst swooping in and walking all over the careful, if low-tech, work done by the SME. It is critical that the data quality practitioner begin the exercise by establishing a relationship of mutual trust and respect with the SMEs.

In the initial phases of the engagement, interview the SME and feel their pain. Each time they explain an issue they see in the data, therein lurks a business rule. Document it, give it a name, and code it.

Once you’ve established a good rapport with your SME, create and execute your business rules, share the results (which the SME will understand because they were involved in creating the logic), iterate, and refine. Your goal is to reduce false positives and improve accuracy.

When you feel your rules are stabilizing, prepare random samples of subjects that have passed all, some, and none of your business rules, and give the samples to your SME to determine whether you’re catching the proper subjects.
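The sampling step might look something like the following Python sketch. It assumes rule outcomes have already been collected per subject as a list of pass/fail values; the subject IDs, the outcomes, and the function names are invented for illustration.

```python
import random


def stratify(results):
    """Split subject IDs into those that passed all, some, or none of the rules.

    `results` maps subject_id -> list of booleans, one per business rule.
    """
    buckets = {"all": [], "some": [], "none": []}
    for subject_id, outcomes in results.items():
        if all(outcomes):
            buckets["all"].append(subject_id)
        elif any(outcomes):
            buckets["some"].append(subject_id)
        else:
            buckets["none"].append(subject_id)
    return buckets


def sample_for_review(results, per_bucket, seed=None):
    """Draw up to `per_bucket` random subjects from each bucket for the SME."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return {
        name: rng.sample(ids, min(per_bucket, len(ids)))
        for name, ids in stratify(results).items()
    }


# Hypothetical outcomes of three rules for six employees.
rule_results = {
    "E1": [True, True, True],
    "E2": [True, False, True],
    "E3": [False, False, False],
    "E4": [True, True, True],
    "E5": [False, True, False],
    "E6": [True, True, True],
}

for bucket, ids in sample_for_review(rule_results, per_bucket=2, seed=1).items():
    print(bucket, ids)
```

Including subjects that passed every rule lets the SME look for problems the rules miss, not only for false positives among the flagged subjects.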

Once you establish such a feedback loop, and the SME can see their ideas implemented across all the data, it becomes a very rewarding task to beat down together the data quality issues you see. Without such a team and a smoothly running feedback mechanism, the project will languish and interest will wane on all sides.

To be sure, technical skills are necessary, but that ends up being the easy part. Establish a trusting relationship with the SME and your project will succeed.

Conclusion

Your humble narrator was recently looking at a hybrid car. This particular model has a display that grows leaves (or lets them wither and die) in response to the fuel efficiency of the driver’s driving. It seemed ridiculous to me at first, but I have read, and it makes sense to me, that people are programmed to respond to that sort of feedback.

The subject, business rule, and SME concepts, when implemented properly, will provide exactly that: a clean display that all parties can look at to quickly determine whether the quality of the data they all care about is moving in the right direction.

Beyond that, these concepts will provide the motivation and the feedback/reward cycle that we are predisposed to respond to.

You just have to set up the project and the path to data quality will reveal itself to you. It’s a solvable problem, no matter how large or complex your data situation. It’s just a matter of setting well-defined goals, choosing techniques to measure your progress, and maintaining a good, fluid feedback loop with the experts.

Good luck and have fun!

Editor’s Note: Gian Di Loreto is leading two classes at TDWI’s World Conference in San Diego Aug. 7-12, 2011: Hands-On Data Cleansing: A Laboratory Experience and Hands-On Data Quality Assessment: A Laboratory Experience.
