Mike Daconta

COMMENTARY

10 flaws with the data on Data.gov

Recently released high-value datasets reveal 10 types of deficiencies

Transparency should be a three-legged stool of awareness, access and accuracy. Data.gov, the federal government’s data Web portal, is focusing on the second leg of the stool: access. Of the three, accuracy, which is part of data quality, is the most difficult to achieve but also the most important. If government data is untrustworthy, the government defaults on its backstop role in society.


So, can you trust the data provided on Data.gov? A cursory examination of the newly released high-value datasets revealed 10 types of quality deficiencies.

1. Omission errors. These violate the quality characteristic of completeness. The No. 1 idea on datagov.ideascale.com, the Data.gov collaboration site, is to provide definitions for every column, yet many Data.gov datasets do not provide them. Another type of omission occurs when dataset fields are sparsely populated, leaving blank the key fields that make the data relevant. For example, a dataset on recreation sites should include each site's location. Furthermore, many datasets use codes but omit the complete code lists needed to validate the data. Finally, many Extensible Markup Language documents are published without the XML schemas used to validate them, even when the schemas clearly exist.
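That last omission matters because a published schema lets any consumer check a download automatically. Here is a minimal sketch in Python, using the lxml library and hypothetical file names (recreation_sites.xml and recreation_sites.xsd), of what that check could look like if an agency published its schema alongside the data:

# Minimal sketch: validate a downloaded XML dataset against its published schema.
# The file names are hypothetical placeholders, not actual Data.gov artifacts.
from lxml import etree

schema = etree.XMLSchema(etree.parse("recreation_sites.xsd"))  # the published schema
data_doc = etree.parse("recreation_sites.xml")                 # the dataset itself

if schema.validate(data_doc):
    print("Dataset conforms to its schema.")
else:
    for error in schema.error_log:                             # one entry per violation
        print(error.line, error.message)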

2. Formatting errors. These are violations of the quality characteristic of consistency. Examples include comma-separated value files that lack header lines and CSV values that are incorrectly quoted. The category also covers poorly formatted number and date values; for example, we still see dates such as “5-Feb-10” with a two-digit year.
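As a rough illustration of the checks that catch these problems, the sketch below uses Python's standard csv module on a hypothetical file named sites.csv with an assumed REPORT_DATE column, confirming that a header line exists and that dates parse as four-digit ISO dates rather than ambiguous two-digit years:

# Rough sketch of basic CSV format checks; the file name and the
# REPORT_DATE column are assumptions for illustration only.
import csv
from datetime import datetime

with open("sites.csv", newline="") as f:
    if not csv.Sniffer().has_header(f.read(4096)):
        raise ValueError("No header line: column meanings are ambiguous")
    f.seek(0)
    reader = csv.DictReader(f)                  # handles properly quoted commas
    for row_num, row in enumerate(reader, start=2):
        try:
            datetime.strptime(row["REPORT_DATE"], "%Y-%m-%d")   # expect 2010-02-05, not 5-Feb-10
        except ValueError:
            print(f"Row {row_num}: malformed or ambiguous date {row['REPORT_DATE']!r}")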

3. Accuracy errors. These are violations of the quality characteristic of correctness. One example is a violated range constraint, such as a dataset containing an implausible number like “47199998999988888…”
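A range constraint is easy to enforce once the agency states it. The fragment below is a minimal sketch that assumes a numeric field, hypothetically named ACRES, documented as falling between 0 and 1,000,000; the field name and bounds are illustrative, not taken from any actual dataset:

# Minimal sketch of a range check; the ACRES field and its bounds are
# hypothetical stand-ins for whatever constraint the agency documents.
def check_range(value, low=0.0, high=1_000_000.0):
    try:
        number = float(value)
    except ValueError:
        return f"not a number: {value!r}"
    if not (low <= number <= high):
        return f"out of range: {value!r}"
    return None                                   # value passes the check

print(check_range("47199998999988888"))           # flags the implausible value above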

4. Incorrectly labeled records. These are also violations of the quality characteristic of correctness. Unfortunately, agencies are confused about when to use CSV files versus Excel files. Some datasets labeled as CSV files are not record-oriented, as CSV files must be; they are merely CSV exports of Microsoft Excel worksheets. This indicates a need for more education and training in information management skills.

5. Access errors. These are violations of correct metadata description. Some datasets advertise that they provide raw data, but clicking the link sends you to a Web site that does not actually provide it.

6. Poorly structured data. These are violations of correct metadata description and relevance. Some datasets are formatted as CSV or XML with little regard for how the data will be used. Specifically, some datasets are structured in a non-record-oriented manner in which field names are embedded as data values.
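To make the distinction concrete, the sketch below converts a hypothetical non-record-oriented layout, in which each row carries an entity ID, a field name and a value, back into one record per entity. The input file and its column names are assumptions invented for this example:

# Sketch: pivot "field name embedded as a data value" rows back into records.
# The input layout (site_id, field, value) is a hypothetical example of the anti-pattern.
import csv
from collections import defaultdict

records = defaultdict(dict)
with open("sites_unpivoted.csv", newline="") as f:
    for row in csv.DictReader(f):                # e.g. 101,LATITUDE,44.05
        records[row["site_id"]][row["field"]] = row["value"]

for site_id, fields in records.items():          # one record per site, with real columns
    print(site_id, fields)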

7. Nonnormalized data. These errors violate the principles of database normalization, which aim to reduce redundant data. Some datasets have repeated fields and needlessly duplicated field values.
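As a simple illustration of the principle, this sketch factors a value that is duplicated on every row, hypothetically an agency name, into a small lookup table so it is stored once. The rows and field names are made up for the example:

# Sketch of basic normalization: the rows and the repeated "agency" column
# are hypothetical; the point is that the duplicated value is stored only once.
rows = [
    {"site": "Trailhead A", "agency": "Bureau of Land Management"},
    {"site": "Trailhead B", "agency": "Bureau of Land Management"},
]

agencies = {}                                     # lookup table: agency name -> id
normalized = []
for row in rows:
    agency_id = agencies.setdefault(row["agency"], len(agencies) + 1)
    normalized.append({"site": row["site"], "agency_id": agency_id})

print(agencies)        # {'Bureau of Land Management': 1}
print(normalized)      # each row now carries only a small foreign key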

8. Raw database dumps. Although this is more of a metadata issue than a data quality issue, it certainly violates the principle of relevance. These datasets have files with names such as table1, table2, etc., and are clearly raw database dumps exported to CSV or XLS. Unfortunately, such dumps are usually poorly formatted, have no associated business rules and use terse field names.

9. Inflation of counts. Although this is also a metadata quality issue, many datasets are differentiated only by year or geography, which clutters search results. A simple solution is to allow multiple files per dataset, thereby combining these by-dimension variants into a single search hit.

10. Inconsistent data granularity. This is yet another metadata quality issue that goes to the purpose of Data.gov and its utility for the public. Some datasets are highly aggregated while others are extremely detailed, and no metadata field distinguishes one from the other.

So what can we do? Here are three basic steps: Attract more citizen involvement to police the data; implement the top ideas on datagov.ideascale.com; and ensure that agency open-government plans address, in detail, their data quality processes.

Reader Comments

Tue, Mar 16, 2010 Peter Evans-Greenwood Australia

Hard to argue with, and hardly surprising given that the stick was used, rather than the carrot, to get the data out in the open. The incentive is just to get a bunch 'o data out in the open, with little regard for usefulness or accuracy. Using a mandate to force departments to open up and collaborate has always worried me. As I've said elsewhere: "You don’t create peace by starting a war, and nor do you create open and collaborative government through top down directives. We can do better." We really need to think about the role government will play in a post-Web 2.0 world[1], and stop thinking in terms of us (the open and collaborative Gov 2.0 people) and them (the evil, bureaucratic public service). A lot of departments and regions are doing a good job of getting their services and data out in the open. Let's celebrate them, and help them be successful. If they are successful, and this new openness does bring the benefits we claim, then everyone else will join in. After all, no one wants to be on the losing team. http://peter.evans-greenwood.com/2010/02/11/what-is-the-role-of-government-in-a-web-2-0-world/

Fri, Mar 12, 2010 Owen Ambur Silver Spring, MD

With reference to your first point, XML schemas should be published on agency Web sites for all of their data collections. Those schemas should include definitions for each of the elements they contain. Intermediary services like Data.gov and USA.gov should index those elements and definitions to enable query/discovery and analysis of the data and information to which they pertain.

Fri, Mar 12, 2010 Li Ding

That's a good summary, and we at RPI have similar findings; see http://tw.rpi.edu/weblog/2009/07/31/current-issues-in-datagov/

Fri, Mar 12, 2010

The problem is that the government is given a mandate to have X number of datasets online by Y date, and meeting that mandate has high visibility and priority within the organization. I do not think we are taking the quality out of the data; it is not in there to begin with. So it is a case of successive approximation: we put up data, you say what is wrong with it, and over a long time we work to fix it, unless the whole thing is overtaken by the next mandate du jour.

Thu, Mar 11, 2010 Richard Ordowich

Mike, I agree that quality is a critical factor for this data, and the data quality dimensions you identified are key. There is one dimension that I believe is overarching, and that is relevancy. What knowledge can be derived from the data, and what actions could be taken as a result of this information? If the data projects the message “look how well we’re doing,” then the data has been filtered through that lens. Annual reports are a good example of this. Although the guidelines for producing an annual report specify transparency and accuracy, the “message” is predominantly positive. What happens to the data that does not project that message? It is buried in the notes to the financials.

To what degree does the data provide transparency? Can that be a quality measure? What filtering processes does the data go through before it is published? Each level of filtering removes some of the data, and if too much is filtered out, the reader is left with limited visibility.
