Ready for reuse?

The government’s effort to make data available will require common formats, data governance and a conscious effort to keep the data raw, which means resisting the urge to put it into context.

The White House's recently launched Data.gov site sets a new standard for the presentation of government data. Prepping such data, however, will be no small task.

Last month, the Obama administration unveiled a Web site that offers the public access to data feeds from various agencies. Although the number of feeds is modest, the site’s debut could signal a radical new way in which government agencies must handle and release their data.

Spearheaded by federal chief information officer Vivek Kundra and the CIO Council, Data.gov was assembled in less than three months. The initial offering has links to 47 datasets from a variety of agencies available in a variety of formats. For example, the U.S. Geological Survey has submitted its National Geochemical Survey database, which offers a nationwide analysis of soil and stream sediments, and the Fish and Wildlife Service has presented data about waterfowl migration flyways.

For Web surfers, the site might not seem very exciting. Its intended audience is not casual users but software developers. Federal officials are presenting data in machine-readable formats to make it easy for private or nonprofit organizations or individuals to reuse the information to build applications and services. By providing the raw material, officials hope that "the public sector [can] solve some of the toughest problems that this country faces," Kundra said in an interview with Government Computer News.

By getting Data.gov up and running, the administration is trying to make good on its promise for government transparency and answer the growing call for data-oriented interaction between the public and agencies.

"In order for public data to benefit from the same innovation and dynamism that characterize private parties’ use of the Internet, the federal government must reimagine its role as an information provider," wrote David Robinson and fellow Princeton University academics in the Yale Journal of Law and Technology. "Rather than struggling, as it currently does, to design sites that meet each end-user need, it should focus on creating a simple, reliable and publicly accessible infrastructure that 'exposes' the underlying data."

However, creating such an infrastructure will involve some challenges in the coming years.

Harbinger of change

Data.gov offers data in several formats, including Extensible Markup Language (XML), comma-separated values (CSV), plain text, Keyhole Markup Language and ESRI’s shapefile format for geographical data. The site also offers links to agency Web pages that allow people to query specific databases. For example, it has a link to the General Services Administration's USAspending.gov site, which lets users search information on federal contracts.

Data.gov also has links to various agency widgets — small applications that can be embedded in Web sites. For example, an FBI widget offers brief summaries and photos of the bureau’s 10 most-wanted criminals.

Kundra said Data.gov gives just a taste of what exists at federal agencies. He estimated that the government has more than 10,000 systems, and material from many of them could end up on Data.gov or be accessible to the public in other ways. "The default assumption is that…data that is generated will be fed into Data.gov," he said.

"We don't want to release information that is sensitive in nature or compromise national security in any way,” Kundra added. “But other than that, it should be public."

The government’s job should be to make the data available but not decide what types of data are more valuable than others or better suited for reuse. "We don't have a monopoly on the best ideas or the best approaches,” he said. “For one person, the data may have no value, but for another person, it may be the missing link in solving a very difficult problem."

Of course, extracting data from many agencies’ older systems will prove to be difficult. And as they install new systems, agencies will need to incorporate data-sharing capabilities. Both activities require some planning on the part of federal information technology managers.

What could go wrong?

Getting government data to the point of reusability is no small task. Program managers who want to make their datasets available can start in one of two places: They can contact Data.gov through the Web site’s feedback form, or they can work with their CIOs.

"Databases aren't by their very nature public," said Chris Warner, vice president of marketing at JackBe, which offers mashup software that the Defense Information Systems Agency is using to help intelligence agencies gather information.

Databases don't make much sense to the people who didn't design them or don't use them on a daily basis. But the data they hold is increasingly in demand.

One big user of government data is the GovTrack Web site, which offers updates on congressional bills and information on committee members and their voting records. Joshua Tauberer, the site’s developer, said he gets some of the information by way of Really Simple Syndication feeds. But the site would be even better, he said, if he had direct access to the Library of Congress database that holds the information.

"The thing that they should be doing is providing the database as a bulk data download," Tauberer said, referring to the Library of Congress’ Thomas Web site for federal legislative information. "They’ve already got it in XML. There is no reason not to put it out there."

However, some government data sources aren't even encoded in proper databases.

For instance, the Federal Election Commission offers its data as old Cobol files, said Clay Johnson, who leads technical initiatives at the nonprofit Sunlight Foundation’s Sunlight Labs. The organization captures legislative and agency data and reformats it for use by advocacy groups and others.

"It's not up to par," Johnson said of the FEC data feeds. For instance, Cobol doesn't support negative numbers and instead offers a special key that is used to allow another system to produce a negative number after a brief calculation. Cobol also truncates data, such as work titles, which muddies the data even further.

One company with a good view of the state of government data, at least as far as contracting goes, is analyst firm FedSources, which provides agency and contract information to vendors in the government market. The company gets its data from multiple sources, notably GSA’s Federal Procurement Data System.

Ray Bjorklund, senior vice president and chief knowledge officer at FedSources, said the government still has a long way to go before it can offer a wide array of useful data sources.

Despite congressional efforts a few years ago to bring more transparency to government contracts through initiatives such as USAspending.gov, much of the work that FedSources does to collect and present government data is still a manual process.

"We have a system for getting the data," Bjorklund said. "It is the structure of the data that makes it much more challenging [to ingest] in a fully automated way."

For instance, not all agencies report on their contracts in the same way. As a new method of reporting data is adopted, reports formatted the old way can’t be updated. Mistakes can also creep into the data and compound as they spread across more reports. Bjorklund said a FedSources analyst noted that an agency reported awarding a $6 billion contract to a small company. The analyst flagged it as a mistake, and the agency corrected it a few months later. But had the mistake not been noticed, such a large number would have skewed contract reports for the agency or the company.
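A check like the one the FedSources analyst performed by hand can be partly automated. The sketch below is one possible approach, not FedSources' method: it flags any award that is far larger than the median award in the same batch. The record layout, the `factor` threshold and the dollar figures are all illustrative assumptions, not real contract data.

```python
from statistics import median


def flag_outliers(awards, factor=100.0):
    """Return awards whose amount exceeds `factor` times the batch median.

    `awards` is a list of dicts with hypothetical keys "id" and "amount".
    """
    if not awards:
        return []
    typical = median(a["amount"] for a in awards)
    if typical <= 0:
        return []
    return [a for a in awards if a["amount"] > factor * typical]


# Made-up sample batch: three ordinary awards and one implausibly large one.
sample = [
    {"id": "A", "amount": 1_000_000},
    {"id": "B", "amount": 2_000_000},
    {"id": "C", "amount": 1_500_000},
    {"id": "D", "amount": 6_000_000_000},
]
for award in flag_outliers(sample):
    print("review:", award["id"], award["amount"])
```

A flag of this kind only queues a record for human review; it cannot tell whether the figure is a typo or a genuinely large award.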

Aggregating data will produce the biggest headaches. One agency will have enough problems ensuring consistency, but combining data from multiple agencies, which would give users a better understanding of governmentwide trends, will be much harder. The task of aggregating data sources into sites such as USAspending.gov is an immense one, even though that site handles one small slice of government information, Bjorklund said.

Enemy of the state

Getting data into a coherent state will require data governance, which takes a considerable amount of work. Multiple agencies will have to agree on how data should be structured, such as through a common taxonomy or ontology, and on a precise dictionary of terms.

"There are lots of islands of data within the government that are well-structured," Bjorklund said. "Will you try to tie all those definitions together and reconcile them? Will you have agencies redesign their data environments to meet some new taxonomy or infrastructure?"

The first step any agency should take is to encode its data in XML.

"It is important to put the data in a format where there is metadata or tags around the data, so when one is looking at the data, they know what they are looking for," said Ken Raffel, senior director of business intelligence at consulting firm Guident. "If you just publish via CSV, there is no easy way to consume the data. But if you have XML, you have tags that describe the data."

Data quality is also important, said Peter Snyder, who maintains OpenRegs.com, which reformats information from the Federal Register and makes it easier to search by grouping it by agency or topic. "We spend a good amount of effort into correcting the mistakes that are in the data," Snyder said. "I make a number of guesses on an abbreviation."

Another issue agencies should resolve is what data to publish.

"Just because you have a data source, this doesn’t mean you want to publish all information of the data source," Raffel said. "You may just want to publish a portion of that."

In that case, IT employees should work with program managers to determine which data is relevant for outside consumption and use business intelligence software to package the reformatted data.

Easy on the polish

Formatting data for the public is a big job, but agencies should beware of spending too much effort shaping the data.

Feeds for Data.gov differ from how data is presented on an agency's Web site, Kundra said. Agency employees understand the audience and can frame the information in the correct context for the agency’s site. But for raw data feeds, the agency should not make assumptions about how the data will be used. "That is a huge distinction," Kundra said.

"People are very tempted to keep [their data].… You don't want to let go of the data until you've made a Web site for it," said Tim Berners-Lee, creator of the World Wide Web, in a recent discussion of government data that was broadcast via the Internet. "What I'd like to suggest is that before you create a beautiful Web site, give us the unadulterated data. We have to ask for raw data."

For instance, before Data.gov was launched, Johnson warned developers against using too many pie charts, graphs and other eye candy because that approach would delay the site’s release and cut down on the amount of data offered.

"Once you start applying visualizations to data, you start applying context to it as well, and we don't think this is the government's job,” Johnson said. “Visualizations are inherently editorial, and I'm not sure that's government's responsibility. I'm not saying don't do visualization, but don't do data visualization first. Work on the quality of the data first."

In his paper, Robinson suggests an even more radical tack: Agencies should split off the management of raw data from the management of their Web sites.

"The service people should have to access the data the same way that the public should access the data," Tauberer said. That approach would ensure that data remains in a reusable format and is not tied to a Web site or an application.

For agencies still struggling to get their Web sites to maturity, such separation of concerns might seem radical. But the launch of Data.gov suggests that agencies could be entering a new era of online dissemination of data, and that could require some radical rethinking of how to handle that data.
