The government’s effort to make data available will require common formats, data governance and a conscious effort to keep the data raw, which means resisting the urge to put it into context.
The White House's recently launched Data.gov site sets a new standard for the presentation of government data. Prepping such data, however, will be no small task.
Last month, the Obama administration unveiled a Web site that offers the public access to data feeds from various agencies. Although the number of feeds is modest, the site’s debut could signal a radical new way in which government agencies must handle and release their data.
Spearheaded by federal chief information officer Vivek Kundra and the CIO Council, Data.gov was assembled in less than three months. The initial offering has links to 47 datasets from a variety of agencies available in a variety of formats. For example, the U.S. Geological Survey has submitted its National Geochemical Survey database, which offers a nationwide analysis of soil and stream sediments, and the Fish and Wildlife Service has presented data about waterfowl migration flyways.
For Web surfers, the site might not seem very exciting. Its intended audience is not casual users but software developers. Federal officials are presenting data in machine-readable formats to make it easy for private or nonprofit organizations or individuals to reuse the information to build applications and services. By providing the raw material, officials hope that "the public sector [can] solve some of the toughest problems that this country faces," Kundra said in an interview with Government Computer News.
By getting Data.gov up and running, the administration is trying to make good on its promise for government transparency and answer the growing call for data-oriented interaction between the public and agencies.
"In order for public data to benefit from the same innovation and dynamism that characterize private parties’ use of the Internet, the federal government must reimagine its role as an information provider," wrote David Robinson and fellow Princeton University academics in the Yale Journal of Law and Technology. "Rather than struggling, as it currently does, to design sites that meet each end-user need, it should focus on creating a simple, reliable and publicly accessible infrastructure that 'exposes' the underlying data."
However, creating such an infrastructure will involve some challenges in the coming years.
Harbinger of change
Data.gov offers data in several formats, including Extensible Markup Language (XML), comma-separated values (CSV), plain text, Keyhole Markup Language and ESRI’s shapefile format for geographical data. The site also offers links to agency Web pages that allow people to query specific databases. For example, it has a link to the General Services Administration's USAspending.gov site, which lets users search information on federal contracts.
Data.gov also has links to various agency widgets — small applications that can be embedded in Web sites. For example, an FBI widget offers brief summaries and photos of the bureau’s 10 most-wanted criminals.
Kundra said Data.gov gives just a taste of what exists at federal agencies. He estimated that the government has more than 10,000 systems, and material from many of them could end up on Data.gov or be accessible to the public in other ways. "The default assumption is that…data that is generated will be fed into Data.gov," he said.
"We don't want to release information that is sensitive in nature or compromise national security in any way,” Kundra added. “But other than that, it should be public."
The government’s job should be to make the data available but not decide what types of data are more valuable than others or better suited for reuse. "We don't have a monopoly on the best ideas or the best approaches,” he said. “For one person, the data may have no value, but for another person, it may be the missing link in solving a very difficult problem."
Of course, extracting data from many agencies’ older systems will prove to be difficult. And as they install new systems, agencies will need to incorporate data-sharing capabilities. Both activities require some planning on the part of federal information technology managers.
What could go wrong?
Getting government data to the point of reusability is no small task. Program managers who want to make their datasets available can start in one of two places: They can contact Data.gov through the Web site’s feedback form, or they can work with their CIOs.
"Databases aren't by their very nature public," said Chris Warner, vice president of marketing at JackBe, which offers mashup software that the Defense Information Systems Agency is using to help intelligence agencies gather information.
Databases don't make much sense to the people who didn't design them or don't use them on a daily basis. But the data they hold is increasingly in demand.
One big user of government data is the GovTrack Web site, which offers updates on congressional bills and information on committee members and their voting records. Joshua Tauberer, the site’s developer, said he gets some of the information by way of Really Simple Syndication feeds. But the site would be even better if he had access to the Library of Congress database that holds the information.
"The thing that they should be doing is providing the database as a bulk data download," Tauberer said, referring to the Library of Congress’ Thomas Web site for federal legislative information. "They’ve already got it in XML. There is no reason not to put it out there."
However, some government data sources aren't even encoded in proper databases.
For instance, the Federal Election Commission offers its data as old Cobol files, said Clay Johnson, who leads technical initiatives at the nonprofit Sunlight Foundation’s Sunlight Labs. The organization captures legislative and agency data and reformats it for use by advocacy groups and others.
"It's not up to par," Johnson said of the FEC data feeds. For instance, Cobol doesn't support negative numbers and instead offers a special key that is used to allow another system to produce a negative number after a brief calculation. Cobol also truncates data, such as work titles, which muddies the data even further.
One company with a good view of the state of government data, at least as far as contracting goes, is analyst firm FedSources, which provides agency and contract information to vendors in the government market. The company gets its data from multiple sources, notably GSA’s Federal Procurement Data System.
Ray Bjorklund, senior vice president and chief knowledge officer at FedSources, said the government still has a long way to go before it can offer a wide array of useful data sources.
Despite congressional efforts a few years ago to bring more transparency to government contracts through initiatives such as USAspending.gov, much of the work that FedSources does to collect and present government data is still manual.
"We have a system for getting the data," Bjorklund said. "It is the structure of the data that makes it much more challenging [to ingest] in a fully automated way."
For instance, not all agencies report on their contracts in the same way. When a new method of reporting data is adopted, reports formatted the old way can’t be updated. Mistakes can also creep into the data, and they are compounded as they spread across more reports. Bjorklund said a FedSources analyst once noticed that an agency had reported awarding a $6 billion contract to a small company. The analyst flagged it as a mistake, and the agency corrected it a few months later. Had the mistake gone unnoticed, such a large number would have skewed contract reports for the agency or the company.
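A check like the one the FedSources analyst performed by hand can be partly automated. The sketch below flags any award that is far larger than the median award in a batch. The records, the field names and the 1,000-fold threshold are all invented for illustration and do not describe FedSources' actual process.

```python
from statistics import median


def flag_suspect_awards(awards, ratio=1000.0):
    """Return awards whose amount exceeds `ratio` times the batch median.

    `awards` is a list of dicts with hypothetical 'vendor' and 'amount' keys.
    """
    if not awards:
        return []
    typical = median(a["amount"] for a in awards)
    if typical <= 0:
        return []  # no meaningful baseline to compare against
    return [a for a in awards if a["amount"] > ratio * typical]


# Invented sample data: one entry is implausibly large for this batch.
sample = [
    {"vendor": "Vendor A", "amount": 250_000},
    {"vendor": "Vendor B", "amount": 400_000},
    {"vendor": "Vendor C", "amount": 6_000_000_000},
    {"vendor": "Vendor D", "amount": 1_200_000},
]

for award in flag_suspect_awards(sample):
    print("Review before publishing:", award["vendor"], award["amount"])
```

A rule this crude only queues records for human review. A large award can be legitimate, so the flagged entries still need someone to confirm them with the reporting agency.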
Aggregating data will produce the biggest headaches. One agency will have enough problems ensuring consistency, but combining data from multiple agencies, which would give users a better understanding of governmentwide trends, will be much harder. The task of aggregating data sources into sites such as USAspending.gov is an immense one, even though that site handles one small slice of government information, Bjorklund said.
Enemy of the state
Getting data into a coherent state will require data governance, which takes a considerable amount of work. Multiple agencies will have to agree on how data should be structured, such as through a common taxonomy or ontology, and on a precise dictionary of terms.
"There are lots of islands of data within the government that are well-structured," Bjorklund said. "Will you try to tie all those definitions together and reconcile them? Will you have agencies redesign their data environments to meet some new taxonomy or infrastructure?"
The first step any agency should take is to encode its data in XML.
"It is important to put the data in a format where there is metadata or tags around the data, so when one is looking at the data, they know what they are looking for," said Ken Raffel, senior director of business intelligence at consulting firm Guident. "If you just publish via CSV, there is no easy way to consume the data. But if you have XML, you have tags that describe the data."
Data quality is also important, said Peter Snyder, who maintains OpenRegs.com, which reformats information from the Federal Register and makes it easier to search by grouping it by agency or topic. "We spend a good amount of effort into correcting the mistakes that are in the data," Snyder said. "I make a number of guesses on an abbreviation."
Another issue agencies should resolve is which data to publish.
"Just because you have a data source, this doesn’t mean you want to publish all information of the data source," Raffel said. "You may just want to publish a portion of that."
In that case, IT employees should work with program managers to determine which data is relevant for outside consumption and use business intelligence software to package the reformatted data.
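The selection step can be as simple as an explicit list of fields approved for release. The sketch below is a plain-Python stand-in for what a business intelligence tool would do, not a description of any particular product. The field names are hypothetical.

```python
# Hypothetical list of fields that program managers have approved for release.
PUBLIC_FIELDS = ("award_id", "agency", "amount")


def public_view(records, allowed=PUBLIC_FIELDS):
    """Keep only approved fields, so new internal fields are never exposed by default."""
    return [{key: rec[key] for key in allowed if key in rec} for rec in records]


internal = [
    {"award_id": "X-1", "agency": "GSA", "amount": 5000, "reviewer_notes": "internal"},
]
print(public_view(internal))
```

Listing the fields to include, rather than the fields to exclude, means a column added to the source system later stays private until someone decides to publish it.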
Easy on the polish
Formatting data for the public is a big job, but agencies should beware of spending too much effort shaping the data.
Feeds for Data.gov differ from how data is presented on an agency's Web site, Kundra said. Agency employees understand the audience and can frame the information in the correct context for the agency’s site. But for raw data feeds, the agency should not make assumptions about how the data will be used. "That is a huge distinction," Kundra said.
"People are very tempted to keep [their data].… You don't want to let go of the data until you've made a Web site for it," said Tim Berners-Lee, creator of the World Wide Web, in a recent discussion of government data that was broadcast via the Internet. "What I'd like to suggest is that before you create a beautiful Web site, give us the unadulterated data. We have to ask for raw data."
For instance, before Data.gov was launched, Johnson warned developers against using too many pie charts, graphs and other eye candy because that approach would delay the site’s release and cut down on the amount of data offered.
"Once you start applying visualizations to data, you start applying context to it as well, and we don't think this is the government's job,” Johnson said. “Visualizations are inherently editorial, and I'm not sure that's government's responsibility. I'm not saying don't do visualization, but don't do data visualization first. Work on the quality of the data first."
In his paper, Robinson suggests an even more radical tack: Agencies should split off the management of raw data from the management of their Web sites.
"The service people should have to access the data the same way that the public should access the data," Tauberer said. That approach would ensure that data remains in a reusable format and is not tied to a Web site or an application.
For agencies still struggling to get their Web sites to maturity, such separation of concerns might seem radical. But the launch of Data.gov suggests that agencies could be entering a new era of online dissemination of data, and that could require some radical rethinking of how to handle that data.