Startup aims to make Census data easier to use
- By Matt Leonard
- Aug 09, 2016
The demographic data that the Census Bureau releases on American society is critical for social scientists and economists and praised for its accuracy. It is not, however, known for its ease of use.
But one Austin, Texas-based startup is hoping to make Census data less dense and more accessible. The social network data.world and National Science Foundation-funded fellow Jonathan Ortiz are working toward a more intuitive dataset for the Census’ American Community Survey (ACS).
Data.world sells itself as a way to increase collaboration within the realm of data to “accelerate problem solving,” as the company’s cofounder and chief product officer Jon Loyens put it. It’s a social platform that helps people who work with data discover, prepare and share datasets. By linking datasets together using semantic web technology, data.world identifies and adds information about the concepts within the datasets, which makes them easier for people and machines to work with.
The conversation between Census and data.world began when Jeff Meisel, the Census Bureau's chief marketing officer, reached out to the South Big Data Hub, one of four regional innovation hubs established by the NSF to connect government with startups on collaborative data projects.
South Big Data Hub put Meisel in touch with data.world and paved the way for Ortiz’s fellowship.
“Part of our mission is making our public data as easy to use as possible,” Meisel said, adding that Census information is used to make major decisions in both the private and public sectors.
But Ortiz said that even for those with a background in computer science and data, Census information can be difficult to work with.
“It is extremely large,” Ortiz said about the ACS data. “Even when dealing with one year of ACS, it’s already huge. It gets too large to handle on one computer, and that’s when it starts to be considered big data.… It’s accurate, but it’s not the tidiest, easiest dataset that I’ve ever seen.”
What makes the data even more complicated is the supporting information needed to understand the complete set. An example Ortiz used is fuel cost per household in the ACS. In this category, the numeral 1 means the cost is included in rent, 2 has another meaning, and the numbers 3 through 9,999 correspond to monetary values.
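The mixed coding Ortiz describes can be sketched in a few lines of Python. This is a minimal illustration, not the Census Bureau's or data.world's code: the function name `decode_fuel_cost` and the labels it returns are hypothetical, and the meaning of code 2 is left generic because the article does not spell it out.

```python
from typing import Optional, Tuple


def decode_fuel_cost(raw: int) -> Tuple[str, Optional[int]]:
    """Split a raw fuel-cost value into (kind, dollars).

    Hypothetical helper: the labels are illustrative, not official ACS names.
    """
    if raw == 1:
        # Per the article, 1 means the fuel cost is included in rent.
        return ("included_in_rent", None)
    if raw == 2:
        # The article says only that 2 "has another meaning".
        return ("other_special_code", None)
    if 3 <= raw <= 9999:
        # Values 3 through 9,999 are monetary amounts.
        return ("dollars", raw)
    raise ValueError(f"unexpected fuel cost value: {raw}")


for value in (1, 2, 250):
    print(value, decode_fuel_cost(value))
```

Without a lookup like this, averaging the raw column would silently treat the codes 1 and 2 as dollar amounts, which is the kind of trap the supporting documentation is meant to prevent.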
The Census releases its data in tabular form. Ortiz explained that with tabular data, a computer doesn’t "know" what it’s looking at. But data.world is converting the data to RDF (Resource Description Framework), a format that attaches corresponding metadata to the values within a dataset. With the addition of metadata, computers can interpret a number and its relationship to the whole dataset.
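To make the tabular-versus-RDF distinction concrete, here is a rough sketch of how one table row might become a set of subject-predicate-object statements in the N-Triples style. It assumes a made-up namespace (`http://example.org/acs/`), a hypothetical column name and a hypothetical function `row_to_triples`; it is not data.world's actual pipeline or vocabulary.

```python
from typing import Dict, List

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/acs/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a quoted literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def row_to_triples(row_id: str, row: Dict[str, str]) -> List[str]:
    """Turn one table row into N-Triples-style statements.

    Each cell becomes: <row> <column> "value" .
    so every value carries a link to what it describes.
    """
    subject = f"<{BASE}household/{row_id}>"
    triples = []
    for column, value in row.items():
        predicate = f"<{BASE}variable/{column}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


for line in row_to_triples("1", {"fuel_cost": "250"}):
    print(line)
```

In a bare table, the cell holds only `250`. In the triple, the same value is tied to an identifier for the variable, and that identifier can in turn be described with further statements, such as its unit or the meaning of its special codes.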
This makes the second part of data.world’s goal more attainable. The website is a social platform where users can comment on and share datasets. Having smarter data makes this collaboration easier for users.
When data scientists begin working with a new set of numbers, the first and most time-consuming task is cleaning up the dataset to get a better idea of what they’re looking at. Data.world wants to shorten that first step, specifically for the Census data, Loyens said.
Ortiz said they are “about 85 percent done” with the ACS dataset for the state of Wyoming, which was the first state to make the transition to data.world because its low population generated a proportionally smaller dataset. The changes made in the information for Wyoming will be the same ones used for the rest of the states, so the transition will go much quicker for the remaining states and territories, he said.
Editor's note: This article was changed Aug. 10 to clarify how data.world works with tabular data.
Matt Leonard is a former reporter for GCN.