The differential privacy used to protect census data may make some of the resulting datasets unfit for many typical use cases, including some required by state and federal laws.
Census data can be pretty sensitive -- it’s not just how many people live in a neighborhood, a town, a state or the nation as a whole. Every 10 years, the Census Bureau asks about people’s ages, racial and ethnic backgrounds, personal relationships to others they live with and more. It’s information many people don’t share with neighbors or co-workers, much less the federal government.
People who don’t trust the Census Bureau to keep their data private and secure will be less likely to answer truthfully -- or answer at all.
Federal laws bar the bureau and its employees from sharing data with anyone, including other government agencies like police and the IRS. And the Census Bureau is taking new steps to protect the 2020 census data even more.
Census data can be published only as collections of statistics, but in an age when so many companies are collecting so much data about people, even anonymized statistics can present a privacy risk. Using some of this commercial data, census researchers conducted a simulated attack on their own data and were able to identify as many as 17% of the people who responded to the 2010 census.
The new protections, however, are raising concerns among community advocates, government officials and scholars who note that the method the Census Bureau is using to increase privacy makes the results less accurate. They worry a more private census may be less useful.
As a geographer who studies how to make and use geographic data, I have been involved over the past decade in efforts to modernize the 2020 census and make it more cost-effective. I see the importance of striking a balance between protecting our privacy and having accurate statistics for data-based decision-making.
An engine of government and the economy
The main purpose of the census, according to the clause of the Constitution that requires it to happen every 10 years, is to count the number of people living in each state, to determine how many members of the House of Representatives each state should get.
That’s easy enough, and could be done without collecting or publishing any personal data at all. But a survey that is supposed to reach every household in the country presents a rare opportunity to ask other questions too. So, from the very first one in 1790, the census has counted more than just noses.
The information it collects -- including ages, racial and ethnic information and home ownership rates -- helps determine how the federal government allocates US$1.5 trillion in spending every year. States, local governments, researchers and businesses also rely on census data to make spending plans and analyze community characteristics.
The U.S. has one of the most accurate and reliable censuses in the world. The resulting data has played a meaningful part in creating the economic prosperity and growth of the United States.
Data science breaks privacy protections
The Census Bureau -- and most statistical analysts too -- used to think that people’s privacy was protected by aggregating data together in large numbers. So the focus was on protecting privacy in small populations. Instead of saying, for instance, there were two Hispanic people in a particular neighborhood, the census data would say there were fewer than three.
In other cases, the Census Bureau computers swapped the numbers for households in different geographic areas, to mix up the data just a bit. Those changes were minor and did not significantly affect the overall accuracy of the data.
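The small-count rule described above can be sketched in a few lines. This is only an illustration: the function name `publish_count` is hypothetical, and the threshold of three is taken from the neighborhood example rather than from any published Census Bureau rule.

```python
def publish_count(true_count, threshold=3):
    """Return a publishable label for a count, hiding exact small values.

    Assumption: any count below `threshold` is reported only as
    "fewer than <threshold>"; larger counts are reported exactly.
    """
    if true_count < threshold:
        return f"fewer than {threshold}"
    return str(true_count)


print(publish_count(2))    # small count is masked
print(publish_count(250))  # large count is published as-is
```

A rule like this protects only the smallest cells. It does nothing to stop someone from combining many exact larger counts, which is the weakness the later research exposed.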
As recently as 2012, scholarly research determined that the risk of revealing one person’s private information in census data was small, as low as 0.04%. But just a few years later, new research turned that finding upside-down.
In 2017 and 2018, the Census Bureau found that a data scientist who had access to commercial and public databases could match that information up with census statistics in a way that could identify as many as 17% of Americans who had completed the 2010 census.
That level of vulnerability was unacceptable to census officials, and the race was on to create better protections in time for the next census.
What is differential privacy?
One of the challenges for officials and scholars like me is that the new system, called differential privacy, is very hard to explain. It’s so complicated that even one of the scholars who invented it, Harvard computer scientist Cynthia Dwork, has admitted that “It’s a dream of mine to learn how to really explain this so that it’s widely accessible.”
In a nutshell, differential privacy involves not reporting exactly accurate numbers – like “5 people in Bigtown City are Hispanic males” – but rather a random number relatively close to the accurate one, like 11. These random errors make it much harder for a data scientist to go back and figure out which Hispanic male in that city might be connected with a specific public record. And the public has some information, though it’s not exactly accurate or complete.
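To make the idea of a "random number relatively close to the accurate one" concrete, here is a minimal sketch of the textbook Laplace mechanism for a single count. The article does not say which noise distribution the bureau uses, so the Laplace choice is an assumption, as are the function name `noisy_count` and the value chosen for `epsilon`, the conventional name for the privacy-loss parameter. A smaller `epsilon` means more noise and more privacy.

```python
import random


def noisy_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise of scale 1/epsilon, rounded.

    Assumption: adding or removing one person changes a count by at
    most 1, so the noise scale is 1/epsilon.
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution centered on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return round(true_count + noise)


rng = random.Random(2020)  # fixed seed so the sketch is repeatable
samples = [noisy_count(5, epsilon=0.25, rng=rng) for _ in range(10_000)]

print("first ten published values:", samples[:10])
print("average over all draws:", sum(samples) / len(samples))
```

Any single published value can be far from 5, and can even be negative, but the average over many draws stays near the true count. That is why large totals built from many cells tend to hold up better than small ones.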
The system is so complex because it must make sure that all the randomly generated approximations make sense with each other. For example, the number of males plus the number of females must equal the total number of people. And the sum of all county populations in Tennessee must equal the state population of Tennessee.
In addition, to satisfy constitutional requirements, the total population of each state must be exactly correct – not adjusted by differential privacy at all – even though city and county totals may have quite a bit of randomness in them.
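The consistency requirement can be illustrated with a simple post-processing step: take noisy county counts and adjust them so they are non-negative whole numbers that add up to the exact state total. This is a deliberately simple stand-in, not the bureau's actual procedure. The function name `fit_to_total` and the county figures are made up for illustration.

```python
def fit_to_total(noisy_counts, exact_total):
    """Adjust integer noisy counts so they are >= 0 and sum to exact_total.

    Assumption: proportional rescaling with largest-remainder rounding,
    chosen here only because it is easy to follow.
    """
    weights = [max(c, 0) for c in noisy_counts]  # no negative populations
    weight_sum = sum(weights)
    if weight_sum == 0:
        # Nothing usable left after clipping: fall back to equal shares.
        weights = [1] * len(weights)
        weight_sum = len(weights)

    # Integer arithmetic keeps the sums exact.
    floors = [w * exact_total // weight_sum for w in weights]
    remainders = [w * exact_total % weight_sum for w in weights]

    # Hand the leftover people to the entries with the largest remainders.
    leftover = exact_total - sum(floors)
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors


# Hypothetical noisy county counts for a state whose exact total is 2,000.
counties = fit_to_total([1040, -3, 512, 455], exact_total=2000)
print(counties, "sum =", sum(counties))
```

Forcing the numbers to agree in this way moves each county a little further from its noisy value, which hints at why the real system is so complicated and why small places can end up with large errors.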
A troubling shift
The idea of intentionally adding errors to data is a dramatic change for the census. To help users understand the new method, the Census Bureau produced a test data set, applying differential privacy to the 2010 census results.
I was one of the group of experts who analyzed the test data. Some of what we found was reassuring: State population counts are, by design, completely accurate. And estimates for large populations -- like the number of 20-year-olds in Virginia, or the number of Hispanic people in Los Angeles -- are relatively accurate.
But much of what we found was shocking. Small counts are often unacceptably wrong. In the most extreme case, tiny Kalawao County, Hawaii, a former leper colony that is accessible only by air, sea or mule, had so much randomness added that its population jumped from 90 to 716.
My research group’s findings in Tennessee, where I live and work, showed that these errors could have big effects on local governments. For example, the state of Tennessee uses the census to determine how much money from sales, alcohol and gas taxes to send back to towns. In a typical year, the state sends about $120 per person to each town.
However, the randomness of differential privacy would have created a virtual lottery, with towns receiving anywhere from $80 to $180 per actual resident, instead of an even $120 for everyone. For a small rural community, this could mean the difference between repaving Main Street and laying off a full-time police officer.
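The arithmetic behind that lottery is straightforward. The sketch below assumes the state pays a flat $120 for each person in the reported count, so a town's money per actual resident rises or falls with the ratio of reported to true population. The town sizes are hypothetical, chosen only to reproduce the $80 and $180 extremes.

```python
RATE_PER_PERSON = 120.0  # dollars per reported person in a typical year


def effective_rate(true_population, reported_population, rate=RATE_PER_PERSON):
    """Dollars received per actual resident when funding follows the reported count."""
    return rate * reported_population / true_population


# A hypothetical town of 600 people under three different reported counts.
print(effective_rate(600, 600))  # 120.0 -> count is accurate
print(effective_rate(600, 400))  # 80.0  -> undercounted by a third
print(effective_rate(600, 900))  # 180.0 -> overcounted by half
```

In other words, the range of $80 to $180 corresponds to reported populations anywhere from two-thirds to one and a half times the true figure.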
Other disturbing findings include:
- A consistently low count of the number of Native Americans living on reservations.
- A consistent overcount of the population of rural congressional districts.
- Many counties with implausible statistics, such as having no vacant homes at all.
- Many counties with more households than people, which is impossible.
The consensus of many of the experts involved was that the test data, protected by differential privacy, are not fit for many uses, including some required by state and federal laws.
Time is running out
The Census Bureau is responding to the criticism raised by the experts, and recent census reports acknowledge that the test results deliver unacceptably inaccurate figures for small towns and for the count of Native Americans living on reservations. However, returning to the old methods is no longer being discussed as an option.
It’s unclear how the Census Bureau might untangle this mess in a way that yields both reliable statistics and reasonable privacy protection. The first deadline to publish small area statistics is March 31, 2021, when the congressional redistricting data are released.
What happens between now and then will determine whether the Census Bureau can solve the problem -- and convince officials, researchers and analysts that its solution is, in fact, useful for all the other purposes census data serve.
This article was first posted on The Conversation.