
Researchers raise concerns with differential privacy use on census data

After the Census Bureau announced in 2018 that it would use differential privacy to protect the identities of individuals in the 2020 census, researchers at Penn State began to evaluate how the change could affect the integrity of census data.


Differential privacy injects random "noise" into the aggregate data in an effort to better protect the identities of individual respondents when the data is published.

Nicholas N. Nagle, an associate professor of geography at the University of Tennessee who analyzed census test data, explained the technique this way: “In a nutshell, differential privacy involves not reporting exactly accurate numbers – like ‘5 people in Bigtown City are Hispanic males’ – but rather a random number relatively close to the accurate one, like 11. These random errors make it much harder for a data scientist to go back and figure out which Hispanic male in that city might be connected with a specific public record. And the public has some information, though it’s not exactly accurate or complete.”
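To make Nagle's description concrete, here is a minimal sketch of a noisy count in Python. It uses the textbook Laplace mechanism, which is an assumption made for illustration and not the Census Bureau's actual algorithm. The function name noisy_count, the privacy parameter value of 0.25 and the random seed are likewise hypothetical.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise of scale 1/epsilon.

    Illustrative only: a textbook Laplace mechanism for a counting
    query, not the Census Bureau's production method.
    """
    # The difference of two exponential draws with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    # Rounding and clamping at zero keep the published figure
    # looking like a count.
    return max(0, round(true_count + noise))

rng = random.Random(2020)  # hypothetical seed, fixed for repeatability
# Publish the "5 people" example several times to see the spread.
published = [noisy_count(5, 0.25, rng) for _ in range(5)]
print(published)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means figures closer to the true count.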

Nagle said his analysis showed that state population counts are completely accurate, and estimates for large populations -- like the number of 20-year-olds in Virginia, or the number of Hispanic people in Los Angeles -- are relatively accurate. Data on small populations, however, was “unacceptably wrong,” he said, citing an example of Kalawao County, Hawaii, a former leper colony, which had so much randomness added to its data that its population count jumped from 90 to 716.
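One way to see why small populations fare worse is that noise of a fixed typical size is a much larger share of a small count than of a large one. The sketch below is a simplified illustration that assumes Laplace noise with scale 1/epsilon, whose average absolute size is 1/epsilon. The epsilon value and the 500,000 figure are hypothetical; the 90 is Kalawao County's count cited above. It does not attempt to reproduce the bureau's published numbers.

```python
def expected_relative_error(true_count, epsilon):
    """Average absolute Laplace noise (1/epsilon) as a share of the count."""
    return (1.0 / epsilon) / true_count

epsilon = 0.1  # hypothetical privacy parameter
for label, count in [("small county", 90), ("large group", 500_000)]:
    share = expected_relative_error(count, epsilon)
    print(f"{label}: typical error is {share:.4%} of the true count")
```

Under these assumptions the same amount of noise barely registers for the large group but is a sizable fraction of the small county's population.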

The Penn State researchers zeroed in on mortality rates among racial and ethnic minorities and found that, compared with traditional methods of identity protection, applying differential privacy to the 2010 census data produced dramatic changes in those rate estimates.

"We focused on mortality rate estimates because they are an essential population-level metric for which data are collected and disseminated at the national level and because mortality rates are a critical indicator of population health," Alexis Santos, assistant professor of human development and family studies, told Penn State News.

"We discovered that by using differential privacy, there were both instances of under- and over-counting of the population. In rural areas, there was undercounting of racial and ethnic minorities, while in urban areas there was an overcounting of these populations," he said. In some cases, discrepancies between the two methods of data analysis exceeded a 10% difference.

"This is very concerning because it could impact how much funding programs receive for a specific geographic area," said Santos. "These discrepancies could result in understated health risks in some areas, while overstating in others where there isn't a great need."

According to Santos, the findings highlight the consequences of implementing differential privacy and demonstrate the challenges in using the data products derived from this method.

"The Census Bureau has been very receptive to our research, and demonstrated concern about the accuracy of the data," Santos said. "We plan to move forward with additional research to determine how differential privacy may affect population growth estimates and populations changes from census year to census year. We still have time to fine tune the differential privacy algorithm, and our research will help pinpoint areas of improvement."

About the Author

Susan Miller is executive editor at GCN.

Over a career spent in tech media, Miller has worked in editorial, print production and online, starting on the copy desk at IDG’s ComputerWorld, moving to print production for Federal Computer Week and later helping launch websites and email newsletter delivery for FCW. After a turn at Virginia’s Center for Innovative Technology, where she worked to promote technology-based economic development, she rejoined what was to become 1105 Media in 2004, eventually managing content and production for all the company's government-focused websites. Miller shifted back to editorial in 2012, when she began working with GCN.

Miller has a BA and MA from West Chester University and did Ph.D. work in English at the University of Delaware.

Connect with Susan at @sjaymiller.
