Mining government data while protecting privacy
- By Stephanie Kanowitz
- May 31, 2019
The personally identifiable information government agencies hold on individuals must be secured against hackers, but privacy policies also make it unavailable to other agencies. Those restrictions make pulling insights from cross-referenced government data residing in multiple agencies nearly impossible. Now, however, a project testing two technologies’ ability to analyze data while protecting the sensitive information within it shows promise for further development.
A partnership among the Allegheny County, Pa., Department of Human Services, the Bipartisan Policy Center (BPC) and Galois, a tech research and development company, found that it could improve data analysis while protecting individual privacy using secure multiparty computation, a cryptography technique that protects participants' privacy from each other. The partners were also able to provide analysis quickly for those times when decision-makers demand fast responses to their data-based questions.
“Policymakers sometimes want answers to questions very quickly, and as we go forward in terms of analyzing confidential data, we have to be able to calibrate the timeliness of information with also the burden that is imposed by the analysis itself,” said Nick Hart, who was director of BPC’s Evidence Project when the project took place and is now CEO of the Data Coalition. “Part of the reason for doing this was to explore whether the … technology could achieve the goals of privacy protection, but also data analysis in a meaningful way and could we reasonably do it within a public-sector setting.”
The test used millions of county records from five anonymized datasets covering homelessness, jail records, medical examiner records, mental health services, and youth and family services. Although data protection was part of the test, the data was anonymized to make it easier for the county to share, said David Archer, Galois' principal scientist for cryptography and multiparty computation.
The project tested two technologies. The first was secure multiparty computation.
“You can think of that as a software-based technology where we can compute on data in such a way that it’s never revealed even to the computer that’s doing the work, but we can still do computations on it,” Archer said. "So, computing where all the data stays encrypted."
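The simplest flavor of this idea can be sketched with additive secret sharing, one common building block of multiparty computation (the article does not specify which protocol Galois used, so this is an illustrative assumption): each sensitive value is split into random shares, the parties compute on shares alone, and only the combined result is ever revealed.

```python
import random

MOD = 2**61 - 1  # large modulus for arithmetic secret sharing

def share(value, n_parties=3):
    """Split a value into n random shares that sum to the value mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Combine shares to recover the underlying value."""
    return sum(shares) % MOD

# Two hypothetical agencies each hold a sensitive count; neither reveals it.
agency_a_count = 1200
agency_b_count = 3400

shares_a = share(agency_a_count)
shares_b = share(agency_b_count)

# Each compute party adds only the shares it holds. A single share is a
# uniformly random number, so no party learns either agency's input.
summed_shares = [a + b for a, b in zip(shares_a, shares_b)]

# Only when the result shares are combined does the total appear.
total = reconstruct(summed_shares)
print(total)  # 4600
```

A production protocol would use cryptographically secure randomness and handle multiplication and comparisons as well, but the sketch shows the core property Archer describes: no single machine ever sees the plaintext data it is computing on.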
The other technology was a trusted execution environment. In this case, Galois had processors that hide data while it’s being worked on.
“Think of it as wrapping a black box around the data so that nothing in the system can look inside the black box except the code inside doing the computation,” Archer said.
Both technologies were run on the experiments to see how they compared and if either worked, he explained.
“We ran several different queries that involved bringing those datasets together and asking questions that involved multiple datasets at once,” Archer said. For example, they asked for the proportion of people who were jailed or homeless in a certain time period who had also used county mental health services. “For all of our experiments, we got exactly the same answers to those experiments as was gotten by running the same questions totally in the clear. What that shows is that computing on encrypted data with these technologies does not affect the accuracy of the results at all.”
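The kind of cross-dataset question Archer describes can be expressed as a simple clear-text query, which is what the encrypted runs were checked against. This is a miniature sketch with made-up person IDs, not the county's actual data or schema:

```python
# Hypothetical miniature records; the real datasets held millions of rows.
jailed = {"p1", "p2", "p3"}          # IDs jailed in the time period
homeless = {"p3", "p4"}              # IDs in homelessness records
mental_health = {"p2", "p4", "p5"}   # IDs who used mental health services

cohort = jailed | homeless           # jailed OR homeless in the period
overlap = cohort & mental_health     # ...who also used mental health services

proportion = len(overlap) / len(cohort)
print(proportion)  # 0.5
```

Running the same set-intersection logic under multiparty computation or in a trusted execution environment should, per Archer's finding, produce exactly this value, only without any party seeing the underlying IDs.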
Performance was another measure of the project’s success. In the hardware-enabled trusted execution environment, the time to answer the queries was low -- a fraction of a second, about the same as a typical database.
The multiparty computation ran slower, but Archer said he expected that. Those tests took 10,000 times longer than the trusted execution environment.
“There’s a clear tradeoff here about timeliness of the data analytics, the cost of building the system and what you get as a practical but also privacy-protected result,” Hart said.
That can be an issue, said Erin Dalton, deputy director of the county’s Office of Analytics, Technology and Planning. “If we need to be displaying these data very quickly for decision-making, then I think these technologies might have a way to go,” she said. “But if not -- if these are things that could run overnight or something like that -- then I think they hold great promise for the integration of data.”
The next steps for multiparty computation are emerging, Hart said. Several federal agencies have expressed interest in conducting pilot tests.
“We need more demonstration projects of the technology using different types of data in different contexts to learn about how to best deploy, to how to most efficiently deploy and to most effectively use the approach in ensuring that the data are protected,” Hart said.
Meanwhile, Congress is pushing for the technology in legislation that targets specific policy areas. For example, Sens. Ron Wyden (D-Ore.), Marco Rubio (R-Fla.) and Mark Warner (D-Va.) and a bipartisan group in the House introduced bills last November that would encourage the use of multiparty computation for analyses that would make it easier for prospective college students to get information about their costs and expected outcomes after graduating from a given university with a certain major.
Last July, Rep. Kevin McCarthy (R-Calif.) introduced legislation to create a pilot program at the National Institutes of Health to use multiparty computation at hospitals for vaccine and treatment research on infections caused by soil-borne fungi.
Looking broadly, Dalton sees potential use cases for international efforts such as sharing information between the United States and Russia. She plans to watch as the technology develops.
“We’re always looking, just like everybody else, to enhance our security and make sure we keep up to speed on the latest technologies there, so as they become available, yeah, we’re in,” Dalton said.
Stephanie Kanowitz is a freelance writer based in northern Virginia.