Cleansed agency data may still ID individuals
By Joab Jackson
May 25, 2004
SEATTLE - Due to potential privacy concerns, the days of federal agencies offering large amounts of detailed statistical data may be quickly coming to an end, predicted Alan Karr, a researcher at the National Institute of Statistical Sciences of Research Triangle Park, N.C.
"The practice of the agencies being able to disclose mass micro-data is possibly on its way out," Karr said.
Karr was speaking at the fifth annual National Conference on Digital Government Research, being held this week in Seattle. The conference is a forum for participants in the National Science Foundation's Digital Government Research Program to share ideas and present research.
Karr explained that even after an agency attempts to anonymize the data it has collected, that data can still be used to re-identify individuals.
"If you know the date of somebody's birth and their five-digit ZIP code, and their sex, you can match them in most databases," Karr said.
As an example, Karr ran anonymous voting record data against a Web site called Anybirthday.com. Given a first name and a last name, this site attempts to return the ZIP code and birthday of the individual. The Web site is run by American Automated Systems Inc. of Canton, Ohio. On the Web site FAQ, the company claims to draw its information from public records.
At present, government agencies such as the Census Bureau offer large tables of statistical data. Although individuals are not identified within these tables, the aggregated data on each anonymous individual can be used to pinpoint that person. As a result, even databases with no identifying information can present privacy risks, Karr said.
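The matching Karr describes can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the records, the field names (`birth_date`, `zip`, `sex`) and the `reidentify` helper are all made up here, and nothing below reflects the data or tools Karr actually used.

```python
from collections import defaultdict

# Quasi-identifiers named in the article: birth date, five-digit ZIP code, sex.
KEYS = ("birth_date", "zip", "sex")


def reidentify(anonymous_rows, public_rows, keys=KEYS):
    """Link nameless rows to named rows that share the same quasi-identifiers.

    Returns {index of anonymous row: name} for rows with exactly one candidate.
    """
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])

    matches = {}
    for i, row in enumerate(anonymous_rows):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match pins down one person
            matches[i] = candidates[0]
    return matches


# Hypothetical "anonymized" release: no names, but quasi-identifiers remain.
anonymous = [
    {"birth_date": "1961-03-14", "zip": "98101", "sex": "F", "voted": True},
    {"birth_date": "1975-11-02", "zip": "98101", "sex": "M", "voted": False},
]

# Hypothetical public listing that carries names.
public = [
    {"name": "Pat Example", "birth_date": "1961-03-14", "zip": "98101", "sex": "F"},
    {"name": "Sam Sample", "birth_date": "1975-11-02", "zip": "98101", "sex": "M"},
    {"name": "Lee Sample", "birth_date": "1975-11-02", "zip": "98101", "sex": "M"},
]

matches = reidentify(anonymous, public)
print(matches)
```

In this toy data the first anonymous row links to exactly one named person, while the second stays ambiguous because two public entries share its birth date, ZIP code and sex.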
Karr leads a National Science Foundation-funded effort by the National Institute of Statistical Sciences to find ways of publicly releasing large amounts of information while protecting confidential data. The Bureau of Labor Statistics, the Bureau of Transportation Statistics, the Census Bureau and the National Center for Education Statistics participate in the project.
"A lot of federal agencies are trying to balance the confidentiality of the data against disseminating information derived from the data," Karr said.
Karr presented some of the project's work at the conference, including a data swapping toolkit that switches attributes of individual data sets, obscuring identifying information. The project team has also developed a technique of performing linear regressions on multiple, separate databases. This technique is useful for agencies that wish to draw information from multiple parties that would not want to transfer information to that agency for fear of a breach of confidentiality, Karr explained.
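Both ideas can be sketched in a few lines of Python. This is a rough illustration, not the project's toolkit: the function names, the record layout and the choice of a one-predictor regression are assumptions made here. The regression sketch also adds the per-party totals in the clear, whereas a real deployment would presumably protect that step as well.

```python
import random


def swap_attribute(records, attr, fraction=1.0, seed=0):
    """Return a copy of `records` with `attr` exchanged between random pairs.

    The set of values for `attr` is unchanged overall, but a value no longer
    reliably belongs to the record it appears in.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]
    k = int(len(swapped) * fraction) // 2 * 2  # even number of records to pair
    chosen = rng.sample(range(len(swapped)), k)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        swapped[a][attr], swapped[b][attr] = swapped[b][attr], swapped[a][attr]
    return swapped


def local_summary(xs, ys):
    """Totals one party computes on its own data; raw rows never leave it."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def pooled_fit(summaries):
    """Fit y = slope * x + intercept from the parties' combined totals."""
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*summaries))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical records for the swapping example.
people = [{"zip": z, "income": i} for z, i in
          [("98101", 40), ("98102", 55), ("98103", 70), ("98104", 85)]]
masked = swap_attribute(people, "zip")

# Two hypothetical parties, each holding part of the data for y = 2x + 1.
party_a = local_summary([1, 2], [3, 5])
party_b = local_summary([3, 4], [7, 9])
slope, intercept = pooled_fit([party_a, party_b])
print(slope, intercept)
```

The pooled fit matches what a single regression over all four points would give, because least squares for one predictor depends on the data only through these five totals.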
Joab Jackson is the senior technology editor for Government Computer News.