NSA delves into next level of data analysis
The National Security Agency has launched an initiative to better trace and record the origins and accuracy of the data it collects, along with that data's movement between databases. The discipline, known as data provenance, is becoming more important as the intelligence community attempts to fuse and analyze troves of data from a variety of sources.
NSA has a pilot initiative that runs on top of a standard cloud architecture for big data and lets the agency track the entire life cycle of data, said Neal Ziring, technical director of NSA’s Information Assurance Directorate.
Big data technologies offer potential advantages for extracting knowledge and actionable information from mountains of raw data, but there are still challenges that government and industry must overcome to reap the benefits, Ziring said.
The intelligence community needs big data technologies to make sense of the complex patterns and behaviors of adversaries who have become more and more sophisticated, he said.
Ziring spoke June 13 at the National Institute of Standards and Technology’s Big Data Workshop in Gaithersburg, Md. He cited several challenges that keep government agencies from getting the full benefit of big data analytics, including fusing data from multiple sources, handling data subject to different forms of constraint, supporting analytic multi-tenancy and enabling exploration and discovery.
Often intelligence analysts want to derive actionable knowledge from data but find that there are constraints or restrictions on the data. They get a data feed from a source that comes with strings attached: It might be top secret, privacy protected or subject to legal considerations.
Access control, in which people have privileges to access certain information, is the simplest piece of this issue. That is fine if all a person is doing is searching and retrieving. Analyzing data, however, requires a way to express those conditions in a computation-friendly form.
So an analyst can say, “I want to perform this type of computation; therefore, I want to use this field [of data] but I can’t use that other field,” Ziring said. The capability is there for access control but not for analysis, he said, adding that the intelligence community is working on standardizing the simplest aspects of this area. “This area will benefit from standardization,” he said.
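The idea of stating, per computation, which fields may and may not be used can be illustrated with a minimal Python sketch. It is not NSA's method or any standard Ziring described; the field names, the purpose labels and the `usable_fields` helper are all hypothetical, chosen only to show what a computation-friendly constraint might look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    # Purposes for which a field may be used (hypothetical labels).
    allowed_purposes: frozenset


# Hypothetical policy table: field name -> purposes it may be used for.
POLICIES = {
    "timestamp": FieldPolicy(frozenset({"search", "aggregate", "correlate"})),
    "location": FieldPolicy(frozenset({"search", "aggregate"})),
    "subscriber_name": FieldPolicy(frozenset({"search"})),
}


def usable_fields(requested, purpose, policies=POLICIES):
    """Split requested fields into those permitted for `purpose` and those not.

    Fields with no policy entry are denied by default.
    """
    allowed, denied = [], []
    for name in requested:
        policy = policies.get(name)
        if policy is not None and purpose in policy.allowed_purposes:
            allowed.append(name)
        else:
            denied.append(name)
    return allowed, denied


if __name__ == "__main__":
    ok, blocked = usable_fields(
        ["timestamp", "location", "subscriber_name"], "aggregate"
    )
    print("may use:", ok)
    print("may not use:", blocked)
```

In this sketch the constraint travels with the field rather than with the person, so the same analyst can be allowed to search a field yet barred from feeding it into an aggregation.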
It is one thing to put complete classification markings onto data objects when you have 1,000 objects, but it is harder when you have 10 billion objects, he noted.
Data provenance, or data pedigree as some in the intelligence community call it, is a side challenge of dealing with data that is subject to constraints. Provenance technologies help organizations determine whether data has been compromised and provide metrics to help estimate the reliability of the data.
“Can you say, ‘Yes, senator, I do know where that data came from and I can account for its entire life cycle’? Can we say that today about all of our data in our big data repositories?” Ziring asked. “In my world, sometimes you get that kind of question.”
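One common way to make a data item's life cycle accountable is a hash-chained event log, in which each recorded event includes the hash of the event before it. The Python sketch below illustrates that general technique only; the article does not say how NSA's pilot works, and the event fields (`action`, `system`, `payload_hash`) and function names are assumptions made for the example.

```python
import hashlib
import json


def _digest(body):
    # Canonical JSON (sorted keys) so the same event always hashes the same.
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(chain, action, system, payload_hash):
    """Append a life-cycle event that is linked to the previous event's hash."""
    prev_hash = chain[-1]["event_hash"] if chain else None
    body = {
        "action": action,              # e.g. "ingest", "copy" (hypothetical)
        "system": system,              # where the event happened (hypothetical)
        "payload_hash": payload_hash,  # hash of the data at that point
        "prev_hash": prev_hash,
    }
    chain.append({**body, "event_hash": _digest(body)})
    return chain


def verify_chain(chain):
    """Return True only if every event is unmodified and correctly linked."""
    prev_hash = None
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "event_hash"}
        if entry["prev_hash"] != prev_hash or _digest(body) != entry["event_hash"]:
            return False
        prev_hash = entry["event_hash"]
    return True


if __name__ == "__main__":
    data = b"example record"
    h = hashlib.sha256(data).hexdigest()
    log = []
    append_event(log, "ingest", "source-feed", h)
    append_event(log, "copy", "analytic-store", h)
    print("intact:", verify_chain(log))
    log[0]["system"] = "somewhere-else"  # simulate tampering with history
    print("after tampering:", verify_chain(log))
```

A log like this answers only the "where did it come from" half of the question. Estimating reliability would need additional metadata about each source, which this sketch does not attempt.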