What are big data techniques and why do you need them?
The increase in data volumes threatens to overwhelm most government agencies, and big data techniques can help ease the burden.
Big data is a new term but not a wholly new area of IT expertise. Many of the research-oriented agencies — such as NASA, the National Institutes of Health and Energy Department laboratories — along with the various intelligence agencies have been engaged with aspects of big data for years, though they probably never called it that. It’s the recent explosion of digital data in the past few years and its expected acceleration that are now making the term relevant for the broader government enterprise.
At their most basic, big data strategies seek a process for managing and getting value out of the volumes of data that agencies have to grapple with, which are much greater than in the past. In particular, they aim to process the kind of unstructured data that’s produced by digital sensors, such as surveillance cameras, and the smart phones and other mobile devices that are now ubiquitous.
That increase in data won’t slow any time soon. Market analyst IDC, in its “2011 Digital Universe Study,” predicted that data volumes would expand 50-fold by 2020, with the number of servers needed to manage that data increasing by a factor of 10. Unstructured data such as computer files, e-mail messages and videos will account for 90 percent of all the data created in the next decade, IDC said.
Breaking the problem into clearly defined constituent parts yields four areas — often called the four Vs — that describe big data issues.
Variety: The data is in both structured and unstructured forms; ranges across the spectrum of e-mail messages, document files, tweets, text messages, audio and video; and is produced from a wide number of sources, such as social media feeds, document management feeds and, particularly in government, sensors.
Velocity: The data is coming at ever increasing speeds — in the case of some agencies, such as components of the Defense Department and the intelligence community, at millisecond rates from the various sensors they deploy.
Volume: The data has to be collected, stored and distributed at levels that would quickly overwhelm traditional management techniques. A database of 10 terabytes, for example, is an order of magnitude or two smaller than what would be considered normal for a big data project.
Value: The data can be used to address a specific problem or can address a particular mission objective that the agency has defined.
It’s not only the much bigger quantities of data but also the rate at which it’s coming in and the fact that it’s mostly unstructured that are outstripping organizations’ ability to use it with existing methods, said Dale Wickizer, chief technology officer for the U.S. Public Sector at NetApp Inc.
“The IDC study also says that the number of files are expected to grow 75 times over the next decade,” he said, “and if that’s true, then a lot of the traditional approaches to file systems break because you run out of pointers to file system blocks.”
That massive increase in the quantity and velocity of data threatens to overwhelm even those government agencies that are the most experienced in handling problems of big data.
A report in 2011 by the President’s Council of Advisors on Science and Technology concluded that the government was underinvesting in technologies related to big data. In response, six departments and agencies — the National Science Foundation, NIH, the U.S. Geological Survey, DOD, DOE and the Defense Advanced Research Projects Agency — announced a joint research and development initiative on March 29 that will invest more than $200 million to develop new big data tools and techniques.
The initiative “promises to transform our ability to use big data for scientific discovery, environmental and biomedical research, education and national security,” said John Holdren, director of the White House Office of Science and Technology Policy.
Big data is definitely not just about getting a bigger database to hold the data, said Bob Gourley, founder and chief technology officer of technology research and advisory firm Crucial Point and former CTO at the Defense Intelligence Agency.
“We use the term to describe the new approaches that must be put in place to deal with this overwhelming amount of data that people want to do analysis on,” he said. “So it’s about a new architecture and a new way of using software and that architecture to do analysis on the data.”
Those efforts almost always involve the use of Hadoop, he said, which is open-source software specifically developed to do analysis and transformation of both structured and unstructured data. Hadoop is central to the search capabilities of Google and Yahoo and to the kinds of services that social media companies such as Facebook and LinkedIn provide.
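Hadoop’s core idea can be sketched without a cluster. The following is a minimal, self-contained illustration of the MapReduce pattern Hadoop implements — map, shuffle and reduce phases counting words in a few sample lines. The function names and sample data are illustrative, not Hadoop APIs; a real Hadoop job distributes these phases across many machines.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group the mapped pairs by key (word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# A few stand-in "sensor feed" lines in place of real agency data.
lines = [
    "sensor data sensor feed",
    "data feed data",
]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'sensor': 2, 'data': 3, 'feed': 2}
```

Because the map and reduce steps touch each record independently, the same logic scales from two lines in memory to petabytes spread across thousands of nodes — which is what makes the pattern suited to the unstructured data described above.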
From its beginning, Google has had to index the entire Internet and be able to search across it with almost instant results, Gourley said, and that’s something the company just couldn’t do with old architectures and the traditional ways of querying and searching databases.
“Today’s enterprise computer systems are frequently limited by how fast you write data to the disk and how fast you can read data from it,” Gourley said. “You also have to move them around in the data center, and they have to go through a switch, which is another place where things really slow down.”
If you have a data store larger than 1 terabyte, searches could take days for complex business or intelligence applications, he said. “If you were to sequence the entire human genome the old-fashioned way, it could take weeks or months, whereas with big data techniques, it would take minutes,” he added.
However, those techniques are not a panacea for all government’s data problems. Even though they typically use cheap commercial processors and storage, big data solutions still require maintenance and upkeep, so they’re not a zero-cost proposition.
Oracle, as the biggest provider of traditional relational database solutions to government, also provides big data solutions but is careful to make sure its customers actually need them, said Peter Doolan, group vice president and chief technologist for Oracle Public Sector.
“We have to be very careful when talking to our customers because most of them will think we’re trying to recast big data in the context of relational databases,” he said. “Obviously, we like to speak about that, but to fully respect big data, we do cast it as a very different conversation, with different products and a different architecture.”
Doolan begins by asking clients about the four Vs listed above and determines whether an agency’s problems fall into those categories. Frequently, he said, clients discover that their problems can be solved with their existing infrastructure. Many times, the problems are related to content management rather than big data.
It’s also a matter of explaining just how complex big data solutions can be.
“Lots of people seem to think that big data is like putting Google search into their data and then surfing through that data just as they would a desktop Google search,” Doolan said. “But a Hadoop script is definitely not a trivial piece of software.”
Along those lines, then, big data solutions should be applied to specific mission needs rather than treated as a cure-all for an agency’s data needs. It’s a complementary process rather than a replacement for current database searches.
“We’re at the beginning of the hype cycle on big data,” Wickizer said. “The danger is that more and more enterprises will rush to adopt it, and then the trough of disillusionment will hit as they realize that no one told them of the custom coding and other things they needed to do to make their big data solutions work. This is important as agencies make partner decisions; they need to be critical about who they work with.”
However, he also sees big data as an inevitable trend that all organizations will eventually need to be involved in. In light of the mass of unstructured data that’s starting to affect them, “the writing is on the wall for the old approaches,” Wickizer said. “We’re at the beginning of the next 10-year wave [in these technologies], and over that time, we’ll end up doing decision support in the enterprise much differently than we have in the past.”