What you need to know about big data

EDITOR'S NOTE: This article was updated Feb. 10, 2012, to correct the name of the Oracle Big Data Appliance.

The king is dead.  Long live the king.

For the past several years, the buzzword was "the cloud."  Now it's "big data."

Thanks to the exploding use of sensors, mobile devices and social networks, coupled with broadband communications, the amount of data collected by government, private enterprise and individuals has far outstripped our ability to analyze it effectively and even to store it.

It has been estimated that Wal-Mart records more than 1 million customer transactions each hour, resulting in more than 2.5 petabytes of data being stored, the equivalent of 167 times the data stored in the Library of Congress. Facebook is reported to store 40 billion photographs. And an estimated 35 hours of video content is uploaded to YouTube every minute.

And, of course, the federal government is also a prodigious collector of data. One project alone — NASA's Earth Observing System Data and Information System — has accumulated more than 3 petabytes of data since 2005. "Data volumes are growing exponentially," the President's Council of Advisors on Science and Technology warned in a December 2010 report. "Every federal agency needs to have a 'big data' strategy."

And it's not just the amount of data that presents challenges. The types of data being collected are changing. "Eighty percent of the new information is coming in as unstructured information," said Frank Stein, director of IBM's Analytics Solution Center. "Look at the growth of YouTube and all the documents in PDF files. A huge variety of data types is really adding to the volume of data."

What's more, analyzing video streams for significant content, making it searchable and integrating it with other types of datasets is a feat beyond the reach of traditional relational database systems.

Unfortunately, the price of drowning in data can be significant. Military investigators blamed "information overload" for a Predator drone attack that killed 23 Afghan civilians in February 2010. Drone operators, it was found, could not keep up with monitoring the drone's video feeds while at the same time participating in instant messaging and radio communications with intelligence analysts and troops.

While the explosion of digital data collection poses challenges, it also offers opportunities. Various agencies at federal, state and local levels have accumulated massive amounts of data on such varied topics as pollen in trees, water quality, disease patterns, transportation infrastructure, weather statistics and satellite photography.

If the tools are developed to integrate these massive datasets across governmental boundaries and make them efficiently searchable, unexpected benefits may emerge. "When you can start integrating datasets you see insights that you never saw before," Stein said.

Hadoop a catalyst

The main reason for all the recent buzz about big data is that those tools are just now emerging.

"In the public sector we have so much data that we do nothing with," said Peter Doolan, chief technologist for Oracle's public-sector division. "We haven't even had the tools to do anything with it. We now do."

What makes it difficult to talk about big data, however, is that it is not a single technology.  Rather, it is a confluence of technological developments that allow analysts to store, manage and analyze large and diverse datasets.

"We call it big data just to give it an easy-to-remember name," said Michael Chui, senior analyst at the McKinsey Global Institute, the research arm of the McKinsey & Co. management consulting firm. If there is a key technology enabling the analysis of big data, Chui said, it was the introduction of Apache Hadoop. Hadoop is essentially an open-source framework for rapidly analyzing huge datasets that may reside on multiple commodity computers. It is based on Google's MapReduce analysis engine, which parses data for distributed processing.

"We can take this commodity equipment and use it as a high-performance computer on the cheap," said Dave Ryan, CTO at General Dynamics IT. "The parallel processing idea that was only for the big labs can now be done by Web 2.0 startups." It is that processing power that allows for the analysis of extremely large datasets.
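For readers who want a feel for the MapReduce idea behind Hadoop, the following is a minimal single-machine sketch in Python, not Hadoop's actual API. The function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are hypothetical, chosen only for illustration. Each inner list stands in for the slice of data held on one commodity machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a total."""
    return key, sum(values)


def run_job(partitions):
    """Run map, shuffle and reduce over every partition in turn.

    A real cluster would run the map step on each machine in parallel;
    this sketch simply loops over the partitions on one machine.
    """
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: two partitions, as if stored on two machines.
partitions = [["big data big"], ["data tools"]]
counts = run_job(partitions)
print(counts)
```

Because each record is mapped independently, the map step can be spread across as many machines as hold the data, which is the property the commodity-hardware approach relies on.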

But Chui says that big data isn't just about Hadoop-based data processing. It's also about faster processors, wider bandwidth communications and larger, cheaper storage. "And in addition to all the analytics, how can you make this data consumable?" he asked. "So the idea of visualization or interface technologies to make the results of analysis consumable is a strongly felt need."

Big data in a box

While the main pieces of big data analysis software are available as open-source downloads, the major database vendors are rushing to deliver packaged big data solutions.  "All of these pieces today are available for you if you wish to go ahead and download all of these pieces," said Doolan. "We're trying to put big data in a box. Oracle has announced the thing we are calling the Oracle Big Data Appliance. It has all that software and hardware in one box."

Similarly, IBM offers InfoSphere BigInsights, a Hadoop-based analysis tool.

According to Stein, IBM is also focusing on developing the ability to analyze data streams while they are still in motion. "It takes time to write to disk and put it in your database and draw statements against that," said Stein.  "We're talking about doing all of this on the fly as the data is in motion.  You can actually write a program to tell the box what you want to look for just like you tell a firewall what to look for in terms of viruses or other kinds of things. This is at the early stage."
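The idea of examining data while it is in motion, rather than after it has been written to a database, can be sketched in a few lines of Python. This is an illustrative sketch only and does not represent IBM's product. The names watch, sensor_feed and too_hot, the event fields and the 45-degree threshold are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator


def watch(stream: Iterable[dict], rule: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield matching events one at a time, without storing the stream."""
    for event in stream:
        if rule(event):
            yield event


def sensor_feed() -> Iterator[dict]:
    """Hypothetical stand-in for a live feed of sensor readings."""
    yield {"sensor": "a", "temp_c": 21.5}
    yield {"sensor": "b", "temp_c": 48.0}
    yield {"sensor": "a", "temp_c": 22.1}
    yield {"sensor": "c", "temp_c": 51.3}


def too_hot(event: dict) -> bool:
    """The rule that tells the filter what to look for (assumed threshold)."""
    return event["temp_c"] > 45.0


# Each event is checked as it arrives; nothing is written to disk first.
alerts = list(watch(sensor_feed(), too_hot))
print(alerts)
```

The rule is passed in separately from the loop that reads the stream, much as a firewall is given a list of patterns to watch for.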

The potential of big data solutions has agency staff and integrators alike excited.

"As an integrator what I find so interesting is figuring out how we can apply the solution to problems we have had over time in all the different domains," said Ryan, pointing to such uses as traffic management, fraud detection and natural language processing for discovery and other regulatory matters.

Already, a variety of federal agencies are using big data applications. The Office of Personnel Management is using a SAS analytic suite to scan data records from more than 400 health insurance companies participating in the Federal Employees Health Benefits Program for fraudulent claims and other irregularities. The SAS software is also being used to analyze the millions of records in the CMS Chronic Condition Data Warehouse, a repository for Medicare and Medicaid research data.

GCE Federal is currently working on a project that will combine and make searchable procurement data across the entire federal government.  "Imagine if you could combine procurement data from every agency in the government in one big database and have tools on top of it that would allow stakeholders — from public users to government organizations — to be able to go in there and slice and dice through procurement data in a highly intuitive fashion," said GCE Federal CEO Ray Muslimani.

The ability of big data applications to integrate disparate datasets for analysis offers the potential not only for unexpected insights but also for interagency cooperation that can result in major savings in an era of constricting budgets.

"As we are more and more providing Web services that take advantage of the data, there is great opportunity for agencies to collaborate on the resources," said Rob Dollison, program manager with the National Geospatial Program, a unit of the U.S. Geological Survey. "We often have use for the same types of data and in the past we just had to make copies. More and more we are able to find ways to collaborate in the acquisition of the data and how we make it available."

Dollison said USGS already shares a lot of aerial photography with the Agriculture Department. "Anytime a government agency is looking at investing in things we really are required to look at what exists, what other agencies are doing, and how can we collaborate on it," said Dollison. "In most cases there is an awful lot of incentive to collaborate.  And I think the urgency is growing."

Oracle's Doolan agrees. "I think the next two or three years you're going to see a massive leap forward," he said. "It's going to be very interesting."

New ‘data scientists’

The potential of big data also brings with it challenges for agencies and departments.

"These new technologies are coming down the pike," Chui said. "What's important is to understand how you can start to experiment with some of these new or innovative types of technologies. In general you can do it in a relatively small-scale. In some cases that will require some shifting of budget dollars."

It will also, warn other analysts, require a lot more highly trained people, a special challenge for public-sector users.

"You also have to have the right people to be able to interpret the results," said Anne Lapkin, a research vice president with the Gartner Group. "One of the things that we're seeing now is the emergence of something that we are calling the 'data scientist,' who is someone who has an innate understanding of the data, who understands the analytical techniques, who understands statistical analysis and can actually formulate the appropriate queries to get a sensible result. Those are people who are in very short supply in the private sector and the public sector. It's an entirely new skill set."

"Finding the people who can derive actionable insight from large amounts of data is a tremendous challenge," Chui said. His team has estimated a potential gap of approximately 140,000 to 190,000 positions. "That suggests some interesting things from a policy standpoint. If you are a leader of a public-sector organization, finding that type of talent, motivating them and retaining them will be very important for you if you're to be able to do your job.

"As a policy-maker, if you recognize this is going to be a basis of competition not only for companies but for countries it will be very important to try and make sure that we have the right policies in place so that we are educating and graduating people with these types of skills," he said.

Meta standards

Another challenge of growing importance is the development of standards.

Hadoop and similar data analysis tools are efficient at dealing with massive amounts of data because they focus on manipulating metadata — data about the data — which is far more efficient than moving the data itself around. The problem with that, says Lapkin, is that current big data implementations tend to have the metadata embedded within the analytical code.

"So you can't take that information, for example, and easily integrate it with other information in another system," Lapkin said. "People who are doing Hadoop/MapReduce implementations are building themselves a new little set of data silos to replace the ones that they had previously."
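One way to picture the alternative to metadata buried in analytical code is to keep the description of a dataset in a separate, machine-readable document that any system can read. The Python sketch below illustrates that idea under stated assumptions. The dataset name, field names, JSON layout and helper functions are all hypothetical and do not reflect any particular agency's format or any Hadoop feature.

```python
import json

# Hypothetical metadata document, kept apart from the analysis code so that
# another system could read the same description of the dataset.
METADATA_JSON = """
{
  "dataset": "water_quality_sample",
  "fields": [
    {"name": "station", "type": "str"},
    {"name": "ph", "type": "float"}
  ]
}
"""

# Maps the type names used in the metadata to Python conversion functions.
CASTS = {"str": str, "float": float, "int": int}


def load_schema(text):
    """Turn the metadata document into a list of (field name, converter)."""
    meta = json.loads(text)
    return [(field["name"], CASTS[field["type"]]) for field in meta["fields"]]


def parse_row(line, schema):
    """Parse one comma-separated row using only what the schema describes."""
    values = line.split(",")
    return {
        name: cast(value.strip())
        for (name, cast), value in zip(schema, values)
    }


schema = load_schema(METADATA_JSON)
row = parse_row("ST-01, 7.2", schema)
print(row)
```

Here the parsing code knows nothing about the dataset beyond what the metadata document tells it, so changing or sharing the description does not require rewriting the analysis.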

In getting one of the public sector's highest profile big data projects  — Data.gov — up and running, Marion Royal quickly saw the need for metadata standards. "When we first started we established what we called the 'metadata template,' " Royal said. "Agencies were required to use that template to submit to Data.gov. So that was the beginning of some harmonization across government in defining datasets and how they might be used."

Data.gov, which launched in May 2009 with 47 datasets, now offers more than 400,000 datasets from 172 agencies and subagencies.

"There is still a lot of work to do," Royal said. "We need to define open standards to be able to share the data and make it available to developers who can develop applications that make use of this data regardless of where the data is stored, regardless of whether it is federal or state, because most of the data that citizens might use are typically found on the local and state data sites."

Finally, some analysts warn that, despite the potential power of big data tools, the emerging technology is not a silver bullet.

Lapkin said agency strategists looking at big data should keep their focus clearly on the problems they want to solve or services they want to offer. "Fundamentally, if you haven't defined what the problem is, then throwing people or technology or technique or a buzzword isn't going to do anything except waste money," she said. "Start small and spread out. Always work on very well-defined business outcomes. The stakes are higher and higher because it is very easy to throw huge piles of money at this stuff and not get any return. Everybody gets caught up in the hype."
