In big data, Hadoop rules
Breaking data into manageable pieces allows the unstructured to become structured.
The problem with unstructured data is that, well, it’s unstructured. The data that organizations have learned to manipulate in the past has mostly been structured according to well-understood schemas, and users have developed data models that let them search that data precisely with well-defined queries.
Unstructured data allows for none of that. Database companies such as Oracle have developed ways to massage some unstructured data so that it can be inserted into relational databases and made searchable, but those techniques won’t scale to the volumes of unstructured data now appearing.
The engineers behind what eventually became Google knew they had to come up with something that would allow for fast, accurate searches of the mass of unstructured data that populates the Web. The initial result was the MapReduce programming model, which processes very large datasets by splitting the work into tasks and assigning them to large clusters of commodity processors and storage.
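The model itself is simple enough to sketch in miniature. The toy word count below (a single-process Python illustration, with all names invented here, not drawn from any Hadoop API) shows the map, shuffle and reduce phases that a real cluster runs in parallel across data splits:

```python
from collections import defaultdict
from itertools import chain

# Toy sketch of the MapReduce model. A real cluster runs many mapper
# and reducer tasks in parallel on separate chunks of the input.

def mapper(line):
    # Map phase: emit one ("word", 1) pair per word in the record.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: combine the values for one key; here, a sum.
    return key, sum(values)

def mapreduce(lines):
    pairs = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

print(mapreduce(["big data big clusters", "big jobs"]))
# {'big': 3, 'data': 1, 'clusters': 1, 'jobs': 1}
```

Because each mapper sees only its own chunk of input and each reducer sees only one key’s values, the framework can spread both phases across thousands of machines without the functions themselves changing.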
Hadoop was developed some time later as an open-source implementation of MapReduce and has become the de facto standard for processing large datasets because of its flexibility and its ability to scale cost-effectively.
“It’s not only a way of doing parallel processing of this data, but it also provides for a way of storing and managing that data on the multiple disks that exist alongside those processors,” said Bob Gourley, chief technology officer at Crucial Point.
Gourley says the core processing capability of Hadoop has grown into a full framework of tools that, as he laid out in a March 2012 paper for Crucial Point subsidiary CTOlabs.com, includes a data warehouse infrastructure (Hive), a high-level language for parallel computation (Pig), a scalable distributed database for storing large tables (HBase), a scalable distributed file system (the Hadoop Distributed File System), and tools for rapidly importing and managing data and coordinating the infrastructure (such as Sqoop, Flume, Oozie and ZooKeeper).
That inexpensive framework of open-source tools, overlaid by a growing universe of powerful analytical tools, points to a likely future for big data in government, Gourley believes.
Not all Hadoop implementations are for stand-alone big data projects. Oracle, for example, uses Hadoop as a front end to existing relational databases to reduce large unstructured datasets to something that those existing databases can handle. That helps agencies make even better use of the resources they already have.
“We have built a connector from the [Hadoop framework] into the relational database so we can capture the important stuff through our big data engine and then load it into the relational database so customers can use the skills they do have to do the analytics,” said Peter Doolan, group vice president and chief technologist for Oracle Public Sector. “In that way, we see big data as an acquisition net to capture data that hasn’t been used before.”
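The pattern Doolan describes can be sketched in a few lines. This is a hypothetical illustration only, not Oracle’s actual connector: a large pile of raw records is boiled down to a small aggregate (the job a Hadoop cluster would do at scale), and the result is loaded into a relational table that analysts can query with the SQL skills they already have. Python’s sqlite3 stands in for the relational side.

```python
import sqlite3
from collections import Counter

# Hypothetical sketch of the "acquisition net" pattern: reduce a
# large unstructured log to a small aggregate, then load the result
# into a relational table for conventional SQL analytics. A real
# deployment would run the reduction on a Hadoop cluster and load
# into an enterprise database; sqlite3 is a stand-in here.

raw_log = [
    "ERROR payment timeout",
    "INFO user login",
    "ERROR payment timeout",
    "WARN disk low",
]

# "Big data" stage: boil raw records down to counts per event level.
summary = Counter(line.split()[0] for line in raw_log)

# Load stage: insert the small aggregate into a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE event_summary (level TEXT, n INTEGER)")
db.executemany("INSERT INTO event_summary VALUES (?, ?)", summary.items())

# Analysts query the result with ordinary SQL.
for level, n in db.execute(
    "SELECT level, n FROM event_summary ORDER BY n DESC"
):
    print(level, n)
```

The relational database never sees the raw volume, only the distilled result, which is why existing systems and skills remain useful downstream of the big data engine.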
Hadoop isn’t for every government organization, however, particularly those that lack the resources to manage a complicated piece of software that might require significant custom coding to fit an agency’s needs. For those organizations, there are commercially supported distributions of open-source Apache Hadoop, such as the one provided by Cloudera.
There are already significant examples of successful implementations of Hadoop in government. One is the General Services Administration’s USASearch, which is used by more than 550 government websites and won the Government Big Data Solutions Award at the annual Hadoop World Conference in 2011.
Using Hadoop with the Hive open-source data warehousing software fundamentally changed how the system designers thought about data, said Ammie Farraj-Feijoo, GSA's program manager for USASearch, in a story published in Federal Computer Week.
There won’t be a wholesale rush to Hadoop by government agencies, Doolan said, because they have too much invested to simply rip and replace. However, agencies asked to root out problems such as fraud, waste and abuse in their programs, without knowing in advance what the answers look like, “are starting to take those huge datasets and trawl them for the insights that [the Hadoop framework] can provide,” he said.