Do you speak Hadoop? What you need to know to get started.
It’s a funny word. You have only a vague notion of what it is. You’ve heard that it takes a lot of work but is potentially beneficial. Maybe if you learned more about it you too could enjoy its benefits. But even if you wanted to try it, you wouldn’t even know where to start.
If it isn’t obvious, I’m talking about Hadoop. Not Zumba.
Google first described the ideas behind Hadoop 10 years ago, an eternity in technology, but it is only recently that the rest of us have begun to explore its potential. Even as businesses and government agencies have built Hadoop solutions, it remains a source of confusion to many. To assess its potential, IT managers must first understand what Hadoop is.
Hadoop is not just one thing. It is a combination of components that work together to facilitate cloud-scale analytics.
Hadoop provides an abstraction for running analytics on a cluster of commodity hardware when there is too much data for a single machine. The analytics program need not know about the cluster, how work is divided across it, or the vagaries of cluster management. If a machine fails, Hadoop handles that.
HDFS stands for the Hadoop Distributed File System. It’s optimized for storing lots of data across a computing cluster. Users simply load files into HDFS, and it figures out how to distribute the data. Virtually all interactions with Hadoop involve HDFS directly or indirectly.
MapReduce is often mistaken for Hadoop itself, but in fact it is Hadoop’s programming model (commonly in Java) for analytics on data in HDFS. To understand the conceptual foundations of MapReduce, imagine two relational database tables—one for bank accounts and the second for account transactions. To find the average transaction amount for each account, a user would “map” (or transform) the two original tables to a single dataset via a join.
Then all the individual transaction amounts with the same account number would be “reduced” (or aggregated) to a single amount via a “GROUP BY” clause. MapReduce allows users to apply precisely these same concepts to a large data set distributed across a cluster, but the operations can be quite slow because intermediate results are continually written to and read from disk.
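The map-then-reduce idea can be sketched in a few lines of plain Python. This is only a single-machine illustration of the concept, not the Hadoop API: the sample records and the function names (map_phase, shuffle, reduce_phase) are made up for the example, and the "shuffle" step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict

# Hypothetical sample data: (account_number, transaction_amount) records.
transactions = [
    ("A-1", 100.0),
    ("A-2", 40.0),
    ("A-1", 50.0),
    ("A-2", 60.0),
    ("A-3", 10.0),
]

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for account, amount in records:
        yield account, amount

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's list of values down to a single average."""
    return {key: sum(values) / len(values) for key, values in groups.items()}

averages = reduce_phase(shuffle(map_phase(transactions)))
print(averages)
```

On a real cluster, the map and reduce steps would run in parallel on many machines, each working on its own slice of the data.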
Hive allows users to project a tabular structure on the data so they can eschew the MapReduce API in favor of a SQL-like abstraction called HiveQL. Anyone used to SQL staples like “CREATE TABLE,” “SELECT,” and “GROUP BY” will find HiveQL eases the transition to Hadoop. Familiar abstraction aside, Hive queries run as MapReduce jobs under the hood.
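To give a feel for those SQL staples, here is the average-transaction query from the bank example written in ordinary SQL. It is a sketch that uses Python's built-in sqlite3 module as a stand-in, since it runs anywhere; it is not Hive, the table and column names are invented for illustration, and real HiveQL table definitions typically carry extra details such as storage format.

```python
import sqlite3

# An in-memory SQLite database stands in for a Hive table (assumption:
# the "transactions" table and its columns are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("A-1", 100.0), ("A-2", 40.0), ("A-1", 50.0), ("A-2", 60.0)],
)

# The same SELECT ... GROUP BY shape a HiveQL user would write.
rows = conn.execute(
    "SELECT account, AVG(amount) FROM transactions "
    "GROUP BY account ORDER BY account"
).fetchall()
conn.close()

print(rows)
```

The point is the shape of the query: someone who can write this against a relational database can write much the same thing against data in Hadoop.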
There are other notable components in Hadoop:
HBase. The Hadoop database and an example of the so-called NoSQL databases described in a previous column.
Zookeeper. A centralized service for coordinating activities among the machines in the cluster.
Hadoop Streaming. A MapReduce API that lets developers use popular scripting languages (e.g. Ruby or Python).
Pig. An analytic abstraction similar to Hive but with a query syntax called Pig Latin (yes, seriously), which prefers a scripting pipeline approach to the SQL-like HiveQL.
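As an illustration of the Hadoop Streaming item above, here is a minimal sketch of a mapper and reducer in Python. Streaming programs read lines from standard input and write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key. The "account,amount" input format and the function names are assumptions for the example; the last few lines imitate the sort locally rather than running on a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'account,amount' input lines into tab-separated key/value lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        account, amount = line.split(",")
        yield f"{account}\t{amount}"

def reducer(sorted_lines):
    """Average the amounts per account; input must already be sorted by key."""
    pairs = (line.strip().split("\t") for line in sorted_lines)
    for account, group in groupby(pairs, key=lambda pair: pair[0]):
        amounts = [float(amount) for _, amount in group]
        yield f"{account}\t{sum(amounts) / len(amounts)}"

# Local stand-in for a job: in a real Streaming job each function would
# read sys.stdin and print its output, and Hadoop would do the sorting.
sample = ["A-1,100.0", "A-2,40.0", "A-1,50.0"]
mapped = sorted(mapper(sample))
result = list(reducer(mapped))
print(result)
```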
As powerful as Hadoop is, many aspects of it remain too low-level, error-prone and slow for developers who need higher levels of abstraction. Cascading enables simpler and more testable workflows for multiple MapReduce jobs. Apache Spark lets developers treat data sets like simple lists and uses cluster memory to make jobs run faster. A companion project to Spark, Shark, similarly uses memory to make Hive queries run faster.
Apache Accumulo was originally built by the National Security Agency to supplement HBase with cell-level security. Numerous projects, including the analytics and visualization tool Lumify, are built on Accumulo.
In 2010, Google wrote a paper on Dremel, which facilitates fast queries on cloud-scale data. Dremel underpins Google’s BigQuery product and has never been released itself, but Cloudera’s Impala and MapR’s Apache Drill are open-source implementations of its ideas.
With network data (e.g. SIGINT or financial transactions) requiring graph analytics, MapReduce can be especially slow. A popular alternative programming model is Bulk-Synchronous Parallel (BSP), which abandons disk I/O for messages sent along the network. Apache Hama and Apache Giraph both use BSP to support graph analytics, and Titan is a graph database that uses HBase and other backends to store cloud-scale graphs.
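The BSP idea can be shown with a toy, single-process simulation: in each "superstep" every vertex reads the messages sent to it in the previous step, updates its own value, and sends messages to its neighbors, with a barrier between steps. The sketch below spreads the largest value through a small graph. It is not how Hama or Giraph are implemented, and the function name bsp_max and the sample graph are made up for the example.

```python
def bsp_max(graph, values):
    """Propagate the maximum value through a graph in synchronized supersteps.

    graph:  dict mapping each vertex to a list of its neighbors
    values: dict mapping each vertex to its starting number
    """
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex sends its value to each neighbor.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                # Only vertices whose value changed send new messages.
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox  # the barrier: messages arrive in the next superstep
    return values

# Hypothetical three-vertex chain: a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final = bsp_max(graph, {"a": 3, "b": 6, "c": 1})
print(final)
```

Nothing here touches disk between steps; the only thing passed along is the messages, which is the contrast with MapReduce drawn above.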
These tools excel at batch analytics, but what about analytics on data streaming from a feed like a message queue or Twitter? Apache Storm and Spark Streaming can help there.
All these tools leverage existing Hadoop artifacts like HDFS files and Hive queries. For example, I have written analytics with Spark against an existing HBase table.
Hopefully you can now assess Hadoop to see if it has a place in your enterprise. As for me, I am still trying to wrap my head around Zumba.