Reality Check


Hadoop: The good, the bad and the ugly

Hadoop is a disruptive force in the traditional data management space. However, there are both good and bad sides to the disruption, as well as some ugly marketing hype fueling it.

The good side of Hadoop’s disruption is in the realm of big data. 

Hadoop is an open source, Java-based ecosystem of technologies that exploit many low-cost, commodity machines to process huge amounts of data in a reasonable time. The bottom line is that Hadoop works for big data, functions well at a low cost and is improving every day. A recent Forrester report called Hadoop’s momentum “unstoppable.”
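
To give a sense of the programming model, the canonical first example is a word count: a map function emits a count of one for every word in its slice of the input, a reduce function sums those counts per word, and Hadoop handles the distribution, shuffling and fault tolerance across the cluster. A rough sketch against the Hadoop 2.x Java API might look like the following; the two arguments are simply HDFS input and output paths supplied when the job is submitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each block of input, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives every count emitted for a given word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}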

Currently there are hundreds, even thousands, of contributors to the Hadoop community, including dozens of large companies such as Microsoft, IBM, Teradata and Intel. Hadoop has proven to be a robust way to process big data, and its ecosystem of complementary technologies is growing every day.

But there’s a bad side to Hadoop’s disruption.

First, its very success is causing many players to jump in, which increases both the confusion and the pace of change in the technology. The current state of Hadoop is in radical flux: every part of the ecosystem is undergoing both rapid acceleration and rapid experimentation.

Furthermore, parts of the ecosystem are extremely immature. When I tech edited the book “Professional Hadoop Solutions,” I saw firsthand how some newer technologies like Oozie had configuration-file schemas that were still immature and will undergo significant change as those projects evolve.

Hadoop 2.0 only came out in 2013 with a new foundational layer called YARN. Now there is Hadoop Spark, a more general-purpose parallel computation approach that competes with, and is faster than, Hadoop MapReduce. It is not unrealistic to say that the technology is experiencing both extreme success and extreme churn simultaneously.
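
The practical difference shows up in the programming model: the same word count written against Spark’s Java API keeps intermediate results in cluster memory rather than writing them back to disk between steps. Here is a rough sketch, assuming the Spark 1.x Java API and Java 8 lambdas, with the input and output again being HDFS paths.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Cluster master, memory and other settings are supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);  // e.g., an HDFS path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")))    // split lines into words
                .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
                .reduceByKey((a, b) -> a + b);                          // intermediate data stays in memory

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}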

Second, there is feature immaturity that increases the risk of adoption. Technologies like Facebook’s Presto compete with Apache Hive, a data warehouse infrastructure built on top of Hadoop. As with any other emerging technology, it is best to stay away from the bleeding edge and stick to the more stable core components.
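
For context, Hive exposes a SQL-like query language over data stored in HDFS and compiles those queries into batch jobs on the cluster. A minimal sketch of querying it from a Java application over JDBC might look like the following; the server address and the web_logs table are purely illustrative, and it assumes the HiveServer2 JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (host, port and credentials here are placeholders).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Hive turns this SQL-like query into jobs that scan files in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}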

The ugly side of Hadoop’s disruption is the technology overreach fueled by the marketing departments of numerous new entrants to the Hadoop/big data space. Hortonworks Inc., which focuses on the support of Hadoop and just received a $100 million investment, recently published a whitepaper titled “A Modern Data Architecture with Apache Hadoop: The Journey to a Data Lake.”

The paper makes the case for augmenting your current enterprise data warehouse and data management architecture with a Hadoop installation to create a “data lake.” Of course, “data lake” is a newly minted term that basically promises a single place to store all your data, where it can be analyzed by numerous applications at any time.

Basically, it’s a play for “Hadoop Everywhere and Hadoop for ALL DATA.” To say this is a bold statement by Hortonworks is being kind. The vision of a data lake is not a bad one – a store-everything approach is worthwhile. However, it is wildly unrealistic to say that Hadoop can get you there today. Executing successfully on that vision is a minimum of five years out.

On the positive side, let me add that I do believe Hadoop can achieve this vision if it continues on its current trajectory – it is just not there today. For example, the Hadoop Distributed File System is geared toward extremely large files, which is at odds with a store-everything approach that must also absorb huge numbers of small files. Additionally, Hadoop’s analysis features are geared to processing homogeneous data like Web logs, sensor data and clickstream data, which clashes with the vision of storing everything, including a wide variety of heterogeneous formats.

A reality check comparing Hadoop’s current status for handling data management tasks (outside of its big data realm) to mature data management technologies like ETL and data warehouses can only conclude that hyperbole like the Hortonworks whitepaper is a classic case of technology overreach.

So government IT managers should be wary of technology overreach, focus on known success areas and use the right tool for the right challenge. Right now, Hadoop successfully tackles big data.

For any other use of Hadoop at this time, your mantra is caveat emptor.

Michael C. Daconta (mdaconta@incadencecorp.com or @mdaconta) is the Vice President of Advanced Technology at InCadence Strategic Solutions and the former Metadata Program Manager for the Homeland Security Department. His new book is “The Great Cloud Migration: Your Roadmap to Cloud Computing, Big Data and Linked Data.”

Posted by Michael C. Daconta on Apr 02, 2014 at 11:10 AM


Reader Comments

Sun, Apr 20, 2014 Prasad Hyderabad

To learn hadoop development and succeed in getting hadoop jobs, which training base is better cloudera or hortonworks?

Sun, Apr 6, 2014 Neil C

As a developer who has worked on "big data" projects in the defense space, I would say the first thing you should do is figure out if you actually have a big data problem. Many vendors will tell you that you do so they can sell you a Hadoop solution you may not need. Mr. Daconta is quite right on vendor glut. If you really do have a big data problem, however, Hadoop is far more robust than Mr. Daconta suggests. Many private sector companies thrive on various elements of the Hadoop stack, and the technology has been around for quite some time. But yes, setting up and maintaining a cluster is a pain. I would also mention that this post fails to recognize other big data use cases like streaming data (i.e. Storm and Spark Streaming) and graph data (i.e. Apache Giraph and Titan). There are some factual errors in the post: 1) There is no "Hadoop Spark." There is Apache Spark, an open-source project started at Berkeley, which performs large-scale batch analytics by utilizing cluster memory. Mr. Daconta is correct though that Spark "competes" with MapReduce, which performs large-scale batch analytics by utilizing reads and writes on disk. 2) Facebook Presto does not compete with Apache Hive but rather with Cloudera Impala and Apache Drill, both of which run on top of Hive and are open-source Google Dremel spinoffs. 3) While I may have misunderstood Mr. Daconta's point on heterogeneous data, you can certainly accomplish that with a serialization framework like Apache Thrift, Google Protocol Buffers, Avro, or Parquet among others.

Thu, Apr 3, 2014 BillS United States

Mike, Great summary, as usual. There is great potential in Hadoop but it is not the answer to every data requirement and currently suffers from over hype and a serious lack of qualified people to deliver the current capabilities.

