Data lakes: Don’t dive in just yet
- By David Ramel
- Aug 06, 2014
The idea of holding information in a data lake – a large repository of unstructured data from disparate sources available for widespread analytics from various users and applications – is an unfulfilled promise of big data, according to research firm Gartner Inc.
In a recent study, "The Data Lake Fallacy: All Water and Little Substance," Gartner said that while some vendors claim data lakes are essential to capitalizing on big data analytics, there is no common view among these companies about what a data lake is or how it can provide value.
The Gartner findings were recently covered by Application Development Trends (ADT), a sister site to GCN.
"In broad terms, data lakes are marketed as enterprisewide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, co-author of the Gartner report.
"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the up-front costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization."
However, Gartner co-author Andrew White said that while data lakes, driven by the need for more accessible data analytics, can help certain parts of an organization, the promise of effective enterprisewide data management has yet to be realized.
Data lakes address two problems, the analysts said. One is the problem of eliminating data silos. The second is a new problem in big data initiatives where types of data vary so much that putting the information into structured data warehouses hinders analysis.
The main risk of using data lakes is the absence of metadata and of an underlying mechanism to maintain it. Without both, a data lake can turn into a "data swamp," Gartner said.
Other risks include security and access control considerations, as data dumped into a lake might have associated privacy or regulatory requirements and shouldn't be exposed without oversight.
These risks, combined with performance considerations, led Gartner to advise companies to "focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake."
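To make the metadata and access-control risks concrete, here is a minimal Python sketch of a catalog that sits in front of a lake. It is an illustration only, not something described in the Gartner report: the names `CatalogEntry`, `LakeCatalog`, `register` and `readable_by`, the file paths, and the single `restricted` flag are all hypothetical, and a real deployment would need a persistent metadata store and a far richer permission model.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record kept alongside a raw object in the lake."""
    path: str
    source: str
    data_format: str
    restricted: bool = False  # True if privacy or regulatory rules apply


@dataclass
class LakeCatalog:
    """Toy in-memory catalog (assumption: real systems persist this)."""
    entries: dict = field(default_factory=dict)

    def register(self, entry: CatalogEntry) -> None:
        # Reject objects that arrive without descriptive metadata,
        # since undocumented data is what produces a "data swamp".
        if not entry.source or not entry.data_format:
            raise ValueError(f"missing metadata for {entry.path}")
        self.entries[entry.path] = entry

    def readable_by(self, cleared: bool) -> list:
        # Return the paths a caller may see; restricted data needs clearance.
        return sorted(
            path for path, e in self.entries.items()
            if cleared or not e.restricted
        )


catalog = LakeCatalog()
catalog.register(CatalogEntry("raw/clicks.json", "web", "json"))
catalog.register(CatalogEntry("raw/patients.csv", "clinic", "csv", restricted=True))
```

In this sketch, data with no recorded source or format is refused at the door, and restricted files are hidden from callers without clearance, which are the two kinds of oversight the analysts say a bare data lake lacks.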
Reactions to the Gartner report from various vendors in the big data/Hadoop market were mixed, according to the ADT report.
"The data lake is necessary for meaningful big data analytics – for the first time you can bring together diverse multi-structured data (transactions, customer interactions and machine data) without months/years of IT boiling it down to small data," Ben Werther, founder and CEO of big data analytics company Platfora Inc., told ADT.
"It is necessary but not sufficient – the missing piece is the native analytical tools that give frustrated business analysts the self-service iterative workflow to weave together that data for insights not possible with traditional BI tools."
Jack Norris of MapR Technologies Inc., often referred to as one of the "big three" vendors offering Apache Hadoop-based distributions, told ADT that "the cost, efficiency and agility of Hadoop is driving the adoption of data lakes across industries."
"Gartner is rightly pointing out that not all Big Data and Hadoop solutions provide the performance, security and data protection capabilities that customers need. MapR is specifically architected to address these enterprise requirements enabling organizations across industries to successfully deploy data lakes."
Nevertheless, the Gartner analysts indicated some new thinking might be required around the concept of data lakes.
"There is always value to be found in data, but the question your organization has to address is this: do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?" White said.
"If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."
This article originally appeared on Application Development Trends, a sister site to GCN.
David Ramel is features editor at MSDN Magazine.