5 ways agencies can use Hadoop
- By Jonathan Janos
- Apr 10, 2014
The federal government is drowning in big data. Agencies are facing the reality that the government’s legacy IT systems were not designed to handle the volume, variety and velocity at which data is being generated today. When relational databases were created more than four decades ago, they were not intended to handle the massive amounts of data generated by high-definition video, photographic images, blog posts or social media content that is flooding the federal landscape.
Agencies are not only challenged by the complexity and volume of data that exists, but also by the need to analyze and leverage this sea of information to better understand the needs of their constituents.
There are a variety of ways to handle big data, but Hadoop – an open-source programming framework that distributes data across large clusters of commodity servers and processes it in parallel – not only handles it well, but does so faster and at lower cost than legacy systems.
First, a point of clarification: Hadoop does not replace relational databases and data warehouses. It supplements enterprise data architectures by providing an efficient way to store, process, manage and analyze data. It creates operational efficiencies that can have a positive impact on an agency’s mission and budget.
It is estimated that half the world’s data will be processed by Hadoop within five years. Hadoop-based solutions are already successfully being used to serve citizens with critical information faster than ever before in areas such as scientific research, law enforcement, defense and intelligence, fraud detection and computer security. This is a step in the right direction, but the framework can be better leveraged.
Here are five ways the federal government can take advantage of Hadoop:
Storage and analysis of unstructured and semi-structured content – Agencies can use Hadoop as a storage mechanism for capturing large volumes of unstructured and semi-structured content, free of the constraints of relational database technologies. They can then use a variety of tools to parse, transform, visualize and analyze it. This was the overarching motivation behind Hadoop’s creation: inspired by Google’s published designs for large-scale data processing, it was adopted by Yahoo and other large Web properties to support the collection and analysis of data on a grand scale.
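The parse-and-analyze pattern described above is typically expressed as a MapReduce job. A minimal sketch in Python, in the style of a Hadoop Streaming mapper and reducer (the function names here are illustrative, not part of Hadoop's API):

```python
from itertools import groupby

def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word in the raw text."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer: sum the counts for each word. Input must arrive sorted
    by key, which is what Hadoop's shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Locally, sorting stands in for the shuffle between map and reduce:
pairs = sorted(map_words(["Big data big", "data"]))
counts = dict(reduce_counts(pairs))
```

In a real cluster, the framework runs the mapper on each node holding a block of the input and routes the sorted pairs to reducers, so the same two functions scale from one file to petabytes.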
Initial discovery and exploration – When new agency needs arise, visual analytic tools are often used to provide a quick initial assessment of the data and how it might be used. These tools enable analysts to work with much greater speed and agility, quickly distinguishing between promising avenues of investigation and analytical dead ends. By fully using the memory and CPUs available on a Hadoop cluster, new visualization technologies can deliver a highly interactive visual playpen for diverse groups of analysts to quickly and easily access enormous quantities of data, slicing and dicing it for everything from simple dashboard reporting to sophisticated forecasting and pattern analysis.
Making total data available for analysis – Not only is Hadoop allowing organizations to store more data than ever before, it is also enabling them to better access and analyze it. All of it. Its distributed computing model allows for expanding storage capacity and processing power through the addition of more nodes to the cluster. Advanced analytical algorithms are rapidly being re-engineered to take full advantage of this scalable infrastructure. High-performance analytics can now be used to uncover anomalies or patterns hidden within billions of records. The need for sampling is reduced or even eliminated, producing greater insights in less time and streamlining the entire process for the analyst.
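The "no sampling required" idea rests on the fact that many statistics decompose into per-node partial aggregates that can be combined into an exact global answer. A sketch of that decomposition, assuming simple z-score anomaly detection over numeric records (the function names are hypothetical):

```python
from math import sqrt

def partial_stats(values):
    """Per-node step: count, sum and sum of squares for one data partition."""
    return len(values), sum(values), sum(v * v for v in values)

def combine_stats(partials):
    """Combine the partial aggregates from every node into a global
    mean and standard deviation -- every record contributes, no sampling."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, sqrt(max(variance, 0.0))

def anomalies(values, mean, std, z=3.0):
    """Flag records more than z standard deviations from the global mean."""
    return [v for v in values if std and abs(v - mean) / std > z]
```

Because the combine step is exact, a pattern computed over two nodes or two thousand gives the same answer as scanning all the records on one machine.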
Staging area for data warehouses and analytic data stores – Hadoop can feed data into a traditional data warehouse for business intelligence reporting and online analytical processing (OLAP) or into an analytical data mart for data mining and other advanced analytics. In this scenario, each technology is used for what it does best: Hadoop for storing and processing large volumes of raw data arriving in unpredictable formats, and data warehouse technology for maintaining structured data that can support multiple user groups.
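The staging handoff described above amounts to a transform step: Hadoop absorbs raw records in unpredictable formats and emits rows in the fixed schema the warehouse expects. A minimal sketch, assuming JSON event lines as the raw input and a CSV bulk load on the warehouse side (field names and schema are hypothetical):

```python
import csv
import io
import json

def stage_records(raw_lines):
    """Hadoop-side step: parse raw JSON event lines of unpredictable
    shape into a fixed schema, dropping malformed input."""
    for line in raw_lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # raw feeds are messy; skip records that won't parse
        yield {
            "event_id": event.get("id", ""),
            "agency": event.get("agency", "unknown"),
            "timestamp": event.get("ts", ""),
        }

def to_warehouse_csv(records):
    """Warehouse-side step: render the structured rows as CSV for bulk load."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["event_id", "agency", "timestamp"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The division of labor mirrors the article's point: the messy, high-volume parsing happens where storage is cheap and parallel, and only clean, structured rows reach the warehouse that serves multiple user groups.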
Low-cost, long-term storage of large data volumes – Hadoop is lowering the cost of data storage, allowing many organizations to store data in its original form, just as it was generated and collected, without quite understanding what its value may ultimately be. Hadoop can provide low-cost storage for information that is not presently critical, but may become valuable later in unexpected ways. This information can include transactional data, social media data, sensor data, scientific data, emails and IT system logs. As new uses for the data surface, it is immediately accessible in its original unedited format for further analysis and discovery. The availability of data and the growing sophistication of analytics become self-reinforcing, setting organizations on an upward path of continued process improvement.
It is estimated that more than 90 percent of the world’s data has been created in just the last two years. But the huge volume, variety of types and sources, and security demands of all this data place new burdens on the current technology infrastructure. Hadoop is positioned to help agencies extract the most value from their data assets, meeting the federal government’s next-generation computing needs and supporting its goal to continually deliver higher quality services.