Mike Daconta

COMMENTARY

Big data: You'll have it, but can you handle it?

Cloud computing will produce massive datasets, with a unique set of challenges for government data managers

In 1999, I was called in to troubleshoot a customer’s client/server application that had recently failed a government acceptance test by taking more than 20 minutes to complete queries during stress testing. After months of intense software redesign that included overcoming pushback from a recalcitrant software development team, we were able to increase query performance by 2,000 percent, and the system subsequently passed its acceptance test. 

That experience taught me two hard-won lessons: First, even though I am a staunch advocate of Donald Knuth’s admonition that “premature optimization is the root of all evil,” performance matters. And second, scalability is hard to achieve. 

Or at least it used to be. Cloud computing is changing that. It is making scalability easier and enabling a proportional increase in the size and scope of data that organizations can process. These two ramifications — instant scalability and the advent of “big data” — are reshaping the computing and information management landscapes. Previously, big data would significantly degrade the scalability of an application, and programmers would therefore introduce throttling mechanisms or look to Moore’s law to bail them out of performance problems. But now, you can have your cake (big data) and eat it, too (scalability)! As we will see, nowhere is the need for processing big data more urgent than in the U.S. government. 




For the Defense Department, the ability to rapidly exploit huge volumes of data can mean the difference between life and death. Thus, the Army recently announced it had deployed its first tactical cloud to Afghanistan. The Health and Human Services Department is funding grants to sift the huge volumes of data expected to follow adoption of electronic health care records. Meanwhile, the National Oceanic and Atmospheric Administration and Environmental Protection Agency routinely create huge quantities of sensor data as they monitor the physical environment.

Agency by agency, from the Securities and Exchange Commission to the Justice and Homeland Security departments and every other large government organization, volumes of data are increasing exponentially. These agencies are already struggling with big data, want big data to improve analysis and make better decisions, or a combination of the two.

So what is big data? First, it is not your father’s data. Some examples are cell phone geolocation data, sensor data, surveillance data, Wikipedia text, social media status updates and many other streams of continuous data. These streams might not be record- or document-oriented. Instead, they are often transient or might be aggregated from multiple sources.  

In fact, I am beginning to see big data as an emerging data type with a unique set of properties and challenges. Big data has different metadata and processing requirements, as is evident in parallel processing algorithms such as map/reduce.

With big data, all the common metadata attributes for accuracy, lineage, security and privacy take on increased importance because of the volume of data in question. Meanwhile, parallelization is a key part of processing big data to enable useful results in a reasonable time frame. Along with parallelization, visualization and summarization are core processing techniques for big data.
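To make the map/reduce idea concrete, here is a minimal sketch of the pattern applied to word counting over a stream of short text records. The function names (map_phase, shuffle, reduce_phase) and the sample data are illustrative choices, not the API of any particular framework; in a real cluster, the map phase would run in parallel across machines.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (word, 1) pairs for each record. Because each record is
    # processed independently, this step can run in parallel across
    # many chunks of a large dataset.
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Group emitted values by key, as the shuffle stage of a
    # map/reduce system would before handing keys to reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer combines the values for one key; here, a sum.
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical sample input standing in for a big-data stream.
records = ["big data is big", "data streams"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts["big"] == 2, counts["data"] == 2
```

The same three-stage shape scales because the map and reduce stages each operate on independent pieces of work, which is what lets cloud platforms spread them across many nodes.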

Given the scale of these datasets, a processing mistake or unauthorized spillage of big data means big trouble. In addition to increased prudence, we must add metadata attributes that are unique to big data, such as granularity, degree of aggregation, use of heuristics and the degree of preprocessing. Other attributes that might apply include time span, geospatial information, transience, transactional capabilities and many others.
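One way to picture these proposed attributes is as a simple record attached to each dataset. The sketch below is an illustration only: the field names and the sample NOAA-style feed are assumptions for the sake of the example, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class BigDataMetadata:
    # Attributes suggested in the column; names are illustrative.
    source: str
    granularity: str               # e.g., "per-reading", "hourly rollup"
    aggregation_degree: str        # raw, partially aggregated, or summary
    uses_heuristics: bool          # were heuristic shortcuts applied?
    preprocessing: str             # description of preprocessing performed
    time_span: Optional[Tuple[str, str]] = None  # (start, end) dates
    geospatial: Optional[str] = None             # covered region, if any
    transient: bool = False        # stream not retained after processing

# Hypothetical example: an environmental sensor feed.
sensor_feed = BigDataMetadata(
    source="NOAA buoy sensors",
    granularity="per-reading",
    aggregation_degree="raw",
    uses_heuristics=False,
    preprocessing="unit normalization",
    time_span=("2011-01-01", "2011-04-01"),
    geospatial="Gulf of Mexico",
    transient=True,
)
```

Carrying a record like this alongside each dataset gives downstream consumers a way to judge how much to trust derived results, which matters more as volume grows.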

For government IT managers and CIOs, big data is at the doorstep. Now is the time to rethink your data architectures to accommodate this new type of data. Big data will hold great promise or peril based on your ability to understand and take advantage of it.

Reader Comments

Wed, Apr 27, 2011 Richard Ordowich

Wall Street may be a predictor of the systemic risks associated with "big data". Ratings data is just one example: everyone in that community, including government agencies, relied on it and accepted it as reliable, and it proved to be one of the fundamental data weaknesses. Algorithms also control this industry, and we have little in the way of tools and techniques to monitor and control those algorithms, as exemplified by the “flash crash”.

Few datasets today contain the attributes for accuracy, lineage, etc., and yet these datasets are being used as the core of many analytics. Cloud computing and big data processing are out ahead, but our ability to manage the potential systemic risk of processing all this data in “real time” is not in place, or even recognized by many.

We need not only better metadata but also ways to better manage these algorithms that process the data. Improved validation, verification and certification of these algorithms as well as increased transparency of these algorithms must be developed to reduce the systemic risks.
