Democratization of data? Start with a data-centric architecture
- By Nick Psaki
- May 13, 2019
There is no question that data drives innovation, so it is critical that valuable data be more accessible to those who can leverage it while still being protected. The OPEN Government Data Act is a good first step, but to realize the full potential of data, the right data strategy -- along with safeguards to keep data secure -- must be in place from the beginning.
The federal government is the single largest enterprise in existence -- it employs more than 3 million people and accounts for 21% of gross domestic product. Each agency holds petabytes upon petabytes of data, with more being created every moment. This data is not only abundant, it is extremely valuable.
Data is increasingly considered a new currency -- both in government and industry. And, like currency, it cannot realize its true potential if hoarded and locked away. Instead, it must be shared across agencies and applications and become democratized -- a challenge complicated by an environment that combines countless on-prem and cloud-based implementations.
Democratization of data simply means making data more widely and readily available by removing the barriers that restrict it. In some cases, these barriers are regulations that prevent data from being accessible. In others, they are processes that route all data handling through data scientists or IT departments. For the federal government, democratization of data means making data -- such as publicly funded research and information about demographics, crime rates, weather patterns and home values -- widely available. However, data that stems from intellectual property or poses privacy or national security concerns must remain confidential and secure.
The right strategy for secure data democratization is based on a data architecture that unifies on-prem and cloud applications. First and foremost, it should drive application compatibility, so that data can flow. It should have consistent application programming interfaces so developers have a standard way to interact with data on-prem and in the cloud. This compatibility should make migrating applications, along with their security policies, straightforward, giving users the freedom to run applications where they want while the data safely follows. And finally, the right data architecture strategy will unlock new use cases like artificial intelligence and real-time analytics.
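The idea of a consistent API across on-prem and cloud storage can be sketched as a minimal abstraction layer. The interface and class names below are hypothetical illustrations, not any vendor's actual API: the point is that once both backends expose the same calls, migration code no longer cares where the data lives.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Hypothetical uniform interface: the same calls work on-prem or in the cloud."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in for an on-prem array; a cloud backend would implement
    the same interface against its own service."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

def migrate(src: ObjectStore, dst: ObjectStore, keys):
    # Because both backends share one API, migration is backend-agnostic:
    # the same loop moves data on-prem -> cloud or cloud -> on-prem.
    for key in keys:
        dst.put(key, src.get(key))
```

In a real deployment the two `ObjectStore` implementations would wrap different systems, but the application and migration tooling would only ever see the shared interface.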
A data-centric architecture has five key attributes:
- Fast delivery of shared data. Modern systems should always be fast, built on flash and designed from Day One to be shared, because tomorrow’s applications expect shared data.
- On-demand and automated. Standardization and automation of storage architecture will support on-demand consumption and automated delivery to accelerate innovation and reduce costs.
- Exceptionally reliable and secure. The data infrastructure must be able to protect sensitive data.
- Hybrid by design. Storage volumes should easily move to and from the cloud, making application and data migration easy, but also enabling hybrid use cases for application development, deployment and protection.
- Constantly evolving and improving. Users expect the cloud to be reliably accessible, continuously improve and deliver more value every year for the same or lower cost. Storage infrastructure must also be architected for constant improvement without ever taking users offline.
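The "on-demand and automated" attribute above amounts to treating storage as a declarative request rather than a manual ticket. A toy sketch of that idea follows; every name here (`VolumeRequest`, `provision`, the field names) is hypothetical and stands in for whatever a real platform's automation API provides.

```python
from dataclasses import dataclass

@dataclass
class VolumeRequest:
    """Declarative description of the storage a consumer wants."""
    name: str
    size_gb: int
    tier: str      # e.g. "flash", per the always-fast attribute
    replicas: int  # resiliency is requested, not hand-configured

def provision(request: VolumeRequest) -> dict:
    """Hypothetical automated provisioning: validate the request and
    return the volume the platform would create on demand."""
    if request.size_gb <= 0 or request.replicas < 1:
        raise ValueError("invalid volume request")
    return {
        "name": request.name,
        "size_gb": request.size_gb,
        "tier": request.tier,
        "replicas": request.replicas,
        "status": "online",
    }
```

Because consumption is expressed declaratively, the same request can be fulfilled by automation on-prem or in the cloud, which is what lets standardization reduce cost while accelerating delivery.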
What’s holding us back?
There was a time when having IT departments or data scientists serve as data gatekeepers made sense. However, as the amount of data explodes and the need for big data to drive innovation expands, this model is no longer sustainable.
Legacy storage has also become a bottleneck for agencies that want to take advantage of big data for real-time intelligence. Within the last few years, the amount of compute required to run bleeding-edge deep learning algorithms has jumped 15-fold, and GPU power has increased 10-fold. By and large, however, legacy storage capabilities have remained stagnant.
The complexity of data now being distributed across on-prem, cloud and hybrid environments also imposes limits. Nowhere is this divide more extreme than at the storage tier. On-prem, dedicated storage arrays with rich features and resiliency deliver a model in which the application tends to rely on the storage infrastructure for resiliency. In the cloud, relatively simple storage services are designed to be shared and to scale almost limitlessly -- dictating a very different way to build applications, which often implement much of the resiliency themselves. That makes sense: each of these storage layers was designed for the applications it supports. But when building hybrid applications, data becomes a key stumbling block.
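The cloud-native pattern described above, where resiliency lives in the application rather than the array, can be illustrated with a small retry wrapper. This is a generic sketch of the pattern, not any particular storage SDK; the `write` callable and `IOError` failure mode are assumptions for illustration.

```python
def write_with_retry(write, key, data, attempts=3):
    """Sketch of application-level resiliency typical of cloud storage
    clients: the application retries transient failures itself instead
    of relying on the storage infrastructure for durability."""
    for _ in range(attempts):
        try:
            write(key, data)
            return True  # write landed
        except IOError:
            continue  # transient failure: try again
    return False  # all attempts exhausted
```

An on-prem application written against a resilient array typically omits this layer entirely, which is exactly why moving it to cloud storage, or building one application that spans both, is harder than it looks.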
Accessible data opens new doors for innovation -- it allows agencies to explore critical metrics about their programs and enables federally funded research to be leveraged for economic growth, innovation and public health, among other benefits. Here, a hybrid cloud strategy will help agencies succeed.
Nick Psaki is principal system engineer with Pure Storage.