Another View | A strategy for massive archives
- By Cray Henry
- Jul 02, 2008
Through the Defense Department's High Performance Computing Modernization Program (HPCMP), nearly 4,000 scientists and engineers have stored 6 petabytes of information, and they are expected to add 2 petabytes during the next year. A decade ago, data storage was a sideline activity in supercomputing; today, it is an essential part of the business.
Each year, the program now generates one-third the amount of information it has produced in its entire 15-year history.
Meeting the next five years' storage requirements will involve increasing the number of machines devoted to storage, improving mechanisms for predicting future storage needs, and possibly integrating algorithms into applications that allow users to catalog and define the storage period for new data.
Since its establishment in the mid-1990s, HPCMP has been able to procure and deploy major high-performance computing systems annually at its four shared resource centers.
In most years, the newly acquired HPC systems have provided an additional computational capability equal to 70 percent of all HPC systems in the program at the time.
Scientists and engineers generate an ever-increasing amount of data as they solve more complex problems. The program places no restrictions on the amount of data users generate and store, or how long it is to be retained. Such decisions are left to the programs that sponsor the work.
About a year ago, we formed a storage working group to survey storage management tools and hardware options.
During the next year, we will institute a number of strategies, such as a revised retention policy, reliance on users to manage their data more proactively, and an upgrade of storage systems, including new storage-density technologies.
To budget for storage, we need an annual storage target. Based on past growth and a management decision to try to slow that growth, we plan to add archive storage annually equal to 140 percent of the new data generated the previous year.
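The budgeting rule above can be sketched in a few lines. This is a hypothetical illustration, not the program's actual planning tool; the only figures taken from the article are the 6-petabyte archive and roughly 2 petabytes of new data per year.

```python
def plan_archive_addition(new_data_last_year_pb, factor=1.4):
    """Planned archive capacity to add this year, in petabytes:
    140 percent of the new data generated the previous year."""
    return new_data_last_year_pb * factor

archive_pb = 6.0        # current archive size (from the article)
new_last_year_pb = 2.0  # new data generated last year (approximate)

addition = plan_archive_addition(new_last_year_pb)
print(f"Planned archive addition: {addition:.1f} PB")  # 2.8 PB
```

The 40 percent headroom over last year's growth leaves room for growth to continue accelerating while still signaling to users that capacity is finite.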
We will also encourage customers to manage their own data and make timely decisions about what to keep. The past practice of keeping everything is simply not affordable.
To stay within a reasonable physical space and read and write data in a timely fashion, we have to upgrade our storage technology every few years. Data stored on media more than a few years old must be transferred to the newer, faster, denser media, incurring new costs every time the technology changes.
Customers will also be asked to consider what would be required in the future to re-compute the answers they want to save now. We expect computational costs to be about 10 percent of today's cost in four years.
So the question is: If you need the results again, could you simply re-compute them then instead of storing them? Additionally, our customers say the tools they have for managing their vast amounts of data don't help them make retention decisions. To assist them, we are investigating information life cycle management systems closely coupled with hierarchical storage systems.
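The store-versus-recompute question above reduces to a simple cost comparison. A minimal sketch follows; the 10 percent compute-cost ratio comes from the article, while the dollar figures are placeholder assumptions for illustration only.

```python
def cheaper_to_recompute(compute_cost_today, storage_cost_per_year,
                         years=4, compute_cost_ratio=0.10):
    """Return True if re-running the computation in `years` years is
    expected to cost less than storing the results until then."""
    future_compute_cost = compute_cost_today * compute_cost_ratio
    total_storage_cost = storage_cost_per_year * years
    return future_compute_cost < total_storage_cost

# Assumed figures: a $10,000 run today versus $500/year of storage.
print(cheaper_to_recompute(compute_cost_today=10_000,
                           storage_cost_per_year=500))  # True
```

Under these assumptions, recomputing in four years ($1,000) beats four years of storage ($2,000), which is why falling compute costs shift the balance away from keeping everything.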
We expect that this approach will let users assign metadata to their files to better catalog the age, source, project and program used to generate the data. As users generate new data, they will also be able to define the retention period.
Unfortunately, we have not found any single product that does all this. During the next few months, we will issue a request for proposals for integrated suites of software products to provide the information life cycle management we need.
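The kind of per-file cataloging described above might look like the following record. The field names and example values are illustrative assumptions, not a description of any product under evaluation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ArchiveRecord:
    """One archived file with the metadata the article envisions:
    age, source, project, generating program and retention period."""
    path: str
    created: date
    source: str          # generating system or site
    project: str
    program: str         # application used to generate the data
    retention_years: int

    def expires(self):
        """Approximate expiration date from the retention period."""
        return self.created + timedelta(days=365 * self.retention_years)

# Example values are hypothetical.
rec = ArchiveRecord("results/run42.dat", date(2008, 7, 2),
                    "shared resource center", "CFD study",
                    "in-house solver", 5)
print(rec.expires())  # a date roughly five years after creation
```

A hierarchical storage system could use the expiration date to migrate aging files to cheaper media and flag candidates for deletion, rather than leaving every retention decision to ad hoc review.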
As greater computing capability creates ever-increasing amounts of data, the need will also grow for new strategies and information life cycle management tools to assist our customers in managing their data.

Cray Henry is director of the Defense Department's High Performance Computing Modernization Program.