Larry Davis | DOD's lessons in managing the computational workload
- By Larry Davis
- Apr 11, 2008
The Defense Department's High Performance Computing Modernization Program, like many information technology programs, constantly wrestles with the best way to allocate limited resources.
This year, HPCMP will provide hundreds of millions of computational hours to DOD's research, development, testing and engineering communities.
But even at this level, the program can meet only about 40 percent of users' requirements.
As DOD budgets face increasing strains, balancing demand for access to the supercomputers against the amount of work that can be accomplished with fixed resources is a constant challenge.
HPCMP has developed a unique method of deciding what projects will have access to which resources by bringing all the stakeholders to the table in a process that has evolved during the life of the program.
The lessons of that approach might be instructive for other IT managers who face a similar oversupply of computing demands and undersupply of resources.
HPCMP began in 1993 to provide supercomputing capabilities for DOD's laboratory and test communities.
In 1997, the program established a set of Challenge Projects designed to encourage "big science and engineering" by making large allocations available.
Approximately 30 percent of the total HPCMP computational capability is awarded to about 40 Challenge Projects each year, amounting to tens of millions of hours of supercomputing time.
The remaining 70 percent of the program's resources are available to the three services and DOD agencies for allocation to their own projects, about 600 each year, according to their individual priorities.
Bringing together the services, agencies and scientists is no easy task. To facilitate this, HPCMP developed a management tool called the Information Environment.
This interactive tool allows service and agency allocators to provide resources to their individual projects, modify those resources as required and even make trades in a kind of computational marketplace.
The Information Environment also handles account creation, tracks usage by project and compares usage with project allocation for each HPC system in the program.
Getting an allocation represents only one dimension of the solution.
Projects with a time-critical component or a need for extremely large resources require preferential service. This is accomplished with six program priorities, ordered from highest to lowest: urgent, debug, high, challenge, standard and background.
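The ordering of those six priorities can be illustrated with a short Python sketch. It is not HPCMP's scheduler: the `Priority` enum, the `dispatch_order` function and the job names are hypothetical, and the sketch assumes that jobs at the same priority are served in submission order, which the program may or may not do.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Lower value = served first, following the order given in the article.
    URGENT = 0
    DEBUG = 1
    HIGH = 2
    CHALLENGE = 3
    STANDARD = 4
    BACKGROUND = 5


def dispatch_order(jobs):
    """Return job names ordered by priority, then by submission order.

    `jobs` is a list of (name, Priority) tuples in the order submitted.
    Tie-breaking by submission order is an assumption of this sketch.
    """
    indexed = [(prio, seq, name) for seq, (name, prio) in enumerate(jobs)]
    return [name for _, _, name in sorted(indexed)]


if __name__ == "__main__":
    queue = [
        ("climate-run", Priority.STANDARD),
        ("range-test", Priority.URGENT),
        ("cfd-sweep", Priority.STANDARD),
        ("armor-model", Priority.CHALLENGE),
    ]
    print(dispatch_order(queue))
```

A real batch scheduler would also weigh job size, system availability and fairness across organizations; the sketch shows only the priority ordering itself.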
With so much going into managing such a large workload, how does HPCMP measure how it's doing? Each priority has an overall target expansion factor on each system. The expansion factor is a metric that relates how long a job spends waiting to the time it takes to execute.
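The article does not give a formula for the expansion factor. A definition commonly used in HPC scheduling is wait time plus run time, divided by run time, so a job that starts immediately scores 1.0. The Python sketch below assumes that definition; the function names and the target value in the example are hypothetical, not HPCMP figures.

```python
def expansion_factor(wait_hours, run_hours):
    """Expansion factor under the assumed definition (wait + run) / run."""
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    if wait_hours < 0:
        raise ValueError("wait_hours must not be negative")
    return (wait_hours + run_hours) / run_hours


def meets_target(jobs, target):
    """Check whether the mean expansion factor of `jobs` is within `target`.

    `jobs` is a non-empty list of (wait_hours, run_hours) tuples for one
    priority level on one system. Averaging per job is an assumption.
    """
    factors = [expansion_factor(w, r) for w, r in jobs]
    return sum(factors) / len(factors) <= target


if __name__ == "__main__":
    # A job that waits 2 hours and runs 4 hours has an expansion factor of 1.5.
    print(expansion_factor(2.0, 4.0))
    # Hypothetical target of 2.0 for one priority on one system.
    print(meets_target([(2.0, 4.0), (1.0, 1.0)], target=2.0))
```

Under this definition, a lower target means less tolerated waiting, so higher priorities would be expected to carry lower targets.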
Analysis of fiscal 2007 data indicates that the system is doing an excellent job of getting hours to users according to the level of service promised.
This system manages a large set of users with diverse requirements. It is complex, with many stakeholders having many different agendas.
But it works because the workload allocation system devised during 15 years of operation brings everyone to the table within an ordered framework that lets them, to the extent possible, make their own decisions about which projects will receive allocations and how much allocation they get.
It's not a model that will help every IT manager handle competing projects, but it does offer an approach that more of them might need to consider as too many stakeholders chase increasingly limited IT resources.
Davis is deputy director at the Defense Department's High-Performance Computing Modernization Program.