Supercomputers guarding nuclear arsenal lack disaster plan, study finds

GAO raps nuclear agency for lapses

The supercomputer systems used to assess the safety and effectiveness of the U.S. nuclear weapons stockpile do not have adequate contingency and disaster recovery plans in place to ensure availability following a disaster, according to a study by the Government Accountability Office.

The systems are housed in three Energy Department national laboratories: Los Alamos, Sandia and Lawrence Livermore. The National Nuclear Security Administration oversees them. Although all of the labs have components of contingency planning in place, the plans have gaps and have not been regularly tested, threatening the ability of the labs to perform their missions, GAO found.

“NNSA has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing,” GAO concluded. “Further, due to lack of planning and analysis by NNSA and the laboratories, the impact of a system outage is unclear.”

The report goes on to warn that “unless each of the laboratories develops and sufficiently tests comprehensive contingency and disaster recovery plans in accordance with applicable policies and guidance for their classified supercomputing systems, they face a risk of not being able to successfully recover their supercomputing assets and operations after a service disruption.”

NNSA agreed that improvements can and should be made in contingency planning and agreed to develop or improve appropriate planning, analysis and impact testing policies and programs. But NNSA also said that the time frame for reconstituting its supercomputing systems after a disaster was not as urgent as for other critical national security systems, such as national command and control, major financial systems and health and emergency services.

The United States has relied on the supercomputers of its weapons labs to simulate nuclear reactions and evaluate the condition of existing weapons since the halt of nuclear testing in 1992. The supercomputers are used to determine the effects of changes to weapons systems and to establish the level of confidence in the performance of future, untested systems in the absence of real-world tests. Los Alamos and Livermore are weapons design labs, and Sandia is an engineering lab doing research, design and development of nonnuclear warhead components.

The labs house some of the world’s most powerful supercomputers, and almost all NNSA classified supercomputers operate at the teraflop level (a trillion floating-point operations per second). Los Alamos has three systems, with speeds ranging from 51.2 to 1,280 teraflops; Livermore has six systems from 22.1 to 501.4 teraflops; and Sandia, three systems from 38 to 284 teraflops.

A Distance Computing network provides 10 gigabit/sec secure links for intra- and inter-site file transfers, and the labs have the ability to share supercomputing resources. But GAO found that the minimum capacity each lab needs to perform its missions is not known, which limits their ability to share resources effectively for recovery.

The labs have data backup processes in place but have not fully developed and tested contingency and disaster recovery plans. Only Los Alamos has conducted a business impact analysis to determine the criticality of resources and acceptable outage time frames. All three labs consider the supercomputing systems to be “low impact” in assessing the loss of their availability and do not consider them to be mission-critical.

The Federal Information Security Management Act and regulatory policy require contingency and disaster recovery planning for information systems, and GAO found this has not been fully done despite the fact that “the classified supercomputing capabilities serve as the computational surrogate to nuclear weapons testing and are central to national security.”

The study blamed a lack of accountability and lack of clear lines of responsibility for the problems.

These shortcomings existed, at least in part, because NNSA’s component organizations, including the Office of the Chief Information Officer, were unclear about their roles and responsibilities for providing oversight in the laboratories’ implementation of contingency and disaster recovery planning, the report said. “Until the agency fully implements a contingency and disaster recovery planning program for its weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements.”



About the Author

William Jackson is a freelance writer and the author of the CyberEye blog.

Reader Comments

Tue, Dec 14, 2010

Interesting GAO report. Is GAO recommending that the cost of each multi-million or billion $ system be doubled to provide for a hot, warm or cold site system so that live contingency testing can be done? I think you only need to look at recent cost cutting initiatives to see where this will fly. Imagine eliminating 200,000 fed/contractor jobs so that a $1B system can sit for yearly contingency testing.
