Supercomputers guarding nuclear arsenal lack disaster plan, study finds
GAO raps nuclear agency for lapses
The supercomputer systems used to assess the safety and effectiveness of the U.S. nuclear weapons stockpile do not have adequate contingency and disaster recovery plans in place to ensure availability following a disaster, according to a study by the Government Accountability Office.
The systems are housed in three Energy Department national laboratories: Los Alamos, Sandia and Lawrence Livermore. The National Nuclear Security Administration oversees them. Although all of the labs have components of contingency planning in place, the plans have gaps and have not been regularly tested, threatening the ability of the labs to perform their missions, GAO found.
“NNSA has not provided effective oversight to ensure that the laboratories have comprehensive and effective contingency and disaster recovery planning and testing,” GAO concluded. “Further, due to lack of planning and analysis by NNSA and the laboratories, the impact of a system outage is unclear.”
The report goes on to warn that “unless each of the laboratories develops and sufficiently tests comprehensive contingency and disaster recovery plans in accordance with applicable policies and guidance for their classified supercomputing systems, they face a risk of not being able to successfully recover their supercomputing assets and operations after a service disruption.”
NNSA agreed that improvements can and should be made in contingency planning and agreed to develop or improve appropriate planning, analysis and impact testing policies and programs. But NNSA also said that the time frame for reconstituting its supercomputing systems after a disaster was not as urgent as for other critical national security systems, such as national command and control, major financial systems and health and emergency services.
The United States has relied on the supercomputers of its weapons labs to simulate nuclear reactions and evaluate the condition of existing weapons since the halt of nuclear testing in 1992. In the absence of real-world tests, supercomputers are used to determine the effects of changes to weapons systems and the level of confidence in the performance of future untested systems. Los Alamos and Livermore are weapons design labs, and Sandia is an engineering lab doing research, design and development of nonnuclear warhead components.
The labs house some of the world’s most powerful supercomputers, and almost all NNSA classified supercomputers operate at the teraflop level (a trillion floating-point operations per second). Los Alamos has three systems, with speeds ranging from 51.2 to 1,280 teraflops; Livermore has six systems from 22.1 to 501.4 teraflops; and Sandia, three systems from 38 to 284 teraflops.
A Distance Computing network provides secure 10 gigabit/sec links for intra- and inter-site file transfers, and the labs have the ability to share supercomputing resources. But GAO found that the minimum capacity needed for each lab’s missions is not known, which limits the labs’ ability to share resources effectively for recovery.
The labs have data backup processes in place but have not fully developed and tested contingency and disaster recovery plans. Only Los Alamos has conducted a business impact analysis to determine the criticality of resources and acceptable outage time frames. All three labs consider the supercomputing systems to be “low impact” in assessing the loss of their availability and do not consider them to be mission-critical.
The Federal Information Security Management Act and regulatory policy require contingency and disaster recovery planning for information systems, and GAO found this has not been fully done despite the fact that “the classified supercomputing capabilities serve as the computational surrogate to nuclear weapons testing and are central to national security.”
The study blamed a lack of accountability and lack of clear lines of responsibility for the problems.
These shortcomings existed, at least in part, because NNSA’s component organizations, including the Office of the Chief Information Officer, were unclear about their roles and responsibilities for providing oversight in the laboratories’ implementation of contingency and disaster recovery planning, the report said. “Until the agency fully implements a contingency and disaster recovery planning program for its weapons laboratories, it has limited assurance that vital information can be recovered and made available to meet national security priorities and requirements.”