The National Nuclear Security Administration is pioneering a strategy for quickly acquiring as much as 3 petaflops of computing capacity to support the operational processing needed to keep its high-performance supercomputers busy.
The National Nuclear Security Administration is acquiring as much as 3 petaflops of high-performance computing capacity to be distributed at three Energy Department labs for handling the day-to-day calculations needed to support their world-class supercomputers.
Through the Tri-Laboratory Linux Capacity Cluster 2 program (TLCC2), NNSA will buy as many as 20,000 computing nodes during the next two years, assembled into clusters of scalable units. The agency is using its purchasing power to make high-performance computing capacity a commodity and to reduce the time required to build and field the clusters, while also improving interoperability.
“We’re trying to have a common hardware and software environment,” said Thuc Hoang, head of NNSA’s computing systems and software environment.
A common environment would help NNSA meet what it calls a crushing load of computations needed to ensure the safety and reliability of the nation’s nuclear arsenal and provide redundancy when operations at one lab are interrupted, as when the Los Alamos National Laboratory was threatened recently by wildfires.
TLCC2 is a $39 million contract with options for as much as $89 million. Under it, Appro International will supply its Xtreme-X Supercomputer Linux clusters with Intel Xeon Sandy Bridge processors, along with an InfiniBand interconnect solution from QLogic that includes its 12000 series switches and 7300 series adapters running at quad data rate speeds of 40 gigabits/sec.
As many as 20,000 nodes will be installed at the Los Alamos, Sandia and Lawrence Livermore national laboratories. How many each lab will use has not yet been determined.
“The labs are still trying to figure out what the optimal number of scalable units is,” said Bob Meisner, NNSA’s director of advanced simulations and computing.
NNSA uses supercomputers to run the simulations that have replaced live testing of nuclear weapons, and its requirements have helped to push the limits of supercomputing performance. It divides its requirements into capability computing, which uses most of the capabilities of a world-class supercomputer to run a full simulation, and capacity computing, which is for running smaller simulations or other problems.
“Before you can run a big calculation, you need to run a lot of smaller calculations,” Meisner said. “Our day-to-day work demands a lot of the small” computations. “But our big problems demand the big computers.”
NNSA defines its capability computers as those ranked among the top 10 on the Top500.org list of supercomputers, all of which run at speeds of 1 petaflop (one thousand trillion floating point operations per second) or faster. Although the United States has been pushed out of the top spots on the latest list by computers in Japan and China, it still leads the world in overall petaflop computing with five systems.
But despite that performance, “the need for capacity is now so great that it is increasingly difficult to allocate the computer resources required by larger capability runs,” NNSA said in its statement of work for the TLCC2 contract. NNSA policy now is to reserve its most powerful supercomputer resources for full-scale simulations that require from one-half to three-quarters of a computer’s capability, and it uses them for no more than two simulations at a time. This means that additional capacity must be found for the smaller calculations.
To free full-scale supercomputer time, NNSA began investing in large-scale capacity computing for its three labs in fiscal 2007 with the initial Tri-Lab Linux Capacity Cluster contract. One goal of the program was to create a mass market for commercial high-performance computing platforms through volume purchases that would help lower costs.
By creating a common platform based on standard elements, “we recognized we could get good value by buying things at the lower end, and we could get some redundancy across the labs,” Meisner said.
The scalable cluster concept worked well enough in the first contract that as much as 3 petaflops of computing power is being acquired in the follow-on.
Each scalable unit will consist of 154 nodes, capable of an overall performance of about 50 teraflops (or 50 trillion floating point operations per second), with every 18 nodes connected to a 36-port edge switch, which in turn is connected to core switches. The units are interconnected through the core to create a fabric operating at up to 40 gigabits/sec. The concept allows the vendors to build, test and deliver scalable units that can be put into operation quickly.
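The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough back-of-envelope illustration, not NNSA's or the vendors' sizing method. The constants come from the figures quoted above; the function names are hypothetical, and the idea that the 18 ports left over on each edge switch serve as uplinks to the core is an assumption, not something the contract documents are quoted as saying.

```python
import math

# Figures quoted in the article
NODES_PER_UNIT = 154         # nodes in one scalable unit
TFLOPS_PER_UNIT = 50.0       # approximate performance of one unit
NODES_PER_EDGE_SWITCH = 18   # nodes attached to each edge switch
EDGE_SWITCH_PORTS = 36       # ports on each edge switch
TARGET_PFLOPS = 3.0          # upper bound on capacity being acquired


def edge_switches_per_unit(nodes=NODES_PER_UNIT, per_switch=NODES_PER_EDGE_SWITCH):
    """Edge switches needed to attach every node in one scalable unit."""
    return math.ceil(nodes / per_switch)


def spare_ports_per_switch(ports=EDGE_SWITCH_PORTS, per_switch=NODES_PER_EDGE_SWITCH):
    """Ports left after attaching nodes.

    ASSUMPTION: these would be available as uplinks to the core switches.
    """
    return ports - per_switch


def units_for_target(target_pflops=TARGET_PFLOPS, tflops_per_unit=TFLOPS_PER_UNIT):
    """Scalable units needed to reach a target, at 1,000 teraflops per petaflop."""
    return math.ceil(target_pflops * 1000 / tflops_per_unit)


def tflops_per_node(tflops_per_unit=TFLOPS_PER_UNIT, nodes=NODES_PER_UNIT):
    """Average performance of a single node within a unit."""
    return tflops_per_unit / nodes


if __name__ == "__main__":
    print("edge switches per unit:", edge_switches_per_unit())
    print("spare ports per edge switch:", spare_ports_per_switch())
    print("units for 3 petaflops:", units_for_target())
    print("teraflops per node: %.3f" % tflops_per_node())
```

At 50 teraflops per unit, 3 petaflops works out to 60 scalable units, and a 154-node unit needs nine edge switches at 18 nodes each.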
“It limits the amount of integration time,” said Joseph Yaworski, QLogic’s director of high-performance computing product and solution marketing.
The ability of the interconnect fabric to scale while maintaining high performance will determine how many units can be effectively deployed in a single system. Larger systems can be more efficient for handling multiple complex capacity computing problems, “but the larger you build, the interconnect starts stretching,” Meisner said.
The InfiniBand architecture used in the QLogic TrueScale interconnect under the original TLCC contract scaled efficiently in the deployment of more than 4,000 nodes at the Lawrence Livermore lab, with latency of 1 to 2 microseconds.
The computing environment on capability supercomputers typically is specialized and proprietary, Hoang said. Capacity computers have a different set of user tools and software, in this case based on Linux. “That’s why you have different investments for the capability and capacity systems,” she said.
Using the scalable unit model, it takes only a few weeks to get a capacity system up and running, compared with the months needed to bring a capability computer up to speed. This speed, along with a common operating environment being established through the tri-labs program, will make it easier to shift work among the labs and allow them to back each other up in the event of disruptions.