NASA Ames builds one of the world's fastest supercomputers on the fly, easing a workload traffic jam and introducing real-time supercomputing
TIGHT FIT: Working under a tight deadline, NASA's Columbia supercomputer team retrofitted its facility and made last-minute changes to accommodate the 10,240-processor computer.
The job? Assemble the world's most powerful computer. Deadline: 120 days.
'A miracle a day' was the slogan for the team of NASA employees and supporting contractors who completed a task deemed essential to NASA's continued well-being.
NASA's Advanced Supercomputing Division, located at the Ames Research Center, Moffett Field, Calif., would need those daily blessings. Led by Walt Brooks, the division set out early last year to build a supercomputer on a schedule so tight that one serious mistake would blow the compacted deadline. The task involved everything from engineering the system and revamping the facilities to installing and testing the equipment.
'Everyone kept saying it was next to impossible,' said Sid Mair, an SGI branch manager involved in the project. 'A typical program officer would have spent three years to implement a system of that scale.'
Still reeling from the tragic loss of the shuttle Columbia, which had disintegrated during re-entry on Feb. 1, 2003, and needing to replenish supplies at the International Space Station, NASA wanted to get the next shuttle, Discovery, aloft quickly.
Before that launch, though, the agency needed to know why the tiles on Columbia failed, and that required more computer modeling than NASA could produce.
'The complexity of the problems were such that they would take the [NASA's existing high-performance] computers essentially out of commission for the rest of the agency for three months,' Brooks said.
In fact, the Columbia job was but one of many queuing up at the Advanced Supercomputing Division. The center is the home of supercomputing for NASA, where scientists take the computing jobs that have overwhelmed their own facilities. It had been a few years since the center had a serious upgrade of hardware, though, and the problems were becoming ever more thorny.
'We were in a state where we were limiting the science and research that was going on by the amount of computing we had. We just plain did not have enough high-end computing power for NASA,' said William Thigpen, who was the project lead for the new supercomputer.
For some time, Brooks had been mapping out a next-generation system. After Columbia's tragic demise, those plans shifted into high gear. Brooks galvanized others into action. His team quickly got the procurement approvals: from Ames, from NASA itself, from the Office of Management and Budget and from Congress.
As they were lining up funding, they also hammered out the final details on the technology. They liked the promising performance of an existing 512-processor SGI Altix Origin system that SGI was still installing at the time, called Kalpana. 'The scientists we put on it were really getting good results,' Thigpen said.
Team building
Could SGI expand that system 20-fold? The team brought in SGI and other commercial partners, such as Intel Corp., which would supply the 64-bit Itanium microprocessors for the job.
Brooks also gathered a dedicated crew of about 50 employees and contractors, as well as countless other people from other Ames divisions. 'Walt is a tremendous energy within the organization,' Mair said. 'He has a unique ability to project a vision and to get everyone to work harder on that project than they ever did before.'
They would need that motivation, because many obstacles loomed.
For one, SGI wasn't planning to roll out the next generation of these systems until the following year. SGI, no stranger to building large machines, nevertheless hadn't filled an order this large. The machine would run 10,240 processors, which would be divided into 20 smaller units.
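The figures quoted so far are consistent with one another, as a quick check shows. The sketch below uses only numbers from the article (10,240 processors, 20 units, the 512-processor Kalpana system); the variable names are illustrative, and the even split across units is an assumption the article implies but does not state.

```python
# Figures taken from the article; names are illustrative only.
TOTAL_PROCESSORS = 10_240   # size of the full Columbia system
UNITS = 20                  # number of smaller units it was divided into
KALPANA_PROCESSORS = 512    # size of the earlier Kalpana system

# Assumption: processors are spread evenly across the units.
per_unit, remainder = divmod(TOTAL_PROCESSORS, UNITS)

# How many Kalpana-sized systems make up the full machine.
scale_factor = TOTAL_PROCESSORS / KALPANA_PROCESSORS

print(f"{per_unit} processors per unit, {remainder} left over")
print(f"{scale_factor:.0f}x the size of Kalpana")
```

Under that assumption, each unit is the same size as Kalpana, which matches the 20-fold expansion the team asked SGI to attempt.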
To accommodate the behemoth, the facility had to be redesigned as well: New power units needed to be installed. New cable needed to be run underneath the raised floor. Studies needed to be done to determine if the cooling system could handle the additional heat of the new hardware.
The first SGI units arrived on the loading dock by the end of June 2004. As each node arrived, it was networked into the others and put to work. Twelve-hour days were common for the NASA team. The systems manager even brought in a cot to spend nights.
'They worked around the clock, seven days a week. They put time in beyond the call of the duty,' said David Barkai, the high-performance computing chief scientist for Intel. As a result, the computer was assembled in record time. By October, the whole assemblage was up and running.
Columbia clocked in at almost 52 trillion floating-point operations per second. Measured against the twice-annual Top 500 List of fastest supercomputers, it was the world's fastest computer (although it was pushed to second place shortly afterward). Brooks and company broke Japan's three-year run at the top of the list. Japan's reigning computer, the Earth Simulator, had caused Congress no little worry about the United States losing its competitive edge in high-performance computing.
More important, Columbia had sufficient muscle to crunch the calculations engineers needed to understand what went wrong with its namesake shuttle. The test involved simulating the 2,800-degree temperatures the tiles would undergo when the craft re-entered the Earth's atmosphere. The new system, at the time only partially built, ran 100 of these simulations within 24 hours. That many simulations used to take three months.
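The turnaround figures give a rough sense of the speedup. The sketch below is back-of-the-envelope arithmetic, not a benchmark: it assumes 'three months' means about 90 days and that the new run took the full 24 hours, and the function name is hypothetical.

```python
def turnaround_speedup(old_days: float, new_days: float) -> float:
    """Ratio of old to new wall-clock time for the same batch of work."""
    return old_days / new_days

# Assumption: 'three months' is taken as roughly 90 days.
OLD_DAYS = 90.0
# Assumption: the new run used the full 24 hours, i.e. one day.
NEW_DAYS = 1.0
SIMULATIONS = 100  # batch size quoted in the article

speedup = turnaround_speedup(OLD_DAYS, NEW_DAYS)
old_rate = SIMULATIONS / OLD_DAYS   # simulations per day, old systems
new_rate = SIMULATIONS / NEW_DAYS   # simulations per day, partial Columbia

print(f"about {speedup:.0f}x faster turnaround")
print(f"{old_rate:.1f} vs {new_rate:.0f} simulations per day")
```

Because the system was only partially built at the time, this estimate describes the partial machine, not the finished 10,240-processor system.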
'They would pose the question and sometime the next day we'd give them the answer,' Brooks said.
In fact, Columbia has brought NASA into a new era of real-time supercomputing. When the crew of the Discovery had to fix some tiles of the craft while still in orbit, NASA engineers were able to use Columbia to make some vital calculations of what needed to be done while the shuttle mission was still going on.
This was a first for NASA, Brooks said. Previously, the agency didn't have the capability to do such on-the-spot calculations. The Columbia also allowed the National Oceanic and Atmospheric Administration to make some close-range estimates of the Katrina and Rita hurricanes as they rolled in.
'I think we have one of the best systems in the world, and a very happy user community that is able to do things that they could only dream of doing before,' Thigpen said.