Which test is best?
- By Joab Jackson
- Nov 16, 2004
Popular benchmarks can be misleading; experts say the best way to evaluate systems is still to set your own metrics
NASA's Columbia supercomputer is the world's fastest, according to the Linpack benchmark. But 'we really don't care about Linpack that much,' says NASA's Walt Brooks. 'We're much more interested in reliability, usability, productivity and what really happens with the system.'
Courtesy of SGI
Last month, NASA announced it was running what may well be the world's fastest computer. The Columbia, named in honor of the crew of the space shuttle Columbia, consists of 20 interconnected SGI 512-processor systems. NASA said the computer reached a sustained peak rate of 42.7 trillion floating-point operations per second.
The announcement followed a declaration last month from IBM Corp. that it had the world's fastest supercomputer, one it plans to deliver to Lawrence Livermore National Laboratory next year. IBM claimed that its system achieved a sustained performance of 36.01 TFLOPS in the IBM laboratory.
IBM's number had edged out the performance of the Earth Simulator in Yokohama, Japan, which had held the title of fastest computer for the last few years. That unit, built by NEC Corp., can run at 35.86 TFLOPS, according to the simulator manager's submission to the most recent Top500.Org list of the 500 fastest supercomputers.
All these organizations used the same benchmark to test their computers, one called Linpack. One of the dirty little secrets of the government high-performance computing world, though, is that Linpack doesn't measure anything useful, other than how quickly a supercomputer can execute the Linpack test.
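All of those TFLOPS figures come from the same kind of measurement: timing how fast a machine factors and solves a large dense system of linear equations. A minimal sketch of such a measurement in Python — the function name and matrix size are illustrative assumptions, and a real Linpack run uses a far larger, carefully tuned problem — looks like this (NumPy's solver calls LAPACK routines internally):

```python
# A minimal sketch of a Linpack-style measurement: time a dense
# linear solve and convert the conventional operation count into a rate.
# The 2/3*n^3 + 2*n^2 figure is the standard Linpack flop count for
# LU factorization plus the two triangular solves.
import time
import numpy as np

def linpack_style_gflops(n=500, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    start = time.perf_counter()
    x = np.linalg.solve(a, b)   # LAPACK under the hood
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    # Sanity check: the solution should satisfy Ax = b.
    assert np.allclose(a @ x, b)
    return flops / elapsed / 1e9  # billions of operations per second

print(f"{linpack_style_gflops():.2f} GFLOPS")
```

The rate this reports depends almost entirely on how fast the machine runs this one dense-algebra kernel — which is precisely the critics' complaint.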
Linpack itself is a collection of subroutines that solve linear equations. In the scientific community, it has long been discarded in favor of the Linear Algebra Package, or LAPACK.

Grading the tests
The focus on Linpack illustrates one of the problems of evaluating technology: Widely accepted performance measures, by which products or systems are rated, may not really apply to what you need. At the PC level, CPU clock speed has been the most common gauge of performance, but other factors, such as CPU architecture, could actually be more important. At the supercomputer level, benchmarks have been the common denominator, but some experts doubt their validity.
'We really don't care about Linpack that much,' said Walt Brooks, chief of NASA's Advanced Supercomputing Division. 'We'll measure it because everyone wants us to measure it, but we're much more interested in reliability, usability, productivity and what really happens with the system.' Columbia, for instance, will do such large jobs as hurricane forecasting, supernova simulation and next-generation aircraft designs.
Others in government echo his sentiment.
'[Linpack] is something you do, but it really does not tell you much,' said David Morton, technical director for the Maui High Performance Computing Center in Hawaii. The center recently powered up a 256-processor Linux cluster to use in simulating global combat operations, so it tested the system with Linpack, achieving a 600-GFLOPS rating without much tuning.
Muddying the problem somewhat is Linpack's use as the major metric for the Top 500. Twice a year a coalition of computer scientists compiles a list of the world's 500 fastest computers. The list relies on the organizations themselves to benchmark their systems using Linpack and submit the results. Critics have derided the list as responsible for creating an unnecessary global arms race of supercomputers, particularly between the United States and Japan. (Indeed, at press time, the Energy Department had just asserted that one of its own supercomputers was now the world's fastest. BlueGene/L had been upgraded to run at 70.72 TFLOPS.)
'The Top 500 activity has become a marketing yardstick by which you go out and beat your chest,' said Jeff Greenwald, SGI's senior director of server product management and marketing, adding that SGI has numerous systems on the list.
For system designers, benchmarks can provide a way to judge how many resources will be needed. But choosing the correct benchmarks is important if designers want to do more than place high on a performance list.
'Benchmarks should be as realistic as possible and represent the intended environment as closely as possible,' said Arlie Barber, senior engineer for the Army Information Systems Engineering Command's Technology Integration Center, at Fort Huachuca, Ariz.
Linpack in too many cases fails that test, though. As a benchmark, it can give an overall impression of a system's performance, but it can also be misleading, according to Christopher Jehn, vice president of government programs for Cray Inc. of Seattle.
'Linpack has consistency across different platforms, but it does not address how well your particular machine may do one task,' Jehn said.
Another example of a benchmark that has come under scrutiny is processor clock speed. For the past few years, the number of instruction cycles a microprocessor executes in a second has been considered a gauge of a computer's speed. A processor that executes 1 billion cycles per second, or 1 GHz, is generally considered faster than one that executes only 200 million, or 200 MHz.
When Advanced Micro Devices Inc. of Sunnyvale, Calif., introduced its Athlon series of chips and started to make inroads into the desktop computer microprocessor market, it called into question the validity of using clock speed as a metric. It claimed its processors could execute programs as quickly as Intel Corp. processors with higher clock speeds, thanks to a different processor architecture.
It is a plausible claim.
'Though clock speed does play a role in the performance of a processor, it can be deceiving as to the overall performance of a system,' Barber said. 'The size of the instruction set as well as the ability to feed information to the processor are also important.'
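The point can be reduced to simple arithmetic: delivered performance is roughly clock rate multiplied by the instructions a chip completes per cycle (IPC). The IPC values below are illustrative assumptions, not measured figures for any real AMD or Intel part:

```python
# Rough model: useful work per second = clock rate * instructions
# completed per cycle (IPC). The IPC numbers here are hypothetical,
# chosen only to show that clock speed alone can mislead.
def instructions_per_second(clock_hz: float, ipc: float) -> float:
    return clock_hz * ipc

chip_a = instructions_per_second(2.0e9, 0.9)   # higher clock, lower IPC
chip_b = instructions_per_second(1.4e9, 1.5)   # lower clock, higher IPC

# The chip with the slower clock completes more work per second.
assert chip_b > chip_a
print(f"chip A: {chip_a:.2e} instr/s, chip B: {chip_b:.2e} instr/s")
```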
One way to measure performance is to mimic real-world workloads, experts said.
'We find our customers are interested in real workload performance: what they will get out of the system. What they will do is create a set of benchmark codes that they believe are representative of their workflow. Then we are asked either to benchmark and furnish the results, or they benchmark themselves on in-house systems,' said Tony Celeste, SGI Federal's national director of defense business.
For performance evaluations of prototype systems, the Army TIC uses Remote Terminal Emulation, Barber said. RTE mimics user loads on an application across a network, emulating up to 10,000 simultaneous users and allowing the TIC to vary the amount of data input and other human factors to mimic real-life use.
'As I always say, 'What is the question you are trying to answer?'' Barber said. 'If we want to know how many users a given system can handle, the best way is to use the RTE and generate the intended workload, not run a piece of code that stresses the CPU or memory or disk.'
But how much use a system will get can be tricky to calculate.

Powering up
'With supercomputing, it can be problematic to determine what your requirements are,' NASA's Brooks said. If you ask program managers how much computing power they need, they may offer an unrealistically high number. 'So I go with what we've had to use when we tackle the toughest problems,' he said.
For instance, Brooks reviewed how much computing power NASA's Ames Research Center in Moffett Field, Calif., needed to model conditions surrounding the shuttle Columbia's fatal end. The system modeled how loose insulation foam had damaged the craft, calculations that involved simulating complex fluid dynamic flows around the shuttle. The team plotted how much computational power was used, and used those metrics to help determine how much power the supercomputer Columbia would need.
Another factor not usually considered is cost. The fastest equipment can be the most expensive. A program manager can ask whether a unit that ranks lower on a benchmark but costs far less might do the job just as well.
This has been a big selling point with commodity technology-based clusters. Linux-based clusters may not offer the performance or efficiency of a dedicated supercomputer. But in many cases the components are so much cheaper than traditional supercomputers that a program manager can just build a bigger, yet cheaper, cluster to do the same job, said Steve Salkeld, director of services for grid software provider Platform Computing Inc. of Basingstoke, U.K.
'The only metric that really matters to us is how much work we can get done for the money,' said Jay Boisseau, director of the Texas Advanced Computing Center at the University of Texas. The center looks at ways to improve applications through advanced computing techniques. 'We do a lot of benchmarks and then negotiate with vendors to see where the price will be at that time. What we purchase will be the price-performance winner at that time.'
Storage network administrators are also balancing performance against cost, thanks to the emergence of the iSCSI interface. The Storage Networking Industry Association has been pitching iSCSI, which can be used to make IP-based storage area networks, as a less expensive alternative to Fibre Channel SANs.
Can IP SANs transfer data as quickly as the blazing Fibre Channel SANs, with their promised 2-Gbps throughput? No, but they come close, offering about 80 percent of that speed, according to David Dale, chairman of SNIA's Internet Protocol Storage working group. And, according to many vendor estimates, IP SANs can cost about 20 percent as much as their Fibre Channel brethren, thanks to the use of commodity equipment.
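Those two percentages translate directly into throughput per dollar. A quick back-of-the-envelope calculation, using the article's rough figures — the dollar amount is a hypothetical placeholder for scale, not a quoted price:

```python
# Compare throughput per dollar for Fibre Channel vs. IP SANs, using
# the rough figures above: IP SANs deliver ~80 percent of Fibre
# Channel's 2-Gbps throughput at ~20 percent of the cost.
FC_THROUGHPUT_GBPS = 2.0
FC_COST_DOLLARS = 100_000.0  # hypothetical baseline cost, for scale only

ip_throughput_gbps = 0.80 * FC_THROUGHPUT_GBPS  # about 1.6 Gbps
ip_cost_dollars = 0.20 * FC_COST_DOLLARS

fc_per_dollar = FC_THROUGHPUT_GBPS / FC_COST_DOLLARS
ip_per_dollar = ip_throughput_gbps / ip_cost_dollars

# 0.8 / 0.2 = 4: the IP SAN moves four times the data per dollar spent.
print(f"IP SAN throughput per dollar: {ip_per_dollar / fc_per_dollar:.1f}x Fibre Channel")
```

The baseline price cancels out of the ratio, so the fourfold advantage holds for any Fibre Channel system cost — which is why the tradeoff hinges on whether 80 percent of the speed is enough.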
So for the system administrator, throughput should be weighed against cost. How fast do people need the data on a SAN? How often do they need it? The questions could save a lot of money.
As for Linpack, the Defense Advanced Research Projects Agency has funded the development of what it hopes will be a more comprehensive benchmark, called the High Performance Computing Challenge Benchmark. The HPC Challenge is a package of seven different benchmarks, measuring attributes such as the speed of memory updates, performance in solving mathematical equations and other factors.
Jehn said that the HPC Challenge Benchmark could offer a more accurate measurement of a computer's performance.