When are supercomputers really super?
Interconnect speed is key to high-performance computing
LOW LATENCY: Both SGI, pictured, and Cray high-performance computers use proprietary interconnects.
When the Energy Department's National Energy Research Scientific Computing Center went shopping for a new mid-sized supercomputer, processor speed was a critical factor, as expected. But it also carefully scrutinized the interconnect speeds of the proposed systems. That's because for this supercomputing center, the speed of the conduit between processors was as important as the speed of the CPUs themselves.
'The interconnect is important because a lot of the applications rely on low latency and high bandwidth,' said Bill Kramer, NERSC general manager. 'We run highly parallel applications. One application may make use of 50 or 100 individual nodes.'
Because high-performance computing applications are increasingly spread out over so many processors, how fast they perform comes down in large part to how fast individual nodes can communicate with one another.
Not surprisingly, HPC interconnect makers are jockeying to show who has the speediest, most cost-effective technology for connecting nodes. The field is awash in different adapters, both proprietary and standards-based, and comparing them can be a challenge for even the most adept system architect. Confusing matters further is the industry practice of tweaking performance results so that speed figures look more competitive but are less indicative of what users may actually see. In short, finding the fastest HPC technology takes more than scanning the list of the Top 500 supercomputers (www.top500.org).
'You definitely have to understand what is being measured,' said Greg Thorson, principal engineer for platform development at SGI of Mountain View, Calif.
Shortly after PathScale Inc. of Mountain View, Calif., released its new InfiniPath networking adapter, it dispatched its distinguished scientist Greg Lindahl to give a presentation at a Beowulf Users Group meeting in Washington.
PathScale was not the first HPC interconnect vendor to court the group, which comprises a small but technically savvy collection of HPC system managers, including several from government agencies. SGI, Foundry Networks Inc. of San Jose, Calif., InfiniCon Systems Inc. of King of Prussia, Pa., and Voltaire Inc. of Billerica, Mass., also have presented to the Beowulf group. (Beowulf is a Linux-based clustering platform, and an Old English poem.)
In addition to touting the benefits of his company's own interconnect, which is based on the InfiniBand standard, Lindahl warned of the potential pitfalls of comparing performance claims.
When vendors market interconnects, Lindahl said, they typically use a pair of metrics: interconnect bandwidth and interconnect latency. Lindahl calls these 'hero numbers.' Bandwidth measures how much data a network can pass in a given time; latency is the time an interconnect takes to relay a packet of data.
'Things are changing in a way that makes latency more important,' Lindahl said, estimating that 30 percent of PathScale's customers and potential customers have applications that require low latency.
'Some applications only need bandwidth, but some applications send out a request for [a] piece of information, and until it comes back, they can't really do anything,' agreed Thorson.
What kind of latency do today's interconnects achieve? When PathScale started shipping its new interconnect last summer, it claimed a latency of 1.32 microseconds. SGI says its NUMALink has less than 1 microsecond of latency. According to the InfiniBand Trade Association, InfiniBand operates at 7 microseconds, while Gigabit Ethernet hits 66 microseconds.
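To see why both hero numbers matter, it can help to treat the time to deliver one message as latency plus message size divided by bandwidth. The sketch below is a simplified model, not a vendor formula. It uses the 66-microsecond Gigabit Ethernet latency quoted above and assumes the nominal 1 gigabit-per-second line rate as bandwidth, ignoring protocol overhead. The function names are illustrative.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated one-way time for a single message under a simple linear model."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_share(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the estimated transfer time that is spent on latency."""
    return latency_s / transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)


# Latency is the Gigabit Ethernet figure quoted in the article.
# Bandwidth is the nominal 1 Gbit/s line rate (an assumption that
# ignores protocol overhead).
GIGE_LATENCY_S = 66e-6
GIGE_BANDWIDTH = 125e6  # bytes per second

if __name__ == "__main__":
    for size in (8, 1_000, 1_000_000):
        share = latency_share(size, GIGE_LATENCY_S, GIGE_BANDWIDTH)
        print(f"{size:>9} bytes: {share:.1%} of the time is latency")
```

Under these assumptions, latency accounts for nearly all of the time for an 8-byte message and for less than 1 percent of the time for a 1-megabyte message. That fits Thorson's point that some applications need only bandwidth while others wait on small requests.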
The rule of thumb has been that proprietary interconnects, from the likes of Cray and SGI, have the lowest latencies. The downside to these interconnects is that they work only with Cray and SGI systems, respectively.
Interconnects built specifically for HPC environments, such as InfiniBand, Myrinet and Quadrics, also offer fast performance, but they still cost more than such slower, more widely deployed technologies as Gigabit Ethernet and the emerging 10-gigabit Ethernet.
The question that system architects must answer is how much weight they should place on performance figures when building an HPC system.
When measuring latency in fractions of milliseconds, small factors become very large. So it's not surprising that vendors look for ways to boost tested speeds by changing configurations.
'Latency numbers [can] vary dramatically depending on what you're measuring,' said Donald Becker, chief scientist at Scyld Software of Annapolis, Md. 'It's all how you can lie without saying anything that is demonstrably false.'
Apples and oranges
Latency numbers often are difficult to compare because they hide a lot of variation from interconnect to interconnect. One such variation is the type of notification the interconnect uses to alert the system that a data packet has arrived, Becker said. In an interrupt, or event-based mechanism, the adapter alerts the operating system when a new packet has arrived, an approach that is slower but uses a processor efficiently. Typically, this approach is used in 10-G Ethernet interconnects.
Another approach is called polling, in which the processor continuously monitors the memory space to check if a data transmission has been completed. Both InfiniBand and Myrinet typically use polling mechanisms.
'If your CPU is constantly polling, then you will get really low latency, but you're constantly burning up power, whereas if you're interrupt-driven, the CPU is only turned on when it receives notification that it has work to do,' Becker said.
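The difference between the two notification styles can be imitated in ordinary software. The following is a rough analogy using Python threads, not real adapter or driver code. The mailbox list, the sender function and the 10-millisecond delay are all made up for illustration.

```python
import threading
import time


def receive_by_polling(mailbox):
    """Spin until a message appears; return (message, number_of_checks)."""
    checks = 0
    while True:
        checks += 1
        if mailbox:  # the receiver keeps looking at memory, never resting
            return mailbox[0], checks


def receive_by_event(mailbox, arrived):
    """Sleep until the sender signals arrival, then read the message."""
    arrived.wait()  # the receiver is idle until it is notified
    return mailbox[0]


def send_later(mailbox, arrived, delay_s=0.01):
    """Hypothetical sender: deliver one message after a short delay."""
    time.sleep(delay_s)
    mailbox.append("payload")
    arrived.set()


def demo():
    """Receive one message each way; report the message and the check count."""
    results = {}
    for mode in ("polling", "event"):
        mailbox, arrived = [], threading.Event()
        sender = threading.Thread(target=send_later, args=(mailbox, arrived))
        sender.start()
        if mode == "polling":
            message, checks = receive_by_polling(mailbox)
        else:
            # The event-driven receiver looks at the mailbox once, after waking.
            message, checks = receive_by_event(mailbox, arrived), 1
        sender.join()
        results[mode] = (message, checks)
    return results


if __name__ == "__main__":
    for mode, (message, checks) in demo().items():
        print(f"{mode}: got {message!r} after {checks} mailbox check(s)")
```

The polling receiver typically checks the mailbox many thousands of times before the message lands, which is the software equivalent of the burned power Becker describes. The event-driven receiver does no work until it is woken.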
Another source of variation is the size of the packets in a test environment. Most use zero-length packets, or packets with no payload. Lindahl insists that performance numbers based on these tiny packets do not scale evenly and may not convey true performance. Likewise, the number of nodes in a test environment can also be misleading. Most tests involve messages sent back and forth, ping-pong fashion, between two linked machines. Real-world systems have far more nodes.
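A two-machine ping-pong test is simple enough to sketch in a few lines. The version below is only an illustration of the method. It bounces messages between two threads over a local socket pair rather than between two machines over an interconnect, so its numbers say nothing about any real product. The function names and defaults are assumptions, and the smallest message is 1 byte because a stream socket cannot carry an empty one.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        data = sock.recv(size - len(buf))
        if not data:
            raise ConnectionError("peer closed the connection")
        buf.extend(data)
    return bytes(buf)


def echo(sock, rounds, size):
    """The 'pong' side: bounce each message straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def ping_pong_latency(size=1, rounds=1000):
    """Return estimated one-way latency in seconds (half the mean round trip)."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo, args=(b, rounds, size))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)   # ping
        recv_exact(a, size)  # wait for the pong
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    return elapsed / rounds / 2


if __name__ == "__main__":
    for size in (1, 1024, 65536):
        print(f"{size:>6} bytes: {ping_pong_latency(size) * 1e6:.1f} microseconds")
```

Running the test at several message sizes, rather than only the smallest, is one way to see whether a tiny-packet figure holds up as payloads grow.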
Another assumption often implicit in performance tests is that neither the computer nor the network has other duties. In fact, a heavily loaded bus (the connection between the HPC card and the CPU) may cause delay as messages wait for right-of-way. Clearly, experts say, there needs to be a better way of measuring HPC performance.
DARPA to the rescue
One objective measure of performance, Lindahl and others have suggested, is the HPC Challenge Benchmark. Developed with Defense Advanced Research Projects Agency funding, the HPC Challenge Benchmark is a set of seven tests that measure the performance of supercomputing systems.
To measure latency, the framework uses the Random Ring Benchmark, which measures both bandwidth and latency. The test involves assembling a group of nodes into a ring network topology and averaging the time it takes a message to get around the ring. Multiple tests are run with random numbers of nodes, which minimizes the chance of the benchmark being gamed or equipment tweaked to one particular configuration. In this way, the test can provide a vendor-neutral way of comparing interconnects.
'If you want to quote a latency, you should publish the Random Ring Latency,' Lindahl said. (See sidebar for a rundown of recent HPC Challenge latency scores.)
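The idea behind a ring test can be shown with a toy simulation. This is not the HPC Challenge code. The cost model is hypothetical: a hop between two nodes on the same switch costs 1 microsecond and a hop across switches costs 3 microseconds. Each trial picks a random ring size and, as an added assumption, a random node order.

```python
import random


def hop_latency(src, dst, nodes_per_switch=4):
    """Hypothetical cost model: 1 us within a switch, 3 us across switches."""
    same_switch = src // nodes_per_switch == dst // nodes_per_switch
    return 1e-6 if same_switch else 3e-6


def ring_time(ring):
    """Time for one message to travel once around the ring."""
    return sum(hop_latency(ring[i], ring[(i + 1) % len(ring)])
               for i in range(len(ring)))


def random_ring_latency(total_nodes=16, trials=100, seed=0):
    """Average per-hop latency over randomly sized, randomly ordered rings."""
    rng = random.Random(seed)
    per_hop = []
    for _ in range(trials):
        size = rng.randint(2, total_nodes)
        ring = rng.sample(range(total_nodes), size)  # random members and order
        per_hop.append(ring_time(ring) / size)
    return sum(per_hop) / len(per_hop)


if __name__ == "__main__":
    natural = ring_time(list(range(16))) / 16
    print(f"hand-picked ring: {natural * 1e6:.2f} us per hop")
    print(f"random rings:     {random_ring_latency() * 1e6:.2f} us per hop")
```

In this model a ring laid out in switch order averages 1.5 microseconds per hop, because most neighbors share a switch. Random rings cannot count on that favorable placement, which is why a randomized test is harder to tune for.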
All this talk of numbers, of course, obscures other, less quantitative, factors a system architect should ponder when choosing an interconnect. For instance, do you need cards that can be swapped in or out while the machine is up and running? Will the network interconnect scale to the number of nodes you plan to use? In some cases, higher latency may actually be preferable if it means reaping other benefits, such as lower power consumption.
'Interrupt mitigation deliberately adds additional latency in return for lower overall system load,' Becker said. 'That is almost always a good trade-off.'
For NERSC's new system, the decision came down to a number of factors. 'We didn't say we wanted InfiniBand or any other particular interconnect,' Kramer said.
Instead, the center put out a series of benchmarks and a test application. Vendors were asked to build a system to hit those metrics using any technology they chose, keeping in mind that the price of the system had to be as low as possible.
NERSC awarded the contract to Linux Networx Inc. of Bluffdale, Utah. The cluster system is capable of a theoretical peak performance of 3.1 teraflops, with InfiniBand interconnects running among its 722 AMD Opteron processors.
Michael Hall, senior director of customer care and fulfillment at Linux Networx, said there is no right or wrong answer when it comes to choosing an HPC interconnect. Some agencies will take the high-latency, low-cost interconnect, while others require top-notch speed at any cost. The important thing is to compare apples to apples.
'The price-performance balance is really what determines who wins the bill,' Hall said.