How graphics processing units are being used for hard-core number crunching.
When the next version of the Mac operating system, code-named Snow Leopard, is released later this year, users might see some surprising boosts in speed, at least for some applications. The time it takes to re-encode a high-definition video for an iPod, for instance, could drop from hours to a few minutes.
The secret sauce for this boost? Snow Leopard will be able to hand off some of the number crunching in that conversion to the graphics processing unit (GPU). The new OS is scheduled to include support for Open Computing Language (OpenCL), which lets programmers tap into the GPU from their own programs.
Typically, the GPU, usually embedded in a graphics card, renders the screen display for computers. But ambitious programmers are finding that GPUs can also be used to speed certain types of applications, particularly those involving floating-point calculations.
For instance, researchers at Belgium’s University of Antwerp outfitted a commodity server with four dual-GPU Nvidia GeForce 9800 GX2 graphics cards. The server would be used to look for ways to improve tomography techniques. They found that this configuration could reconstruct a large tomography image in 59.9 seconds, which is faster than the 67.4 seconds it took an entire server cluster of 256 dual-core Opterons from Advanced Micro Devices.
The cluster cost the university $10 million to procure, whereas the researchers' server cost only $10,000.
For a certain group of problems, GPUs can provide a lot more computational power than an equivalent number of central processing units (CPUs), argued Sumit Gupta, senior product manager for Nvidia’s Tesla line of GPU-based accelerator cards.
To render visual displays, GPUs have been tuned to perform large numbers of floating-point computations. This sort of computation differs from the integer-based operations that CPUs usually perform in that integer arithmetic discards everything to the right of the decimal point, which can introduce small truncation errors. Floating-point operations keep the fractional part, rounding each value to fit in 32 bits (single precision) or 64 bits (double precision). The hard number crunching of scientific research, in particular, requires the accuracy of floating-point operations.
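The difference is easy to see in a few lines of code. The following is a minimal sketch in Python, not taken from any of the projects described here; the helper name `to_float32` is an illustrative assumption. It shows integer division dropping the fractional part, and compares how closely 32-bit and 64-bit floating-point values can represent 0.1.

```python
import struct
from fractions import Fraction

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Integer division discards everything to the right of the decimal point.
int_result = 7 // 2      # 3
# Floating-point division keeps the fractional part.
float_result = 7 / 2     # 3.5

# 0.1 has no exact binary representation, so both formats round it.
# Fraction(...) exposes the exact value each format actually stores.
exact = Fraction(1, 10)
err32 = abs(Fraction(to_float32(0.1)) - exact)   # single-precision rounding error
err64 = abs(Fraction(0.1) - exact)               # double-precision rounding error

print(int_result, float_result)
print(float(err32), float(err64))
```

The single-precision error comes out many orders of magnitude larger than the double-precision one, which is why scientific codes often care which of the two a processor supports well.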
Graphics cards have always excelled at this kind of floating-point computation, Gupta said. In order to portray tree leaves fluttering in the wind or water trickling over a streambed in the latest computer game, the GPU has to calculate the color, depth and other factors of each screen pixel, which requires heavy matrix multiplication to floating-point precision. These sorts of calculations are not unlike those scientists need to do to solve mathematical conundrums in molecular dynamics, computational chemistry, signal processing and the like.
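As a rough illustration of the kind of matrix arithmetic involved, the sketch below moves two 3-D points by multiplying each one by a 4x4 translation matrix. It is plain Python written for clarity, with made-up values and an illustrative `mat_vec` helper; a GPU would perform the same multiply-and-add pattern for millions of points or pixels at once.

```python
def mat_vec(m, v):
    """Multiply matrix m (a list of rows) by column vector v."""
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]

# A 4x4 matrix that shifts a point by (2.0, 0.5, -1.0) in homogeneous coordinates.
translate = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, -1.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Two example points, each written as (x, y, z, 1).
points = [
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 3.0, 1.0],
]

# Each point is transformed independently of the others,
# which is what makes this work easy to spread across many cores.
moved = [mat_vec(translate, p) for p in points]
print(moved)
```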
Nvidia, for one, has seen the interest in having the GPU do double duty and has modified some of its cards to make them fully programmable. The Nvidia Tesla C1060 computing board is being offered for the scientific crowd. It has one GPU with 240 processor cores and can deliver up to 933 billion floating-point operations per second.
To help programmers tap into this computational power, Nvidia has created a package of tools named Cuda. Part of this package is a library for the C programming language, called C Cuda. It offers a number of keywords that developers can use to break off portions of their code to run on the GPU. Developers include the library in their C code and then use its functions to mark the chunks of code that can be run in parallel.
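The general idea can be sketched without a GPU. The Python below is only an emulation of the programming style, not Nvidia's API: `saxpy_kernel` and `launch` are hypothetical names, and the "threads" here run one after another rather than in parallel. The point is the shape of the code: the programmer writes the work for a single element, and a launch step applies it across the whole data set.

```python
def saxpy_kernel(i, a, x, y, out):
    """Work done by one logical thread: compute a single output element."""
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Hypothetical stand-in for a GPU launch.

    Runs the kernel once per index. A GPU would run these
    invocations concurrently; this sketch runs them sequentially.
    """
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Each index is independent of the others, so the order of execution does not matter.
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)
```

Because no element depends on any other, the loop inside `launch` is exactly the part that could be handed to hundreds of GPU cores.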
Cuda has proven popular with developers. More than 75 research papers have been written on Cuda, and more than 50 universities teach how to use the platform, Gupta said. Certainly, the Cuda sessions were among the best-attended at the SC08 conference in Austin, Texas, last fall.
Even with tools such as Cuda, however, writing for GPUs makes the job of programming more complicated. For its own developers, government integrator Lockheed Martin, via its Advanced Technology Laboratories, is looking at ways to ease programming in heterogeneous processor environments.
"If you use a GPU, you need to learn the Nvidia compiler and learn how to put the appropriate extensions into your code in the GPU," noted Daniel Waddington, principal research scientist at Lockheed Martin’s labs. He is leading an effort to build what he calls a refactoring engine. Called Chimera, this software will be able to recompile code written in well-known languages so it can be reused across a wider variety of processors without the programmer needing to know the low-level implementation details of the GPUs or other new types of processors.
"The problem is not only are designers moving to multicore processors, but designers are coming out with new designs a few times a year," said Lockheed Martin research scientist Shahrukh Tarapore, who also is working on Chimera. “They have different programming models and different capabilities.”
Right now, Chimera works with the C and C++ languages, which are widely used within Lockheed Martin. If successful, Chimera could be used by the company’s programmers to quickly build programs that can take advantage of the latest processors — be they CPUs, GPUs or even some other design.
"Your source code is first transformed into an abstract syntax tree [so] it can be translated into other forms," explained Tarapore. This approach will also identify which sections can be broken into chunks that could be run in parallel. Those pieces are then pulled from the main body of the program and replaced with pointers to components that can execute the tasks on specific pieces of hardware