The multicore challenge

Another View | Guest commentary: The makers of multicore processors in PCs could learn a few things from high-performance computing

Cray J. Henry

Before the beginning of this decade, you could have spent millions of dollars on new computer hardware each year and never had to consider buying a multicore computer, unless you were in the market for a high-end supercomputer. Today dual-core machines are common on many desktops, quad-core machines are moving to market now, and the Sony PlayStation 3 gaming console has nearly double that number of computational elements supporting its main processor.

During the next several years we'll see processors with ever-larger core counts come to market. Arguably they will be much more powerful, but that raises a new question about how best to harness all that computational power: How can software developers create applications that use all the cores efficiently to solve your problem faster?

This transition is already in motion and, without a disruptive new processing technology, inevitable. The reason for the move is straightforward: People buying computers today want machines that are better than what they bought yesterday. But as technology has advanced, chipmakers have had to change how they define 'better.'

In the final decades of the 20th century, the standard metric was a processor's clock speed. Processors moved inexorably from 10 MHz up through 1 GHz. But as clock speeds moved into the gigahertz range, chip manufacturers encountered two main problems. First, processors became so much faster than the rest of the computer that they had to (and still have to) wait on slower memory systems for the information they need to continue working. Radical changes in processor architecture hid some of this delay, but they carried higher design costs and added complexity for software designers.

The second major problem chip designers encountered in moving past the 1 GHz mark is that faster clock speeds bring disproportionately higher power consumption and heat generation, which creates problems for consumers and for the data centers that host these machines.

The return of the FLOPS

Faced with a departure from the standard technology improvement cycle, manufacturers started to blend the ideas of capability and speed, resurrecting an older capability measure called FLOPS (floating-point operations per second). The advantage of FLOPS is that it can describe 'improving' computer performance even while clock speeds stagnate or decline. As manufacturers create smaller and smaller features, space becomes available on processor chips to host additional cores that together perform more floating-point operations in a single clock cycle, and today's computers can once again be marketed as 'better' than yesterday's.
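
To see why FLOPS can keep climbing while clock speed stands still, consider a rough back-of-the-envelope calculation. The sketch below is illustrative only; the core count, clock rate and operations-per-cycle figures are assumptions, not the specifications of any particular chip.

/* Rough sketch of a theoretical peak-FLOPS estimate. All figures below are
 * illustrative assumptions, not specifications of any particular processor. */
#include <stdio.h>

int main(void)
{
    double cores = 4.0;           /* assumed quad-core processor              */
    double clock_ghz = 2.5;       /* assumed clock speed in GHz               */
    double flops_per_cycle = 4.0; /* assumed floating-point ops per cycle,
                                     per core                                 */

    double peak_gflops = cores * clock_ghz * flops_per_cycle;
    printf("Theoretical peak: %.1f GFLOPS\n", peak_gflops); /* prints 40.0 */
    return 0;
}

Double the core count at the same clock speed and the peak doubles, which is exactly the kind of 'better' the new metric captures.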

The problem, of course, is the software. How can software developers create applications that use all of the cores efficiently on behalf of the user? When clock speeds were going up, the same old programs ran faster, usually with no effort on the part of the software developer. But as cores are added to processors at the same clock speed, software has to be adjusted to take advantage of the new capability. The challenge of writing parallel software has been the key issue for the computational science and supercomputing community for the last 20 years. There is no easy answer; creating parallel software applications is difficult and time consuming.
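
As a small illustration of what that adjustment looks like, here is a minimal sketch using OpenMP directives, one of the approaches discussed below. It assumes a compiler with OpenMP support (e.g., built with -fopenmp), and the arrays and loop are placeholders; the point is the single directive.

/* Minimal sketch of the adjustment involved, assuming a compiler with
 * OpenMP support (e.g., compiled with -fopenmp). The data and loop are
 * placeholders; the point is the directive. */
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Without this directive, the loop below runs on one core no matter how
     * many cores the processor has; with it, the iterations are divided
     * among the available cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];

    printf("a[0] = %f\n", a[0]);
    return 0;
}

Real applications are rarely this tidy; most of the effort goes into restructuring code so that independent, parallelizable loops like this one exist in the first place.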

The convergence of supercomputing and commodity computing

In the supercomputing community we have many applications that can effectively use dozens to thousands of cores, but these applications represent only a tiny fraction of the applications in use around the world today. The real value of multicore machines will not be realized until mainstream software development techniques and practices evolve to encompass the art of parallel programming. The emergence of multicore computers brings this challenge to the forefront.

This is an area in which high-performance supercomputing has an advantage of several decades over the mainstream computing community.

The main challenge in parallelization is dividing a task over all the cores such that they are all working on your problem at the same time. Most applications in use today follow very sequential logical approaches designed to run on one core; multicore developers have to be trained to think in terms of parallel approaches. They are now faced with issues such as how to keep data synchronized as results are computed, shared and used as input to follow-on calculations across tens to thousands of cores. With each core potentially running independently, this is a hard problem, especially if you don't want to waste compute cycles while individual cores wait for the slowest calculation to catch up.
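
Here is a brief sketch of two of those issues, again assuming OpenMP: a reduction keeps a shared result consistent as each core contributes partial answers, and dynamic scheduling hands out work in chunks so that faster cores aren't left idle waiting on the slowest. The work() routine and chunk size are illustrative placeholders.

/* Sketch of two parallelization concerns, assuming OpenMP (compile with
 * -fopenmp -lm). The reduction keeps the shared sum consistent; dynamic
 * scheduling balances uneven work across cores. work() is a placeholder
 * for a calculation whose cost varies from iteration to iteration. */
#include <math.h>
#include <stdio.h>

#define N 100000

static double work(int i)
{
    double x = 0.0;
    for (int k = 0; k < (i % 1000) + 1; k++)  /* deliberately uneven cost */
        x += sin((double)k);
    return x;
}

int main(void)
{
    double total = 0.0;

    /* Each thread accumulates a private partial sum, and OpenMP combines
     * them safely at the end instead of letting threads race on 'total'.
     * Dynamic scheduling hands out 64-iteration chunks as threads finish,
     * so no core sits idle waiting for the slowest chunk. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}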

In HPC, there are two main trends supporting the development of parallel software: the use of special language extensions and libraries that give the programmer explicit control over how work and data are coordinated among individual compute cores (e.g., the Message Passing Interface and OpenMP), and specialized parallel languages (e.g., Co-array Fortran and Unified Parallel C) that support both explicit communications and parallel logic constructs.

MPI has become the dominant approach in high-performance technical computing (HPTC), primarily because it is portable across multiple platforms and has a long legacy of vendor support. Developers of scientific applications in HPC can expect that successful products will be in use for 20 to 40 years, so they value portability.

But it's not clear that MPI can continue to dominate even in scientific software. Creating an MPI application requires a very low-level understanding of data and process coordination. Existing applications often need significant recoding, and the level of detail the programmer must manage can make producing a verifiably correct application that scales to tens of thousands of processors (or cores) a very expensive and time-consuming process.
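
A minimal sketch gives a sense of that level of detail: every process must discover its own rank, carve out its own slice of the data and explicitly take part in the communication that combines the results. This assumes a standard MPI installation (built with mpicc and launched with mpirun); the computation itself is a placeholder.

/* Minimal MPI sketch, assuming a standard MPI installation (build with
 * mpicc, launch with mpirun). The computation is a placeholder; the point
 * is how much coordination the programmer spells out by hand. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes total? */

    /* Each process works out its own slice of the index range by hand. */
    long chunk = N / size;
    long lo = rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0 / (double)(i + 1);

    /* The programmer, not the compiler, decides when and how partial
     * results move between processes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}

Even this toy example has to handle the leftover iterations when the problem size doesn't divide evenly across processes; real applications multiply that bookkeeping many times over.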

Parallel languages have picked up momentum over the last several years because they offer a path to faster and more straightforward software development, but while their portability is growing, they are not yet as portable or 'future-proofed' as MPI.

As the computing community struggles with this latest transition, we're finally at a point where HPC and commodity computing have more than shared chips in common. The trick will be working together to take the best of what we know works at large scale, avoid the techniques we already know don't work, and arrive faster at a solution that benefits us all.

Cray J. Henry is director of the Defense Department's High Performance Computing Modernization Program. E-mail him at cray@hpcmo.hpc.mil. An abridged version of his comments appeared in the Sept. 24, 2007, issue of GCN.
