How to write apps for multiple cores: Divide and conquer

How do you find a bug in a program when that program runs on more than 200,000 processors?

As incredible as that scenario might sound, it is becoming a routine problem for the Energy Department's Lawrence Livermore National Laboratory (LLNL), home of the 212,992-core BlueGene/L supercomputer. To help spot bugs, laboratory researchers, along with researchers from the University of Wisconsin, have developed a new software program, named the Stack Trace Analysis Tool.

"What we are finding is that today's architectures require novel [debugging] techniques," said Livermore researcher Gregory Lee, who presented a paper about the new software at the SC 08 conference, held in Austin, Texas, last fall.

Although LLNL is an extreme example, most agencies, with their fleets of servers and desktop machines, are facing the same challenge: how to make the most of multiple processing cores within a single chip.

Most computers, except for economy-minded netbooks, now come with multicore processors, CPUs made up of multiple processors side-by-side. Earlier this month, for instance, Advanced Micro Devices unveiled its latest line of desktop PC processors, both of which have two cores per processor: the Athlon II X2 250 and AMD Phenom II.

This move toward multicore processors — and, in the supercomputer world, multiple processors — does not immediately guarantee great performance increases, though. And the IT industry is just beginning to figure out how to rewrite computer programs to get the most from this new architecture.

"I sense there will be a shift in the computer industry as the core count keeps increasing," said Eng-Lin Goh, chief technology officer at SGI. "There will be more emphasis on the software side: The onus will be on trying to scale your code."

From one to many

Since the Ballistics Research Laboratory-funded Electrical Numerical Integrator and Calculator (ENIAC) machine went live in 1946, computers have always completed one instruction at a time.

Of course, since the vacuum tubes used for ENAIC were replaced with transistors on microchips, the speed at which chips carry out their instructions has doubled approximately every 18 months, a trend dubbed Moore's Law, first observed by Intel co-founder Gordon Moore.

But until recently, computers have only presented the illusion of being able to juggle multiple tasks, such as allowing users to play Solitaire while they check their e-mail, thanks to the sheer speed of the processors. A processor running at 2 GHz executes a new instruction 2 billion times a second.

And so, until now, a software programmer's task was relatively easy: Write a series of sequential instructions for processors to execute one after another. If a developer wrote a program that runs too slowly, the joke goes, he or she would merely need to go on vacation for several months, and upon return, the newest chip would run the program with sufficient speed.

Those days have ended, however. In the past few years, Intel and other chip makers found that the more cycles a chip executed per second, the hotter it would get. After chips hit about 3.7 GHz earlier this decade, the heat became unworkable, at least without advanced cooling techniques. So chip makers have switched their strategy: They are advancing their products not by making them faster but by putting more cores onto each chip.

"You are getting more and more cores, but the clock speed has stagnated," Goh said. As a result, "the [emphasis] has shifted from the software developers waiting on the CPU to become faster to one where it is the software developers' work to keep trying to scale their programs from one core to more cores."

For programmers, this multiplicity of cores complicates how they should write programs. Think of the difference between executing a job on a single core and multiple cores as the difference between one person painting a room and a group of people painting a room, said James Reinders, who works at Intel's software products division and wrote a book on parallel programming, "Intel Threading Building Blocks."

One person can do the job sequentially, starting at one side of the room and working through to the other side. Coordinating the efforts of multiple people, however, requires some planning, especially if the job needs to get done in less time.

"The challenge is that we have not, in general, designed our applications to express parallelism," Reinders said. "We haven't designed programming languages to make that easy."

Parallel programming requires attention to two areas, Reinders said. One is decomposing the problem in such a way that it can be run in multiple chunks, side by side. "The art of figuring out how to divide things up so that they are independent is a big challenge," he said. One operation can't be dependent on another operation that hasn't been completed yet. The second area requiring attention is scalability. The programmer does not know how many processors his or her creation will run on, just that it should run on as many processors as are available for the task.

At the JavaOne conference in San Francisco earlier this month, eBay engineer Sangjin Lee sketched out a simple example of how things could go awry by using multiple cores side-by-side. He showed the audience the code for a simple task: Counting the frequency of each word in a file filled with words.

For a single processor, the job is easy: Go word-by-word through the file. On the first occurrence of a word, the program adds that word to a list of all the words, alongside the number "1" to indicate how often that word has appeared in that file. For the subsequent appearance of the word, the program will increment that number by 1.

But for multicore processors with four cores working on the list, this approach could cause confusion if the program is not written with sufficient nuance. For instance, say the program is broken down into two identical programs to run simultaneously on two different cores. The threads on the two different cores might come across the same word at the same time, and both will create an index entry of that new word at the same time. That could crash the computer or result in duplicated entries. Also, if two threads update a word entry at the same time, an incorrect total could result.

Breaking up the problem so that you still get correct results is complicated enough, but there is an additional trick: You must break up the problem in such a way that any gains in efficiency aren't eaten up by the overhead it takes to manage the execution across numerous cores. In other words, you must manage the painters in the room efficiently enough that the time taken to plan their activities doesn't eat up all the time saved by having multiple workers on the job.

This trade-off was perhaps first articulated by IBM computer architect Gene Amdahl. Amdahl observed that the performance gains expected by breaking a task into multiple, simultaneously executed parts will be offset by the additional overhead required to manage this new, more complex way of executing the problem. Engineers now refer to this balancing act as Amdahl's Law.

"Amdahl's Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added," wrote John Hennessy and David Patterson in their 2006 textbook, "Computer Architecture."

Brainstorming

Fortunately, help is on the way — from both the public and private sectors. Two approaches seem to be taking shape. One is creating entirely new languages from scratch that are better suited to a multicore environment. The other is to write additions to existing languages that can automatically review code and look for places where the program can be broken into sections that can run simultaneously.

The Defense Advanced Research Projects Agency has been working on the issue through its High Productivity Computing Systems program, at least for what is called coarse-grained parallelism, or programs that run across many processors. It has funded the development of a number of new languages that developers could use to write such programs.

One DARPA language created under this model is Chapel, which Cray is developing. Chapel was designed to "reduce the gap between parallel languages and the mainstream" languages, Cray engineer Brad Chamberlain said at the JavaOne conference.

IBM is creating another DARPA-funded language, called X10, that can run on a Java Virtual Machine, making it usable across multiple platforms. Again the focus is on familiarity. The plan was to "start with a sequential programming model that works" and add more elements of concurrency and distribution, said IBM researcher Vijay Saraswat.

Perhaps the most radical approach being tried by the development community is that of using functional languages. Although functional languages, such as Haskell, have been around for decades, they have not been widely used, probably because they are harder to master. In functional languages, programs are broken into a composite of functions, said Vlad Vinogradsky, a senior architect evangelist for Microsoft's public-sector group. Functions "calculate a return value, and that is all that they do," he said. This approach assures that if a value is changed, it won't inadvertently change another part of the program.

Microsoft is working on a functional language called F#. Sun Microsystems, with DARPA's help, also is working on a largely functional language, called Fortress, that follows those same rules.

While DARPA and others create entirely new langauges, other developers are working on modifying existing ones.

At the SC 08 conference, other researchers showed off how they were extending popular languages for parallel duties. For instance, George Washington University researcher Bill Carlson is developing Unified Parallel C (UPC), an extension of the C programming language for parallel environments. University of California Berkeley researchers are working on a dialect of Java, called Titanium. Elsewhere, John Mellor-Crummey presented his work on Co_Array Fortran, an extension of Fortran 95 that is also being prepared for the next version of that language.

Private companies have also taken the initiative to tackle multicore processors. One is, not surprisingly, Intel, which has developed an extension to C++, called Threading Building Blocks. To build TBB, Intel developers rewrote the aspects of C++ that might lead to unpredictable results when used in a multiprocessor or multicore environment, such as memory management.

In use, developers just include a link to the TBB library files in their code headers, and the TBB functions will be compiled into the code. Intel offers an extension to Visual Studio, called Intel Parallel Studio, that supports TBB. Using TBB, programmers don't need to worry about writing for multiple or multicore processors.

Reinders offered an example of how a TBB-enhanced C++ app could work. Say a program is running across all four cores of a quad-core processor. But when another program, such as a virus-checking program, is loaded onto one of the cores, performance of the part of the program running on that one core slows, which in turn slows the entire program. TBB would see that slowdown and move that portion of the program off that core and onto the other three.

The Java community is also trying to make it easier to write Java programs for multicore processors. At the JavaOne conference, Jonas Bonér of Swedish consulting firm Crisp AB, described a number of different approaches.

One is called software transactional memory (STM), which acts like a transactional database. In a database, if a deadlock — when two threads vie for the same resource and lock the system — occurs, the database rolls back to an earlier state, and both parties try to execute their actions again, perhaps on a staggered schedule. Each operation is atomic, meaning that if the entire sequence isn't completed, the program returns to the state it was in before it started the transaction.

Although Java does not offer STM, another language that runs on Java Virtual Machine — a dialect of Lisp called Clojure — does. "Using Clojure, unless the entire set of transactions are completed, [the program] throws an exception," Bonér said. Atomic operations can be constructed within Java, but it requires more expertise on the part of the programmer.

Another method is message-passing concurrency, which seemed to be the most-discussed alternative at this year's JavaOne conference. In that approach, programs are broken down into a series of individual lightweight processes, which communicate with one another through messages. The components, which can also be called actors, have no concurrent operations and instead perform all their operations sequentially.

That messaging approach "raises the abstraction level," Bonér said. “It is easier to reason about. It is easier to see where the problem might be.”

One of the biggest problems with breaking up Java among multiple cores is that threads that are supposed to execute in a certain order might run in a different order because they are running side-by-side, which could lead to incorrect results.

Message passing would eliminate that problem. As an example, Bonér presented a program written in the Scala programming language that emulates a ping-pong game. The game consists of two actors, or lightweight processes. The sole function of each actor is to direct the ball back to the other actor. Breaking down each process into individual components maintains the orderly progress of the game.

Whatever the approach, the goal of writing programs that can run concurrently on several processors or processor cores remains elusive. "Concurrency is complicated. As an industry we are still coming to grips with the programming model," said Sun Microsystems engineer Brian Goetz at the JavaOne conference. "I think there is an incremental path to get there, but I do think we need to change the way we think."

Reader Comments

Sat, Nov 21, 2009 herbs http://www.vitabits.co.uk/chinese-herbs

Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. There are several different forms of parallel computing: bit-level, instruction level, data, and task parallelism. Parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.

Mon, Jun 15, 2009 arty

The issuer of "parallel computing" has been with us for many decades. It is indeed difficult to solve, and the issue is quite technical -- it is not a political question of old thinkers vs young or corporate greed vs the nobility of the free thinker. The problem (in the abstract) is that when you try to have many processors work on a single solution, they can, individually or in groups, interfere with each other slowing the processing. Modern operating systems are written to allocate computational resources to the programs being executed, and that is a primary responsibility for the operating system. It is called multitasking, and when the list of individual programs is large, the OS will tend to allocate all the processors. This is the usual situation, and unless some application or concept like virtualization bypasses the OS, the problem with multiple cores is not a significant one. The situation is different in the fairly unusual case where there is one single but very compute-intensive program (job, task, etc). This means taking a process, perhaps conceived in linear fashion or perhaps never having parallel execution considered, and working it into a form where many aspects of the program can be executed efficiently, in parallel, and produce the same result in less time. That is the crux of the problem discussed in this article, mostly by informing who is working on it rather than telling the technical issues. If the installation is a large compute farm rich in processes, multi-core processors can be treated as you would any multiprocessing server. If the installation is a supe-computer complex dealing with massive computational problems, then chances are you are already formulating the problem for multi-processors, and the multi-core machines fit right in. If the installation is a home office (and you are creating advanced applications for yourself), you might have to apply some of the techniques that the super-computer labs use to writing for your new, 12-core desk side multiprocessor. The literature on this is rich, and it behooves the processor folks (Intel, AMD and their direct customers) to provide lots of help. So perhaps this huge "problem" is really a non-problem.

Fri, Jun 12, 2009 wk edited

Mr. Jackson writes,"Whatever the approach, the goal of writing programs that can run concurrently on several processors or processor cores remains elusive." The reason is not all that hard to understand. Computer science is controlled by a bunch of aging baby boomers who have simply run out of ideas. Why? Because they are old and set in their ways. All the boomers worship a false god called the Turing Machine. Guess what? The Turing computing model is not only useless in the search for a solution to the parallel programming crisis, it IS the cause of the crisis. Surprise! Now, it is obvious that the old computer geeks are not going to accept this truth. The solution (challenge?) for the industry is to encourage those gentle folks to gracefully retire and replace them with younger minds who can leap above the antiquated and bankrupt paradigms of the past and forge a new future. I am particularly annoyed by those who insist that parallel programming should be about decomposing sequential programs into concurrent tasks or threads. This is preposterous. Where did this nonsense come from? If composition is the correct approach to sequential programming, why should parallel programming be any different? The functional programming fanatics are equally out to lunch. FP has been around for ages. If FP were the solution, Mr. Jackson would not be writing about the parallel programming problem at this late stage of the game. But that's a different story. It's time for the computer industry to stop thinking in 20th century mode and jump boldly into the age of massive parallelism where Alan Turing is no longer the king of the hill. That will take more guts than brains, in my opinion. Is Intel, Apple, HP, AMD or Microsoft ready to take the big jump? I don't think so. Nvidia? Maybe. Regardless, sooner or later, a maverick startup will sneak up behind those squabbling old timers and steal the pot of gold. And there shall be weeping and gnashing of teeth.

Thu, Jun 11, 2009

Don't forget about Erlang which as been designed to work on multi-core/processor/node systems. It too uses message passing.

Please post your comments here. Comments are moderated, so they may not appear immediately after submitting. We will not post comments that we consider abusive or off-topic.

Please type the letters/numbers you see above