Multicore does not mean equal core
As anyone who has worked on a group project knows all too well, not all team members contribute equally to the success of a project. And now Virginia Tech researchers have found the same holds true for the cores in multicore processors.
Depending on how your code is distributed across seemingly identical cores, the speed at which that code is executed on a multicore processor can vary by as much as 10 percent.
If you've ever had a program run slower than expected, or run quickly one day and not as sprightly the next, you might want to examine how the CPU is executing the job.
"The solution to this is to dynamically map processes to the right cores," said Thomas Scogland, a Virginia Tech graduate student who summarized this quirk at the SC08 conference in Austin, Texas, last month. Scogland and fellow researchers, with help from the Energy Department's Argonne National Laboratory, developed prototype software that could one day help balance performance more equally across all cores. DOE also helped fund the work.
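On Linux, the basic mechanism for mapping a process to a chosen core is the scheduler-affinity interface. The sketch below is a minimal illustration of that mechanism only; it is not the SyMMer software, whose internals the article does not detail.

```python
import os

def pin_to_core(pid: int, core: int) -> None:
    # Restrict the given process to a single CPU core (Linux only).
    os.sched_setaffinity(pid, {core})

if hasattr(os, "sched_setaffinity"):     # not available on every OS
    allowed = os.sched_getaffinity(0)    # pid 0 means the calling process
    target = min(allowed)                # pick one core we are allowed to use
    pin_to_core(0, target)
    print(os.sched_getaffinity(0))       # now a single-core set
```

Dynamic mapping, as Scogland describes it, would call something like `pin_to_core` repeatedly as conditions change, rather than once at startup.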
In the past few years, Advanced Micro Devices and Intel have developed multicore chips as a way of boosting performance over previous generations of commodity microprocessors. They have moved from two cores to four or even eight cores per chip.
Developers and systems engineers have mostly expected each core in a multicore processor to have the same effective capability. However, that is not necessarily the case for a variety of reasons, Scogland said.
In all fairness, it's not the cores' fault, technically speaking. Although the cores are identical, how a program is distributed among the cores can affect how quickly it runs. And in most cases, the operating system and hardware spread a program across multiple cores rather arbitrarily, which leads to varying performance.
A number of factors contribute to that variance, the researchers said. One is how the hardware handles interrupts. In many systems, interrupts are directed to a single core, which can slow other applications running on that core. If interrupts are instead distributed dynamically across all the cores, there is no guarantee that the core handling an interrupt is the same one running the program the interrupt was intended for, so additional communication between the two cores is needed.
Memory issues also play a role. On some processors, such as those from Intel, each core gets its own L1 or L2 cache. That approach speeds data fetching when the data is in the core's own cache, but it can actually increase retrieval time when the data resides in another core's cache. Multiple cores also mean that data can be blocked from one core while a second core is using it.
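One way to observe this kind of variance directly is to pin the same small workload to each available core in turn and compare wall-clock times. A rough sketch, assuming a Linux system (timings will differ run to run, and the loop here is far simpler than the scientific codes the researchers measured):

```python
import os
import time

def workload() -> float:
    # A small CPU-bound loop; we compare its wall time across cores.
    start = time.perf_counter()
    total = 0
    for i in range(500_000):
        total += i * i
    return time.perf_counter() - start

if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)
    for core in sorted(original):
        os.sched_setaffinity(0, {core})      # run on this core only
        print(f"core {core}: {workload():.4f}s")
    os.sched_setaffinity(0, original)        # restore the original mask
```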
Finally, the programmer can affect performance. When using parallel libraries such as the Message Passing Interface (MPI), programmers can inadvertently skew the execution of programs, creating "unintended out-of-sync communication patterns," the researchers wrote in a paper for the conference, titled "Asymmetric Interactions in Symmetric Multicore Systems: Analysis, Enhancements and Evaluation."
All those factors mean that how the program is distributed across different cores can determine how speedily it is executed. Further bedeviling attempts to quantify performance on multicore systems is the fact that those factors can vary from one server to the next, even if the servers are identical. Configuration plays a big role in the variance.
To capitalize on their studies, the researchers developed a prototype of a new performance management library, called the Systems Mapping Manager (SyMMer). It is a process-to-core mapping system that uses a collection of heuristics to identify when the problems the researchers described take place. It can then rearrange the layout of the program's processes on the cores so that they are executed in the most efficient manner.
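The article does not describe SyMMer's heuristics in enough detail to reproduce them, but the general shape of such a remapping step can be illustrated with a deliberately hypothetical example: given observed per-process run times and per-core speeds, move the slowest process onto the core that has been quickest.

```python
def remap_slowest(timings: dict[int, float],
                  core_times: dict[int, float]) -> dict[int, int]:
    # Hypothetical illustration only, not SyMMer's actual algorithm.
    # timings: pid -> observed run time; core_times: core -> observed time.
    slowest_pid = max(timings, key=timings.get)
    best_core = min(core_times, key=core_times.get)  # lowest time = fastest
    return {slowest_pid: best_core}

# Example: process 17 is lagging; core 2 has been the quickest.
mapping = remap_slowest({12: 1.1, 17: 1.9}, {0: 0.9, 1: 1.0, 2: 0.7})
print(mapping)  # {17: 2}
```

Applying such a mapping on Linux would then come down to calls like `os.sched_setaffinity(pid, {core})` for each pair; a real system would also need guards against oscillating between mappings.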
The researchers tested SyMMer against a generic communications library and ran a number of widely used scientific applications, such as the GROMACS and LAMMPS molecular modeling programs.
They found that the SyMMer-based rearrangement of how those programs were executed could bring about a 10 to 15 percent improvement in how quickly an application ran.
"In the end, efficient mapping of processes can improve performance," Scogland said.