Why high-performance clouds are best kept in-house

Agencies with HPC infrastructures have commercial providers beat on cost and performance

Most commercial entities don’t have the infrastructure to handle the intensive workloads of high-performance computing. And if they do, it will probably be more expensive (in one case, 10 times more expensive) for them to run dedicated services than for some agencies to run their own private clouds.

At least that’s the case for officials from Lawrence Berkeley National Laboratory, the National Institutes of Health, and the National Oceanic and Atmospheric Administration, who spoke at a recent symposium on high-performance computing.

Industry can still play a role by developing software code that is technology-independent, offering better workload management software, and providing greener, lower-power-consumption technology, the officials told attendees at the symposium, sponsored by AFCEA’s Bethesda chapter in Washington, D.C.

The Energy Department’s Lawrence Berkeley National Laboratory compared workloads on Magellan, Energy’s test bed for providing HPC in the cloud, with other commercial cloud providers in the areas of performance and cost, said Katherine Yelick, associate director for computing science at the Berkeley Lab.

Amazon EC2 has a clustered cloud service that is very competitive with Magellan in terms of performance, but it would cost 20 cents per CPU hour, while the lab runs workloads at less than 2 cents per CPU hour. The lab is a nonprofit agency, so it charges the government only what it costs the lab to run its systems.
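The quoted rates imply a roughly tenfold cost gap. A back-of-the-envelope sketch of what that gap means over a year, using a purely hypothetical 1,000-core cluster (the article does not give a cluster size):

```python
# Back-of-the-envelope comparison of the per-CPU-hour rates quoted in the
# article. The cluster size is illustrative only, not a figure from the lab.

HOURS_PER_YEAR = 24 * 365          # 8,760 hours
CORES = 1_000                      # hypothetical cluster size

ec2_rate = 0.20                    # quoted Amazon EC2 cluster rate ($/CPU-hour)
lab_rate = 0.02                    # Berkeley Lab's in-house rate (actually below this)

ec2_yearly = ec2_rate * CORES * HOURS_PER_YEAR
lab_yearly = lab_rate * CORES * HOURS_PER_YEAR

print(f"EC2:      ${ec2_yearly:,.0f}/year")     # $1,752,000/year
print(f"In-house: ${lab_yearly:,.0f}/year")     # $175,200/year
print(f"Ratio:    {ec2_rate / lab_rate:.0f}x")  # 10x
```

Even on this toy cluster, the rate difference compounds to seven figures per year, which is the gap Yelick describes.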


The real issue is the cost model in the cloud, where it is more expensive to run dedicated services, Yelick said. The Berkeley Lab runs tightly coupled applications in a parallel-processing supercomputing environment that solves problems related to alternative energy sources, astronomy, biofuels, and global climate change.

“Our challenge is the cost of using commercial providers” and finding the right cost point for storage and bandwidth, said Adriane Burton, director of the division of computer system services in the National Institutes of Health’s Center for Information Technology.

Going out to a commercial provider would cost two to three times more than using the center’s internal cloud, which serves the needs of scientists and researchers who transmit data across the cloud. The center supports 300 applications that run over a cluster of 9,000 shared processors and has 1.5 petabytes of data in storage.

Application performance is the first thing Berkeley Lab officials look at when purchasing computer systems, Yelick said. As part of the study on cloud computing, the lab shrank an application suite down to the size of a midrange job, a very small piece of the lab’s workload that Yelick’s team thought would be amenable to a commercial cloud setting. However, they discovered that, in a standard virtualized setup, the applications ran up to 50 times slower.

Many in the lab’s user community wait for huge batch jobs to make their way through the queue. However, some users do semi-real-time processing of data that comes in from satellites every night. Because this workload is a bit different, the lab runs more specialized services for it. That data might be better suited to the commercial cloud, where the lab could get better utilization for a non-uniform workload, Yelick said.

NOAA has a massive network for moving around terabytes and, in some cases, petabytes of data, and commercial entities might not have the supporting infrastructure, said Joe Klimavicz, CIO and director of high-performance computing and communications at NOAA.

The agency, which predicts changes in the Earth’s weather and environment, runs an international private cloud used by 29 countries and researchers within and outside of government, Klimavicz said. NOAA produces 80 terabytes of scientific data every day, he noted.

Although most of the agency’s data is available for public consumption, system availability, data integrity, security, speed and agility are all important issues. Like the Berkeley Lab, NOAA uses 100 percent of its systems. “There’s not a lot of extra capacity. We can consume as much computing as people can make available,” Klimavicz said.

Since these agencies have their own internal clouds, Dimitry Kagansky, chief technology officer of Quest Software Public Sector and moderator of the panel, asked the agency officials how industry can help.

One way is technology independent programming, Klimavicz said. NOAA researchers generate extensive models connected with their task of predicting changes in the Earth’s environment. Often when these models are run on different technology the answers or outcomes vary, he said. Sometimes it can take weeks or even months to determine if researchers are getting the same answer when the models are run on different technology.

“We need a way to develop code that is independent of technology,” Klimavicz said.

NOAA could also use better workload management software: job-scheduling tools that ensure systems deliver huge workloads to the right place at the right time.

Industry could help NIH by developing greener systems that lower power consumption, Burton said. Additionally, vendors should keep in mind that they need to build a relationship with NIH that ensures ongoing support for the technical environment, not just a one-time engagement.

The Berkeley Lab worked with Amazon on the cloud comparison studies, and the company is taking some of the lab’s suggestions into consideration, Yelick said.

Very lightweight virtualization and high-speed networking can be done within a commercial cloud setting for different kinds of workloads, she said.

“We don’t have to provide the kind of urgent access to systems that in general has attracted people to commercial cloud offerings,” Yelick said. The commercial cloud does not yet offer that kind of elasticity for the large job sizes the lab runs.

About the Author

Rutrell Yasin is a freelance technology writer for GCN.

Reader Comments

Tue, Apr 5, 2011 Chance Reschke University of Washington

We've struggled with quantifying these things at UW and have produced a document that attempts to reveal the total cost of operating an in-house HPC system and to compare those costs to AWS. You can see the document here: http://tinyurl.com/3wasagv Our cost to operate a 12-core node for a year is about $2,800. After normalizing for performance, and assuming zero data storage or transfer costs, the EC2 equivalent comes in at about $8,700/year. Of course, the EC2 cluster isn't really equivalent (no high performance shared storage, no low latency network, etc.), but it's the closest thing they offer.

Thu, Mar 31, 2011

Was curious about this myself but couldn't find a source for the claim. I found what I think is the overall budget for the organization (NERSC) in a DOE budget document (~$55M/year) and the NERSC web site lists its systems and the number of cores. If I take the top 3 (below that is noise) I get 194,736 cores. That would make Total Yearly Budget/Yearly Core Hours ~ 3.2 cents/core hour. Not an exact match but if I'm looking at the right organization it seems close enough to pass the bull$#!+ test.
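The arithmetic in that estimate checks out. A quick reproduction of it, using the commenter's own figures (neither the budget nor the core count is an official NERSC number):

```python
# Reproduces the commenter's estimate: approximate yearly budget divided by
# total yearly core-hours across the three largest systems. Both inputs are
# the commenter's figures, not official numbers.

budget = 55_000_000          # approximate yearly budget, dollars
cores = 194_736              # sum of the three largest systems' core counts
hours_per_year = 24 * 365    # 8,760 hours

cost_per_core_hour = budget / (cores * hours_per_year)
print(f"{cost_per_core_hour * 100:.1f} cents/core-hour")  # 3.2 cents/core-hour
```

That 3.2 cents assumes the whole budget goes to those three systems and that every core runs year-round, so the true marginal cost could land on either side of the article's "less than 2 cents" figure.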

Tue, Mar 29, 2011 Bill Texas

At the stated value of less than $0.02 per CPU hour, the lab's cost to run an 8-core Nehalem system would be under $1401/year. It would be great to see a breakdown of this cost, i.e. compute hardware, storage, networking, staff, facilities, overhead, etc.
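At exactly 2 cents per CPU hour, an 8-core node running around the clock comes to $1,401.60 a year, which matches the commenter's figure:

```python
# Sanity check of the commenter's arithmetic, assuming exactly $0.02/CPU-hour
# (the article says the lab's actual rate is below this).

rate = 0.02          # $/CPU-hour
cores = 8            # 8-core Nehalem node
hours = 24 * 365     # 8,760 hours/year

yearly = rate * cores * hours
print(f"${yearly:,.2f}/year")  # $1,401.60/year
```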
