How standards helped Oak Ridge tame its data center
- By Kathleen Hickey
- May 04, 2015
The data center at Oak Ridge National Laboratory had an efficiency problem.
According to Scott Milliken, Oak Ridge's computer facility manager, it started with how equipment was being purchased -- via grants. Scientists using the data center were responsible for purchasing equipment for their projects. In some cases, grants specifically stated that upgrades purchased with grant money must be used exclusively for that project, even after the project ended.
So while the ORNL data center was created in 2001 to help departments pool resources and provide space for three supercomputers and the departments’ computer equipment, responsibility for the management, purchasing and upgrading of the equipment in the space was scattered across the lab.
That meant there was no standard, formalized process for buying departmental equipment. Likewise, there were no sitewide upgrades; users only paid for the minimum upgrade they needed for their own projects. The scientists weren’t too keen on the idea of standardization either, because they felt they couldn’t be on the “leading edge, or unique,” Milliken said in his presentation at the recent Data Center World conference in Las Vegas.
Those who paid for the technology “[felt] entitled to determine who can use it,” he said. “Fiefdoms [were] created and maintained. Just kidding. They [were] not maintained.”
At the same time, the scientists purchasing the equipment were not experts in data center management. Equipment labeling -- done by users -- was confusing. Cables were reused without relabeling, electrical circuits weren’t updated and systems were renamed without documentation.
“The lab built a nice data center, but then simply assumed that the users would know how to act to keep it that way,” said Milliken in an interview at the conference.
The result: unmanageable workflows. There was no way for the technicians to know how much time a job would take or whether the task was completed correctly. There were unplanned outages, and the overwhelming workload left no room to manage issues proactively.
“Kludges were the norm, rather than the exception,” he said.
The situation needed to change to be manageable. The data center needed to have equipment that was properly documented, standardized and managed by data center staff.
To start, ORNL built an entirely new data center about a year and a half ago. Since it launched, no new equipment has gone into the old facility, Milliken said in a Data Center Knowledge article.
To move people away from a “bring your own equipment” approach, Milliken adopted a carrot-and-stick strategy: His group would host the equipment for free if it met the data center’s specifications. Otherwise it would cost money, he told Data Center World. The result was positive. Users started ordering equipment that was within the center’s specifications.
Before, “our biggest pain point was electrical,” Milliken said in the speaker interview. “There was a single source dependency so you had to schedule downtime.” The next biggest issues were networking and cooling. “We changed the point of view from an apartment manager to a business hotel manager. In a business hotel, you don’t bring your own furniture. You just bring your luggage and check in.”
The changes are not happening overnight -- and the old data center has not been decommissioned. Rather, as equipment reaches its end of life it is gradually phased out, and replacements are installed in the new facility. Once the old facility is empty, the plan is to remodel it into a modern facility, Milliken said.
Milliken described the change as similar to eating an elephant, rather than flipping a light switch. “Lasting change happens when you can show the benefit, rather than barking orders,” he said.
His advice to others: manage expectations, create standards, define the end goal, document what’s in the center, perform regular audits and develop a growth plan.
Data center managers must determine what these standards are, starting with code requirements, recognized public standards and best practices. Cabinets, power strips, cooling and air flow management should all be standardized.
However, “it’s less important what you standardize on and more important that you standardize at all,” he said.
IT managers then need to fully document and audit the data center, starting with what the center has today, Milliken said. That means determining who owns each piece of equipment, what is misconfigured and how long each piece will stay. The data center must control the documentation, and all changes must go through the data center's process. The documentation should include an internal operations procedure.
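As a rough illustration of the audit described above, the following Python sketch records the three facts Milliken names (owner, configuration status, expected stay) for each piece of equipment and lists the gaps. It is not ORNL's tooling; the `Equipment` class, its field names and the `audit` function are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Equipment:
    """One inventory entry. All field names are hypothetical."""
    label: str                        # label printed on the equipment
    owner: Optional[str]              # who owns it (None if undocumented)
    meets_spec: bool                  # matches the center's standard configuration?
    planned_removal: Optional[date]   # how long it will stay (None if unknown)


def audit(inventory: List[Equipment]) -> List[str]:
    """Return one finding for each documentation gap in the inventory."""
    findings = []
    for item in inventory:
        if not item.owner:
            findings.append(f"{item.label}: no documented owner")
        if not item.meets_spec:
            findings.append(f"{item.label}: does not meet standard configuration")
        if item.planned_removal is None:
            findings.append(f"{item.label}: no planned removal date")
    return findings
```

Calling `audit` on a list of `Equipment` entries returns an empty list when every entry has an owner, meets the standard and has a removal date. Each missing item otherwise yields one line a technician could act on.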
Documentation and standardized installations allowed both users and technicians at ORNL to understand the system and learn where improvements were needed, he said. Technicians were able to operate more efficiently, with pre-printed, accurate labels reducing the time required to fix problems and decreasing unplanned outages.
As a result of standardizing equipment, ORNL reduced downtime, improved both physical and service safety, and created a predictable, repeatable process with a known cost.
Kathleen Hickey is a freelance writer for GCN.