Sandia releases cluster management tool
- By Joab Jackson
- Jan 18, 2007
The Energy Department's Sandia National Laboratories has publicly released a cluster management application under an open-source license. The program, called Ovis, can monitor the performance and health of individual computers within a cluster or other networked environment.
Ovis differs from commercial products for monitoring and analyzing computational platforms in that it takes a statistical approach to identifying abnormal performance, said Philippe Pebay, a member of the Sandia technical staff who helped develop the software.
Sandia's labs started working on the software two years ago, following a year of preliminary research into the field of intelligence monitoring. Commercial products monitor for absolute thresholds, which are often set at a high level to prevent false positives. For instance, Intel offers software that alerts administrators when the temperature of a CPU goes above a certain point.
When the researchers first investigated these tools, they found that in most cases they 'were pretty crude,' Pebay said. 'They were not able to address some of the problems we were facing.'
The CPU heat monitor, for instance, is of only limited help in data center environments, where the ambient temperature fluctuates. A monitor built around a high absolute threshold also does not register when processors are running cooler than normal, perhaps because a fan is running constantly. Such a constantly running fan may burn out sooner than expected, leaving the CPU to overheat.
Ovis, which runs on Linux, compares individual nodes with one another to derive a statistical norm for all the nodes in a cluster. The software can also show a visual spatial distribution of a given cluster.
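To illustrate the general idea of comparing nodes against a cluster-wide norm, here is a minimal Python sketch. It is not Ovis code and does not reflect how Ovis computes its statistics; it assumes a simple z-score test, and the function name `flag_outliers`, the `z_limit` parameter and the sample temperatures are all hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(readings, z_limit=2.0):
    """Return {node: z_score} for nodes far from the cluster-wide norm.

    Assumption: a plain z-score test stands in for whatever statistical
    model a real monitoring tool would use.
    """
    values = list(readings.values())
    if len(values) < 2:
        return {}  # a norm cannot be estimated from fewer than two nodes
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return {}  # every node reports the same value, so nothing stands out
    flagged = {}
    for node, value in readings.items():
        z = (value - mu) / sigma
        if abs(z) > z_limit:
            flagged[node] = z
    return flagged


# Hypothetical CPU temperatures in degrees Celsius for an eight-node cluster.
cpu_temps_c = {
    "n01": 52.0, "n02": 53.0, "n03": 51.0, "n04": 52.0,
    "n05": 54.0, "n06": 53.0, "n07": 52.0, "n08": 38.0,
}

flagged = flag_outliers(cpu_temps_c)
for node, z in flagged.items():
    print(f"{node}: {cpu_temps_c[node]} C, z-score {z:.2f}")
```

In this made-up data set, node `n08` runs well below its peers, so it is flagged even though no reading comes near a high absolute alarm threshold.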
The software collects and correlates a number of environmental conditions, such as CPU temperatures, fan speeds, memory error rates, room temperatures and airflows. Future editions of the software will incorporate Bayesian analysis to allow users to analyze other metrics of their choosing.
The software collects environmental information from the servers by a variety of means. In some cases, it uses the Intelligent Platform Management Interface, an industrywide hardware reporting specification. It also uses vendor-specific metric collecting tools from companies like Hewlett-Packard Co. and Linux Networx Inc. In cases where no commercial metric gathering tools are available, the team crafted scripts that collect data from the sensors and write the results into text files.
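A collection script of the kind described above might look something like the following Python sketch. The record layout, the file path argument and the names `format_record` and `append_record` are assumptions made for illustration; the article does not describe the format Sandia's scripts actually use, and the sensor values here are invented rather than read from hardware.

```python
import time
from pathlib import Path


def format_record(node, readings, timestamp):
    """Build one text line: '<unix time> <node> <name>=<value> ...'."""
    fields = [f"{name}={value}" for name, value in sorted(readings.items())]
    return f"{int(timestamp)} {node} " + " ".join(fields)


def append_record(path, node, readings, timestamp=None):
    """Append one formatted record to a plain text file and return it."""
    ts = time.time() if timestamp is None else timestamp
    line = format_record(node, readings, ts)
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


# Hypothetical readings; a real script would query the node's sensors here.
sample = {"cpu_temp_c": 52.0, "fan_rpm": 4200}
line = format_record("n01", sample, 1169078400)
print(line)
```

Writing one line per sample keeps the files easy to append to from a scheduled job and easy for a separate analysis process to parse later.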
Sandia currently runs Ovis on its Thunderbird system. Thunderbird is an 8,960-node Linux cluster running on servers from Dell Inc. and network gear from Cisco Systems Inc. Capable of performing 53 trillion floating-point operations per second, it placed sixth in last November's Top500 semiannual roundup of the world's most powerful supercomputers.
Pebay said the laboratory is hoping other federal agencies use the software and offer feedback and improvements. The software was issued under the Berkeley Software Distribution open-source license.
Joab Jackson is the senior technology editor for Government Computer News.