Sandia releases cluster management tool

The Energy Department's Sandia National Laboratories has publicly released a cluster management application under an open-source license. The program, called Ovis, can monitor the performance and health of individual computers in a cluster or other networked environment.

Ovis is different from the commercial computational platform monitoring and analysis products in that it offers a statistical approach to determining abnormal performance, said Philippe Pebay, a member of the Sandia technical staff who helped develop the software.

Sandia's labs started working on the software two years ago, following a year of preliminary research into the field of intelligence monitoring. Commercial products monitor for absolute thresholds, which are often set at a high level to prevent false positives. For instance, Intel offers software that alerts administrators when the temperature of a CPU goes above a certain point.

When the researchers first investigated these tools, they found that most of them 'were pretty crude,' Pebay said. 'They were not able to address some of the problems we were facing.'

The CPU heat monitor, for instance, is of only limited help in data center environments, where the ambient temperature fluctuates. Nor does it take into account processors that are running cooler than normal, perhaps because a fan is running constantly. Such a fan may burn out sooner than expected, leaving the CPU to overheat.

Ovis, which runs on Linux, compares individual nodes with one another to derive a statistical norm for all the nodes in a cluster. The software can also display the spatial distribution of a given cluster visually.
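The idea of comparing nodes against a cluster-wide norm, rather than against a fixed limit, can be illustrated with a short Python sketch. This is not Ovis code: the function name, the node names, the temperature readings and the cutoff of two standard deviations are all assumptions chosen for illustration, and a simple z-score stands in for whatever statistics Ovis actually applies.

```python
from statistics import mean, stdev

def flag_outliers(readings, z_limit=2.0):
    """Return {node: z_score} for nodes that sit far from the cluster norm.

    readings: mapping of node name to one sensor value (hypothetical data).
    z_limit:  assumed cutoff, in standard deviations from the mean.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs 2+ nodes
    if sigma == 0:
        return {}  # every node reports the same value, nothing stands out
    scores = {node: (value - mu) / sigma for node, value in readings.items()}
    return {node: z for node, z in scores.items() if abs(z) > z_limit}

# Made-up CPU temperatures in degrees Celsius for an eight-node cluster.
cpu_temps_c = {
    "node01": 60.0, "node02": 61.0, "node03": 59.0, "node04": 60.0,
    "node05": 62.0, "node06": 61.0, "node07": 60.0, "node08": 45.0,
}

flagged = flag_outliers(cpu_temps_c)
print(flagged)
```

With these made-up readings, only node08 is flagged, and it is flagged for running unusually cool. A monitor that only watches for temperatures above a fixed ceiling would not report it.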

The software collects and correlates a number of environmental conditions, such as CPU temperatures, fan speeds, memory error rates, room temperatures and airflows. Future editions of the software will incorporate Bayesian analysis to allow users to analyze other metrics of their choosing.

The software collects environmental information from the servers by a variety of means. In some cases, it uses the Intelligent Platform Management Interface, an industrywide hardware reporting specification. It also uses vendor-specific metric collecting tools from companies like Hewlett-Packard Co. and Linux Networx Inc. In cases where no commercial metric gathering tools are available, the team crafted scripts that collect data from the sensors and write the results into text files.
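A collection script of the kind described might look something like the following Python sketch. It is only an illustration, not one of the team's scripts: it assumes the sensors are exposed through the Linux hwmon sysfs layout (files named temp*_input holding millidegrees Celsius), and the function name, output file name and line format are hypothetical.

```python
import time
from pathlib import Path

def collect_temps(sensor_root, out_file, now=None):
    """Append one 'timestamp sensor degrees_c' line per readable sensor.

    sensor_root: directory holding hwmon*/temp*_input files (assumed layout).
    out_file:    text file to append to (hypothetical name and format).
    now:         optional fixed timestamp, useful for repeatable runs.
    Returns the number of sensor readings written.
    """
    stamp = int(time.time() if now is None else now)
    lines = []
    for path in sorted(Path(sensor_root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
        name = f"{path.parent.name}/{path.name}"
        lines.append(f"{stamp} {name} {millideg / 1000:.1f}\n")
    with open(out_file, "a") as out:
        out.writelines(lines)
    return len(lines)

if __name__ == "__main__":
    # On a Linux host, hwmon sensors normally appear under /sys/class/hwmon.
    count = collect_temps("/sys/class/hwmon", "node_temps.txt")
    print(f"wrote {count} readings")
```

Appending plain text lines keeps the output easy for a separate monitoring process to read and parse later.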

Sandia currently runs Ovis on its Thunderbird system. Thunderbird is an 8,960-node Linux cluster running on servers from Dell Inc. and network gear from Cisco Systems Inc. Capable of performing 53 trillion floating-point operations per second, it placed sixth in last November's Top500 list, the semi-annual ranking of the world's most powerful supercomputers.

Pebay said the laboratory is hoping other federal agencies use the software and offer feedback and improvements. The software was issued under the Berkeley Software Distribution open-source license.

About the Author

Joab Jackson is the senior technology editor for Government Computer News.

