Keep IT up

What would it take for your data center to go off-line? For one
public-sector institution, all it took was some routine cleaning.

In 2000, the Cincinnati State Technical and Community College
was installing new servers and storage as part of an Active
Directory migration. Just before the equipment went live, however,
the janitor plugged in a vacuum cleaner, which was enough to blow
out the transformer serving the entire wing.

It turned out that the science department had also installed
some equipment in the wing, and, unbeknownst to the systems
administrators, the building was maxing out its electrical
capacity, putting its information technology operations
at risk.

Although few data centers are built to withstand a direct
nuclear attack, services should continue despite floods, fires,
hurricanes and blackouts. But outages still strike, and surveys
show that most of the time they stem from acts of incompetence
rather than acts of God.

And individual mistakes are not the only cause.

'The problems in the data center are mostly caused not by
technical issues but by institutional and financial issues,'
said Jonathan Koomey, a scientist at Lawrence Berkeley National
Laboratory and consulting professor at Stanford University who
works on data center power issues. 'Most budgets for IT
equipment are separate from the budget for the facilities, the
infrastructure and the utility bill.'

The key to keeping the data center running is redundancy, both
in terms of equipment and strategy. Here are a few options for
boosting uptime.

Keep current

The first step is to know exactly what power and cooling
systems are in place. Overall energy availability is a problem:
information technology analyst firm Gartner has said half
of all data centers will start running short of power in the next
year.

The issue is not just overall power consumption. Most outages
hit particular pieces of equipment in the data center, not the
entire center. Just as administrators track changes to servers,
power and cooling resources need to be audited regularly.

'A data center is a very evolving and fluid environment,
so there is a lot of change management that data center managers
and [chief information officers] need to be aware of,' said
Elaine Wilde, senior vice president of the public-sector unit at
Lee Technologies, which assesses, designs, builds, maintains,
monitors and relocates data centers for dozens of federal
customers.

'It is very important that you do assessments of your
physical infrastructure, much like you do assessments of your
storage and processing capacity or applications that are
mission-critical to running the business of the agency,' she
said. Wilde recommends doing an annual assessment.
'It's like getting a health checkup.'

Even if the data center was designed with plenty of power and
cooling capacity five years ago, it probably wasn't built
to handle racks of today's servers, which pack more
processors into smaller form factors. As you add
new racks of blades, you may have to recalculate power
requirements.
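The recalculation described above can be sketched in a few lines. All of the numbers here are illustrative assumptions, not vendor specifications:

```python
# Hypothetical sketch: check whether a circuit can carry a new blade rack.
# Figures are assumptions for illustration, not vendor or code-mandated values.

def rack_power_watts(servers_per_rack: int, watts_per_server: float) -> float:
    """Nameplate power draw for one fully populated rack, in watts."""
    return servers_per_rack * watts_per_server

def circuit_headroom(circuit_kw: float, rack_loads_w: list, derate: float = 0.8) -> float:
    """Usable headroom (kW) after applying a derating factor, such as
    the 80 percent continuous-load rule common in U.S. electrical practice."""
    usable_kw = circuit_kw * derate
    return usable_kw - sum(rack_loads_w) / 1000.0

# Example: a 60 kW circuit already feeding two 12 kW racks, plus a
# proposed blade rack of 32 servers at 450 W each.
new_rack = rack_power_watts(32, 450)                      # 14,400 W
headroom = circuit_headroom(60, [12_000, 12_000, new_rack])
print(f"new rack draws {new_rack / 1000:.1f} kW, headroom {headroom:.1f} kW")
```

If the headroom comes out negative, the rack needs a different circuit, not just a different cabinet.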

Managing expectations

Any data center manager wants 100 percent reliability. Building
a top-notch facility costs money, however. The Uptime Institute, an
industry group that offers data center best practices, estimates
that the most reliable facility, what it calls a Tier IV facility,
requires $22,000 of power and cooling infrastructure for every
kilowatt that gets used for processing. And those power needs keep
increasing.
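At that rate, the infrastructure bill adds up quickly. A back-of-the-envelope calculation using the Uptime Institute figure quoted above (the 500 kW IT load is an assumed value for illustration):

```python
# Worked example with the Uptime Institute's quoted figure:
# Tier IV power and cooling infrastructure per kW of IT load.
COST_PER_KW = 22_000      # dollars per kW of processing load

it_load_kw = 500          # assumed IT load for illustration
print(f"${COST_PER_KW * it_load_kw:,}")   # $11,000,000
```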

Rather than build absolute power redundancy into a single data
center and achieve 99.999 percent electrical reliability at
considerable cost, it might be better to have a Tier II or Tier III
primary facility with a backup data center it can fail over to. You
can also target critical parts of the electrical infrastructure
that are cost-effective to address and recognize that you may
experience some downtime.
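A rough illustration of why a pair of cheaper sites can compete with one "five nines" facility. The availability figures below are assumptions for the sake of arithmetic, not official tier definitions, and the calculation ignores failover time:

```python
# Compare expected downtime: one highly redundant site versus two
# independent, cheaper sites that fail over to each other.
HOURS_PER_YEAR = 8760

def downtime_hours(availability: float) -> float:
    """Expected unavailable hours per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

# Single site at "five nines" (99.999 percent):
single = 0.99999

# Two independent sites at an assumed 99.7 percent each; a full outage
# requires both to be down at once:
paired = 1 - (1 - 0.997) ** 2

print(f"single site:  {downtime_hours(single):.2f} h/yr")
print(f"paired sites: {downtime_hours(paired):.2f} h/yr")
```

Under these assumptions the two modest sites together land in the same downtime ballpark as the single five-nines facility, which is the economic argument for the failover approach.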

This is the approach used by the Defense Department's
Aeronautical Systems Center's Major Shared Resource Center
(ASC MSRC) at Wright-Patterson Air Force Base, Ohio, which houses
several supercomputers. The center has an uninterruptible power
supply (UPS) battery system that gives it 20 to 30 minutes
to shut down the servers or ride out a short power outage.
Technical Director Jeff Graham said some of the other MSRCs have
larger battery and diesel generator systems that keep
their systems running.
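A ride-through window like that can be estimated from battery capacity and load. The formula is simply usable energy divided by draw; every figure below is an assumption for illustration, not ASC MSRC's actual configuration:

```python
# Rough UPS ride-through estimate. All values are assumed for illustration.
def ups_runtime_minutes(battery_kwh: float, it_load_kw: float,
                        inverter_efficiency: float = 0.92,
                        usable_fraction: float = 0.8) -> float:
    """Minutes of runtime from usable battery energy at a given IT load."""
    usable_kwh = battery_kwh * usable_fraction * inverter_efficiency
    return usable_kwh / it_load_kw * 60

# A 250 kWh battery string carrying a 400 kW load:
print(f"{ups_runtime_minutes(250, 400):.0f} minutes")
```

With these assumed numbers the string delivers roughly 28 minutes, comfortably inside the 20-to-30-minute window described above.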

'It is becoming very expensive to do that because
these systems are going almost exponential
in terms of the cooling and power they require,'
Graham said. 'We have taken this other
innovative approach to try to bring in the same
kind of availability and reliability without all
the diesel generator activity.'

In March, the ASC MSRC will install a
dual-feed, high-speed transfer switch so that if one of
the large substations goes out, the load will shift to
the other substation. That won't solve the problem
if both substations go down, but it doesn't
have the high costs associated with buying and
maintaining a generator for those rare occasions
when it is needed.

Keeping watch

Power and cooling have traditionally been the
province of facility managers, but data center
managers are gaining a greater ability to monitor
and manage these areas. The use of Power
over Ethernet and Intelligent Building Systems
is the catalyst for that trend. These replace discrete,
proprietary monitoring and control systems
with ones that communicate using the
standard network and protocols.

ASC MSRC, for example, uses the open-source
Nagios network-monitoring program
for its air handlers, chillers and power distribution
units (PDUs), as well as environmental
sensors placed throughout the floor.
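A monitoring check of this kind boils down to comparing a sensor reading against warning and critical thresholds. The sketch below is a hypothetical illustration in the spirit of a Nagios-style plugin, using Nagios's standard exit semantics (0 = OK, 1 = WARNING, 2 = CRITICAL); the thresholds are made up:

```python
# Hypothetical threshold check in the style of a Nagios plugin.
# Nagios interprets plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.

def check_threshold(value: float, warn: float, crit: float):
    """Return (exit_code, message) for a sensor reading."""
    if value >= crit:
        return 2, f"CRITICAL - reading {value} >= {crit}"
    if value >= warn:
        return 1, f"WARNING - reading {value} >= {warn}"
    return 0, f"OK - reading {value}"

# e.g., an inlet-temperature sensor on the raised floor (Celsius):
status, message = check_threshold(31.0, warn=27.0, crit=32.0)
print(status, message)
```

A real deployment would pull the reading over SNMP or a vendor API and let the scheduler handle alert routing; the threshold logic stays this simple.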


There are also commercial products designed
for data center infrastructure. Aperture
Technologies' Vista provides visualization and
real-time monitoring of the data center's physical
infrastructure. American Power Conversion's
Change Manager and Capacity Manager
monitor items such as UPSes and PDUs and
store the data in a centralized database.

In some cases, it works best to outsource the
monitoring. According to Forrester Research,
there is growing interest in sending remote infrastructure
monitoring and management services
to India, but it can also be done locally.

'What it breaks down to is ensuring the appropriate
service level agreements that you
need for all your services from your vendors are
in place and they are correctly and appropriately
maintained,' said Jerry Alexandratos, the Education
Department's acting director of IT
services.

No matter how good the redundancy or monitoring
tools, uptime still comes down to people.
When Gartner surveyed the causes of
downtime a few years ago, only 20 percent
of incidents were attributed to natural disasters or
equipment failure. The rest were caused by people.

Procedures matter

'The operating procedures and practices you
use to run your environment have a much larger
effect than technology on overall availability,'
said John Curran, senior vice president of
ServerVault, which provides hosted services for
government agencies and businesses.

ServerVault runs a Tier III data center with
complete redundancy of power, cooling and
standby equipment. The facility has had 100
percent network and facilities uptime since its
inception in 2001. Curran said that although facilities
and network issues can cause major outages,
they are not the typical culprit. Configuration
errors are far more likely.
'Improving availability is more than just
adding more power or another switch to the
network,' he said. 'For the majority of customers,
it means getting a grip on what is in
their environment and what has changed.'
