Fail safe

Disaster recovery tools help agencies replicate the systems they need to survive a major outage

RFP CHECKLIST: Disaster Recovery

Define your end, not means. For example, specify that if your production site goes down, you want an alternate application service and all associated data to be available within 12 hours, with no more than an hour of lost data.

If one of your critical applications is a database, ask whether the DR system covers all transaction types, including insert, update and delete.

Ask what the system does to mitigate the risk of tape or hard-drive backup failure.

Beware version creep. Spell out the current versions of the applications you need to run and inquire about the process for handling updates.

Gauge replication performance carefully. Some vendors might only provide numbers for asynchronous replication, which is typically faster than synchronous. Ask for throughput rates that reflect peak demand periods, not just averages.

Find out how seamless failover appears to users. Are they disconnected from the network? Will they have to log in again?

Know the difference between failover times for disaster recovery and high-availability situations. The latter tend to be quicker, but do not reflect real DR environments.

If vendor consolidation is a goal in your enterprise framework strategy, look for vendors with the most complete DR suite that, at minimum, includes a variety of backup tools, replication and continuous data protection.

Watch for compatibility with your existing server and storage platforms. Many DR vendors support a variety of hardware brands and operating systems, but some only support a single OS, such as Windows, or their own storage arrays.

Investigate a replication program's failback process, where it returns control to the primary site after failover. Many products do it manually, while a few are automatic.

Beware of vendors who claim certification from your server manufacturer, but only have it for one or two of their DR tools.

Remember that the people expected to run the remote site might be inexperienced. Documentation and training are critical.

BACK IN A FLASH: Symantec's LiveState Recovery supports bare-metal restoration.

Disaster recovery had been a low priority for many agencies until terrorist attacks, anthrax mailings and hurricanes progressively jolted them out of complacency. Now disaster recovery, ensuring IT works uninterrupted, is a key component of continuity of operations.

The technologies for keeping systems online in a catastrophe are essentially the same as for maintaining redundant servers, applications and databases for high availability. But dispersing systems geographically so resources stay online when disaster strikes adds a new wrinkle and a fresh set of IT challenges that experts say agencies can and should overcome.

'Do something,' said Steve Duplessie, senior analyst at Milford, Mass.-based Enterprise Strategy Group. 'It's too cheap and too easy not to start moving data off-site.'

Three pillars of DR

The technologies behind disaster recovery fall into three general areas: backup-and-restore, replication and failover.
How an agency approaches those areas and the infrastructure it eventually deploys depend on how it chooses to characterize 'disaster recovery,' based on the agency's unique mission.

Two common DR benchmarks are recovery point objective and recovery time objective. RPO gets at the issue of acceptable data loss from a failure. If a system goes down, is it acceptable for you to bring it back online with month-old or week-old data? RTO, on the other hand, is about availability: How quickly does a system need to be back up and running?
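The two objectives can be checked mechanically against a recovery event. The sketch below is purely illustrative; the function name, dates and thresholds are invented for the example, with the targets borrowed from the 12-hour/1-hour figures in the checklist above.

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup: datetime, outage_start: datetime,
                     restore_done: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """Check a recovery event against hypothetical RPO/RTO targets.

    Data loss is the gap between the last good copy and the outage;
    downtime is the gap between the outage and full restoration.
    """
    data_loss = outage_start - last_backup   # worst-case lost work
    downtime = restore_done - outage_start   # time users were offline
    return data_loss <= rpo and downtime <= rto

# Example: hourly backups, 12-hour recovery window
ok = meets_objectives(
    last_backup=datetime(2006, 5, 1, 9, 0),
    outage_start=datetime(2006, 5, 1, 9, 45),
    restore_done=datetime(2006, 5, 1, 18, 0),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=12),
)
print(ok)  # True: 45 minutes of lost data, 8.25 hours of downtime
```

A system restored from week-old tape would fail the same check against a one-hour RPO, which is why the objectives must be written down before products are solicited.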

Having identified their systems and spelled out RPO and RTO requirements, agencies can start to solicit products that meet those requirements.

At their most basic level, all disaster recovery solutions include a backup-and-restore infrastructure. In fact, much of today's DR technology debate echoes the thinking behind storage management evolution over the past decade. Agencies that need data backups in case of emergency must weigh the price/performance differences among storage media such as tape, optical disks and, increasingly, cheap Serial ATA hard drives that can function as virtual tape drives. Agencies must consider how these technologies fit into a cost-effective DR strategy while meeting their recovery point and time objectives.

'Tape actually does a pretty good job, and tape technology continues to evolve to keep pace,' said Matt Fairbanks, senior director of product management at Symantec Corp. 'Just about everybody uses tape at the back end,' he said, pointing out that some organizations physically ship tapes to different locations to disperse backups.
But there are limitations. 'When you need to get online quickly, tape is sometimes not the best option,' Fairbanks said.

Related to backup is data replication, which maintains a copy of an application or database so the information is as fresh as you need it to be. The most common type of replication is asynchronous replication, in which the primary system sends data changes to the backup, then proceeds without waiting for proof the information was copied. There's a risk that systems could fall out of sync, but asynchronous replication is faster than the alternative.

In a synchronous replication scheme, the primary system waits for confirmation before proceeding, so each database essentially updates the other, which slows performance but guarantees a true duplicate. This is important for agencies that have stringent recovery point objectives.
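The tradeoff between the two modes can be sketched in a few lines of Python. This is a toy illustration, not any vendor's protocol; the `Replica` class and function names are invented for the example.

```python
import queue

class Replica:
    """A toy backup store that applies changes and acknowledges them."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgment back to the primary

def write_sync(primary: dict, replica: Replica, key, value):
    """Synchronous: wait for the replica's ack before committing."""
    if replica.apply(key, value):   # blocks until the copy is confirmed
        primary[key] = value        # both sides now hold the same data

def write_async(primary: dict, replica_queue: queue.Queue, key, value):
    """Asynchronous: commit locally, ship the change later."""
    primary[key] = value            # proceed immediately, no waiting
    replica_queue.put((key, value)) # change may still be in flight

def drain(replica_queue: queue.Queue, replica: Replica):
    """Deliver queued changes; a real product runs this continuously."""
    while not replica_queue.empty():
        replica.apply(*replica_queue.get())
```

If the primary fails before the queue is drained, the queued changes are lost, which is exactly the in-flight data risk described below.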

Mind the bandwidth

Synchronous replication can be expensive to run over wide-area networks because of the bandwidth required to maintain acceptable performance. 'Keeping a hot site up and running at full speed can be very costly,' said Joe Gentry, global vice president at Software AG Inc. of Reston, Va., which sells replication software for its Adabas database management system.

Dan Miller, information systems manager at Argonne National Laboratory, likens replication to redundant arrays of inexpensive disks. 'If you've got a single disk, you better get a really good disk,' Miller said. 'Replicated servers are like RAID-ed servers.'

Still, experts say, it's important for agencies to understand the type of replication they have and how it could impact availability in a continuity of operations or disaster recovery situation. Asynchronous replication could leave blocks of data unavailable if they were en route to the replicated system when the outage occurred, said Terry Stowers, a senior storage technology specialist at Microsoft Corp. 'That may be OK,' he added. 'The important thing is to know ahead of time.'

Finally, disaster recovery typically includes a failover function, in which services automatically switch from a failed system to a replicated system or site. These days, failover servers in a high availability (HA) configuration can act like a single computer, effectively a DR spin on clustering. Storage for HA clusters can either be shared (creating a single point of failure, which could be a drawback), or replicated on independent hardware linked via an IP network.
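At its core, automatic failover rests on detecting a missed heartbeat from the primary and promoting the standby. A minimal sketch follows; the timeout value and function name are assumptions for illustration, not taken from any product.

```python
HEARTBEAT_TIMEOUT = 3.0  # seconds; an assumed threshold, tuned per RTO

def monitor(last_heartbeat: float, now: float) -> str:
    """Decide whether the standby should take over.

    A production HA cluster would also fence the failed node so both
    sides cannot act as primary at once (the 'split brain' problem).
    """
    if now - last_heartbeat > HEARTBEAT_TIMEOUT:
        return "failover"   # promote the replicated standby
    return "healthy"        # primary is still answering
```

The shorter the timeout, the faster users are moved to the standby, but the greater the risk of a false failover during a transient network hiccup.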

Increasingly, HA includes servers clustered not merely on-site over LANs, but across hundreds of miles in a WAN configuration, a technological stretch that brings risk with reward. With wide-area clustering, Fairbanks said, 'literally, at the click of a button we can move an entire service, an application and everything associated with it, to an alternate site.'

Failover can be manual or automatic, depending on needs and budgets. It can also be set up to happen imperceptibly. Miller said when his staff at Argonne National Lab piloted WANSyncHA Oracle replication/failover software from XOsoft Corp. of Waltham, Mass., a test database failed twice without anyone knowing.

'We weren't really aware it had happened until we reviewed logs later,' Miller said. 'There are huge economies in replication systems, because if you've got a single point of failure, you've got to throw the budget out the window to ensure it stays up.'

High-availability replication and clustering technologies are the most expensive disaster recovery options, and thus overkill for some agencies. That said, vendors insist that they're the preferred DR techniques for agencies with strict security and availability requirements, such as the Homeland Security Department and Social Security Administration. XOsoft, for example, says 11 federal agencies, including the Labor Department, use its WANSync and WANSyncHA asynchronous replication.

'You need to have geographic separation of copies of your data,' said Symantec's Fairbanks. 'It's not good enough to have a copy sitting on a storage array that's sitting next to your primary server.'

Not so simple

Actually pulling off wide-area disaster recovery, though, can be tricky. For example, third-party DR clustering tools are sometimes required to supplement the cluster technology that comes with major operating systems. Microsoft Windows Server 2003 comes with its own Cluster Service, but users often augment it with products such as SteelEye Technology's LifeKeeper or Double-Take Software's Double-Take tools, because Windows' shared-storage design can be a challenge for WAN replication.

There are two types of DR clusters: active-active and active-passive. Microsoft Cluster Service is an example of the former, while the Neverfail and XOsoft clusters are active-passive, according to John Posavatz, vice president of product management at Neverfail, which counts the Marine Corps among its customers. 'With active-passive, only one of your servers is actually doing anything for users,' Posavatz said. 'That passive system is basically ... a hot spare.'
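The distinction Posavatz draws can be sketched as a request-routing decision. This toy `route` function is hypothetical and ignores real-world concerns such as session state and node fencing.

```python
def route(request_id: int, servers: list, mode: str) -> dict:
    """Pick which cluster node serves a request.

    active-active:  every surviving node shares the load.
    active-passive: one node does all the work; the rest sit
                    idle as hot spares until it fails.
    """
    live = [s for s in servers if s["up"]]
    if not live:
        raise RuntimeError("no surviving nodes")
    if mode == "active-active":
        return live[request_id % len(live)]  # spread load across nodes
    return live[0]  # passive spares wait until the active node dies
```

In the active-passive case the spare contributes nothing until failover, which is why it is, as Posavatz puts it, 'basically a hot spare.'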

In replication and failover scenarios, whether clustered or not, applications present another set of DR challenges.

'E-mail has probably become the most important application for business organizations and government to protect in the event of an outage,' said Bob Williamson, vice president of products at SteelEye Technology. But because of network, server and operating system dependencies, applications don't always come up easily on remote systems, a problem that is compounded when workers try to recreate their office setups at home.
