COOP: Cover your IT bases
Disaster recovery guidance is often vague.
- By David Essex
- Sep 19, 2006
The disasters, man-made and natural, of the past five years have proved the need for continuity-of-operations planning to keep basic government services running after the displacement of facilities and the people who work in them. Federal and state agencies have long followed the COOP drill: Identify essential operations, and establish clear lines of authority for carrying them out after a disruption of normal operations.
Beyond that, federal guidance is conspicuously, and perhaps deliberately, vague. It sets goals but offers no specific information technology strategies for achieving them. For example, the main COOP regulation, Federal Preparedness Circular 65, requires agencies to establish alternate facilities containing 'essential equipment' and 'interoperable communications,' such as phone, fax and Internet, plus electronic and hard-copy versions of vital records and databases. It provides almost no instruction on how to implement the underlying IT infrastructure.
Agency heads and government oversight groups are realizing they must put more meat on those bones. The emphasis is on moving beyond protecting data and applications to the networking and communication channels needed to tie people to information, and to each other.
'If they're already teleworking, they're kind of already set up with a COOP plan,' said Tom Simmons, vice president for federal systems at Citrix Systems Inc. of Fort Lauderdale, Fla., which makes software for remote access to applications.
Advice sought
GCN asked COOP vendors, analysts and users for their advice on building the types of systems agencies need to stay up and running in an emergency.
Redundant network infrastructure is not just a nice idea; it's a mandate of the federal Office of Management and Budget, which issued a June memorandum requiring agencies with telework policies to follow National Communications System directives that call for diversity in both networks and physical entries to buildings.
The infrastructure is best treated as a series of interdependent layers, according to John Speicher, industry solution manager at Cisco Systems. Speicher said Cisco views the resilience of three major technology layers (starting at the bottom with the data network, followed by applications, then communication systems including voice) as underpinning the top, or workforce, layer. 'They build on each other as a stack,' Speicher said, adding that each provides a type of insurance that should never be cut for short-term gain. 'You don't need the bottom rung until you're at the bottom of the ladder,' he said.
Simmons recommends making sure you have enough secure virtual private network ports that can quickly be brought online. For agencies that don't routinely support remote access, Secure Sockets Layer VPNs would be better in emergencies than the alternative, IPSec, because the latter requires remote client software that can be difficult to set up in an emergency. Tools that help accommodate the likely spike in traffic, such as WAN optimization products, should also be considered.
Agencies increasingly see wireless networks (primarily WiFi but eventually the wide-area alternative, WiMax) as a backup or possibly a replacement for wired networks and phone systems.
Another option is local multipoint distribution service, or fixed wireless, which uses part of the 28-GHz microwave spectrum and can broadcast wireless Ethernet seven to 10 miles from a single station running at up to 622 megabits per second.
The largest U.S. license holder, Reston, Va.-based Nextlink Wireless Inc., a spinoff of XO Communications, is targeting its products to agencies seeking alternate network pipes for disaster recovery and COOP.
Linking up
In a widespread disaster, for example, one Nextlink station could be placed next to the nearest operational fiber link, beaming the signal to a second station that broadcasts Ethernet access in the affected area. The main requirement is that the stations have clear lines of sight.
According to John Grady, Nextlink's director of marketing, at least one federal agency is already doing this. 'We have a 100-Mbps link that actually crosses the Potomac River from downtown D.C. to Virginia,' Grady said. 'The plan all along was to use it for redundancy.' But the agency unexpectedly ended up using the link as its primary connection after its local exchange carrier went down.
'We're not necessarily going to connect people in their homes,' Grady said. 'We're more of a carrier's carrier.' In fact, he said, one major cellular provider is using Nextlink technology to ensure connectivity in hurricane-prone Florida, and agencies in the defense and intelligence communities have approached the company about its services.
Disaster-recovery technologies, such as data replication, mirroring, backup and continuous data protection, are critical for keeping IT infrastructure accessible to agency personnel [for an in-depth look at the technologies, see RFP Essentials, GCN.com].
But while enterprises used to think in terms of moving tape backups around in the event of a disaster, true continuity requires faster approaches.
'As the volumes of data continue to increase, agencies often find their recovery time objective is simply not achievable,' said Jim Grogan, vice president of consulting product development at SunGard Data Systems of Wayne, Pa., a continuity services provider. 'Lots of customers choose disk-to-disk replication from their site to ours,' Grogan said. 'And you no longer look at that architecture as part of a disaster-recovery solution. It's part of everyday operations.'
Simon Mingay, a vice president of research at Gartner Inc. of Stamford, Conn., agrees that such high-availability technologies are best for COOP. Once an agency has identified the IT basis for its critical business processes, 'you're going to have those parts of the data and those applications on spinning media,' Mingay said. 'You simply don't have time to do the restore [from tape].'
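The arithmetic behind Grogan's and Mingay's point can be sketched in a few lines. The following is a minimal illustration, not a sizing tool: the function names, the 20-terabyte data set, the 80-MB-per-second sustained restore rate and the 24-hour recovery time objective are all hypothetical figures chosen for the example, not numbers from any agency or vendor.

```python
def restore_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to restore data_tb terabytes at a sustained throughput.

    Uses decimal units (1 TB = 1,000,000 MB). Both inputs are assumptions
    supplied by the planner, not measured values.
    """
    data_mb = data_tb * 1_000_000
    return data_mb / throughput_mb_per_s / 3600


def meets_rto(data_tb: float, throughput_mb_per_s: float,
              rto_hours: float, overhead_hours: float = 0.0) -> bool:
    """True if the restore, plus fixed overhead such as retrieving and
    mounting media, fits inside the recovery time objective."""
    total = restore_hours(data_tb, throughput_mb_per_s) + overhead_hours
    return total <= rto_hours


if __name__ == "__main__":
    # Hypothetical scenario: 20 TB of critical data, 80 MB/s, 24-hour RTO.
    hours = restore_hours(20, 80)
    print(f"Estimated restore time: {hours:.1f} hours")
    print("Meets 24-hour RTO:", meets_rto(20, 80, 24))
```

Under those assumed figures the restore alone runs well past the objective, which is the situation Grogan describes. Data already replicated to disk at the secondary site avoids the bulk transfer entirely.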
Speicher said a COOP-quality data center is a 'hardened' site served by redundant telecommunications lines and located far away from the region of the primary site.
He recalled the advantage a Cleveland insurance company had over similar companies affected by a widespread power failure in the Northeast and nearby Midwestern states.
While the others' seemingly safe secondary sites, which were still inside the affected area, went down, this company's Colorado site kept running. While he questions the wisdom of the replication strategies of many federal agencies that locate their secondary sites just beyond the edge of the capital region in, for example, West Virginia, he said most agencies have done a good job of making their data centers disaster-resistant. They are making progress on data networks but still have far to go on building the communications and workforce components.
Cooperate, reciprocate
Outsourcing secondary data centers and alternate work sites to disaster recovery specialists, such as SunGard and telecom carrier AT&T, is common enough. But the General Services Administration has been encouraging agencies to enter cooperative agreements to share their own facilities, or offer them reciprocally.
State and local governments are also getting into the act. 'We are certainly seeing that a lot now,' said Tushar Mutreja, Citrix's senior marketing manager for state and local government, citing Clark County and Carson City, Nev., as an example. 'They have an SLA, and they're creating hot sites right now for each other,' he said.
Likewise, the Florida Guardian ad Litem Office, a state agency set up to advocate for abused and neglected children, found that its own agreement with county governments came in handy when Hurricane Wilma struck in 2005. The agency, which has a staff of roughly 500 who coordinate 4,600 volunteer advocates, had installed Citrix Access Suite to provide browser access to servers on the state data network that hold case files. GAL also had a COOP plan in place that specified data backup procedures and IT contacts, according to chief information officer Johnny White.
After Wilma blew the roof off the Broward County courthouse, approximately 45 people were forced to work from home or at county libraries. 'It took about a week until we could communicate with everybody again,' White said. 'The counties do all of our local IT support in the local offices. They just get us to the Internet by any means, really, and we take care of the user's data and make sure they have access to it.' He said some counties are piloting WiFi access in courthouses for such high-priority users as judges and attorneys, and have arrangements with SunGard facilities. He also intends to buy low-cost office PCs and pre-position them in the safest areas as alternate work sites.
Telecommuting, now more commonly called telework, has always been critical to COOP workforce mobility strategies. Speicher said the recent planning for government continuity in an avian flu pandemic has only heightened interest in telework, which can minimize contagion by separating workers at home. 'Alternate sites don't work, because the whole point is isolation,' Speicher said.
Grogan said agencies normally struggle to maintain up-to-date records of personnel access levels and authentication for teleworking. Such hurdles are only compounded in an emergency. 'In a disaster, 80 to 90 percent of your workforce may not be able to come to work,' Simmons said. 'Agencies that understand more about the authentication technology for telework are very well prepared.'
Mingay estimates that in a widespread disaster, communication channels that normally run at 60 percent of capacity may be stressed to 120 percent. 'E-mail gets absolutely hammered,' he said. 'Web sites get hammered for the same reason.'
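Mingay's estimate implies that demand roughly doubles in a widespread disaster. A rough sketch of the headroom calculation follows. The function names are illustrative, the surge factor of 2 is inferred from his 60-to-120 percent figures, and the 80 percent target utilization is an assumption made for the example.

```python
def surge_utilization(normal_utilization: float, surge_factor: float) -> float:
    """Utilization of existing capacity once demand is multiplied by surge_factor."""
    return normal_utilization * surge_factor


def capacity_multiplier(normal_utilization: float, surge_factor: float,
                        target_utilization: float = 1.0) -> float:
    """How many times today's capacity is needed to keep the surge at or
    below target_utilization (1.0 means completely full)."""
    return surge_utilization(normal_utilization, surge_factor) / target_utilization


if __name__ == "__main__":
    # 60 percent normal load and a doubling of demand, per Mingay's estimate.
    peak = surge_utilization(0.60, 2.0)
    print(f"Surge load on existing capacity: {peak:.0%}")
    # Assumed planning target: keep links at or below 80 percent during the surge.
    needed = capacity_multiplier(0.60, 2.0, target_utilization=0.80)
    print(f"Capacity needed relative to today: {needed:.2f}x")
```

The same calculation applies to any shared channel, whether e-mail servers, Web sites or VPN concentrators, as long as the planner can estimate normal load and the likely surge.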
Agencies also need to think beyond mere connectivity when they approach telework as a COOP component. For instance, network administrators who are normally in charge of maintaining security keys might not be available, Grogan said, so knowledge management portals could come in handy for finding and reaching the people who can answer specific questions.
Tabletop simulation
When an agency has COOP technology in place, it's important that the overall plan work as envisioned. Speicher recommends testing IT response plans in a realistic environment, or so-called tabletop simulation, to identify gaps in the infrastructure needed to keep key personnel in touch wherever they may end up after a disruption.
And COOP testing isn't just about IT. Mingay said a revealing exercise is to have the manager test the team's adaptability and preparedness by unexpectedly pulling several members out of a room. A breakdown might indicate the need for cross-training and succession planning so members can do each other's jobs if necessary. IT's ability to uncover single points of failure in technology should also apply to IT people, 'to ensure that you can continue if one or two people aren't around,' Mingay said.
David Essex is a freelance technology writer based in Antrim, N.H.