Lesson from Amazon's cloud crash: Plan for the worst
Service failure showed what can go wrong, but one government site showed how to be prepared
- By Kevin McCaney
- Apr 28, 2011
This article has been updated from its original version to include a reference to Amazon's explanation of the failure.
Proponents of cloud computing might have felt a slight chill up their spines recently, when a failure hit Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2), sending several popular social media sites and one Energy Department site, among others, into the dark.
This was no blip. Energy’s OpenEI.org site, an open, Semantic Web collaboration platform for sharing work on clean energy, was down nearly two full days. Sites such as Reddit, Foursquare and Quora weren’t down quite as long, but they were still dark for a day or more, an eternity for social networking sites where users post and consume information from minute to minute. For some, it was panic-as-a-service.
Amazon has restored EC2’s lost services and offered a detailed explanation of what went wrong. The company also is offering 10-day credits to affected customers. Knowing the actual cause of the failure will no doubt be useful, and the credits are a nice gesture. But the bottom line (and for government and private-sector websites, uptime is the bottom line) is that sites that depend on those cloud services were out of commission for a long time.
So what does this mean for agencies going to the cloud? Does the Amazon crash prove that the cloud is a crapshoot for any organization that takes its services and data seriously? Will agency applications in the cloud forever be at the mercy of remirroring run amok?
The short answer is almost certainly no. The longer answer is still no, with a caveat: the EC2 crash is a cautionary tale, a reminder that moving to the cloud involves the same prep work, attention to detail and contingency planning that go into any other critical network.
Case in point: Recovery.gov, which is hosted at the AWS Northern Virginia data center where the failure occurred, remained in operation throughout. The reason? The Recovery Accountability and Transparency Board, which runs Recovery.gov, had a backup plan: an agreement with Amazon to move its operation to another location if trouble cropped up, according to a report in InformationWeek.
One lesson from the crash is that cloud services won’t always be perfect. But the better lesson is the importance of a contingency plan.
Agencies are increasingly moving operations to the cloud, and for good reasons. Cloud computing frees data center space, cuts maintenance costs and power use, increases the availability of systems for mobile users, and, above all, saves money. Also, the Office of Management and Budget has decreed it, requiring agencies to move three applications to the cloud in the next 12 to 18 months.
But although cloud services can make some things easier for agencies, getting there isn’t a snap. Even moving e-mail systems, which are generally deemed the lowest of the low-hanging fruit for cloud migration, is fraught with pitfalls, as Rutrell Yasin reports in this issue.
The best approach, experts say, is a careful, thorough one. The impact of Amazon’s EC2 crash — which sites went dark and which stayed up — proves the point.
Kevin McCaney is a former editor of Defense Systems and GCN.