COMMENTARY

Lesson from Amazon's cloud crash: Plan for the worst

Service failure showed what can go wrong, but one government site showed how to be prepared

This article has been updated from its original version to include a reference to Amazon's explanation of the failure.

Proponents of cloud computing might have felt a slight chill up their spines recently, when a failure hit Amazon Web Services’ (AWS) Elastic Compute Cloud (EC2), knocking several popular social media sites and one Energy Department site, among others, offline.

This was no blip. Energy’s OpenEI.org site, an open, Semantic Web collaboration platform for sharing work on clean energy, was down nearly two full days. Sites such as Reddit, Foursquare and Quora weren’t down quite as long, but they were still dark for a day or more, an eternity for social networking sites where users post and consume information from minute to minute. For some, it was panic-as-a-service.

Amazon has restored EC2’s lost services and offered a detailed explanation of what went wrong. The company also is offering 10-day credits to affected customers. Knowing the actual cause of the failure is no doubt useful, and the credits are a nice gesture. But the bottom line — and for government and private-sector websites, uptime is the bottom line — is that sites that depend on those cloud services were out of commission for a long time.


Related story:

Amazon cloud crash keeps Energy site off-line


So what does this mean for agencies going to the cloud? Does the Amazon crash prove that cloud is a crap shoot for any organization that takes its services and data seriously? Will agency applications in the cloud forever be at the mercy of remirroring run amok?

The short answer is almost certainly no. The longer answer is no, but the EC2 crash is a cautionary tale, a reminder that moving to the cloud involves the same prep work, attention to detail and contingency planning that go into any other critical network.

Case in point: Recovery.gov, which is hosted at the AWS Northern Virginia data center where the failure occurred, remained in operation throughout. The reason? The Recovery Accountability and Transparency Board, which runs Recovery.gov, had a backup plan, an agreement with Amazon to move its operation to another location in the event that trouble cropped up, according to a report in InformationWeek.
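The details of the board’s arrangement weren’t published, but the underlying pattern is simple: watch the primary site’s health and shift traffic to a standby in another location when the primary stops responding. The Python sketch below is a hypothetical illustration of that pattern only; the endpoint URLs and the redirect_traffic_to() hook are placeholders, not Recovery.gov’s or Amazon’s actual configuration, and in practice the switch would more likely be a DNS or load-balancer change negotiated with the provider.

    # Hypothetical sketch of a primary/standby health check with failover.
    # URLs and the redirect_traffic_to() hook are placeholders; a real setup
    # would swap in the provider's own failover mechanism.
    import time
    import urllib.error
    import urllib.request

    PRIMARY = "https://primary.example.gov/healthcheck"  # assumed endpoint
    STANDBY = "https://standby.example.gov/"              # assumed standby site
    CHECK_INTERVAL = 30     # seconds between probes
    FAILURE_THRESHOLD = 3   # consecutive failures before failing over


    def is_healthy(url, timeout=5.0):
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False


    def redirect_traffic_to(standby_url):
        """Placeholder: point DNS or a load balancer at the standby site."""
        print("FAILOVER: directing traffic to " + standby_url)


    def monitor():
        failures = 0
        while True:
            if is_healthy(PRIMARY):
                failures = 0
            else:
                failures += 1
                if failures >= FAILURE_THRESHOLD:
                    redirect_traffic_to(STANDBY)
                    break
            time.sleep(CHECK_INTERVAL)


    if __name__ == "__main__":
        monitor()

The failure threshold makes the same call the board’s agreement made in contract form: decide before the outage what counts as “down” and what happens next.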

One lesson from the crash is that cloud services won’t always be perfect. But the better lesson is the importance of a contingency plan.

Agencies are increasingly moving operations to the cloud, and for good reasons. Cloud computing frees data center space, cuts maintenance costs and power use, increases the availability of systems for mobile users, and, above all, saves money. Also, the Office of Management and Budget has decreed it, requiring agencies to move three applications to the cloud in the next 12 to 18 months.

But although cloud services can make some things easier for agencies, getting there isn’t a snap. Even moving e-mail systems, which are generally deemed the lowest of the low-hanging fruit for cloud migration, is fraught with pitfalls, as Rutrell Yasin reports in this issue.

The best approach, experts say, is a careful, thorough one. The impact of Amazon’s EC2 crash — which sites went dark and which stayed up — proves the point.

About the Author

Kevin McCaney is editor of Defense Systems. Follow him on Twitter: @KevinMcCaney.

Reader Comments

Tue, May 3, 2011 madhtr

I've been in IT for 15 years ... I know several different programming languages (including C++), I'm a network administrator, and it's beyond me why so many of you are willing to hand the power of IT back to the big guys after we secured it for ourselves with the advent of the PC. This is but a sample of the "bad" things to come if we continue down the path of handing our control over to "the cloud". Say goodbye to your DIY datacenter, and hello to the chains of your ISP.

Mon, May 2, 2011 FR Alexandria Va

The operative word here is PLAN, or, better yet, architect the landscape of the problem you are committing your organization to. Most successfully this has been done using the discipline of EA or whatever you may call it, and the irony of it is that in government most agencies have accomplished lots of this work already. They should use it to protect their rear end and reduce risks.

Fri, Apr 29, 2011 Sean GA

Kevin, not all sites were completely down on a long-term basis; the long-term outages affected database capabilities. Just like you said, it is all a crap shoot, but when you pay for quality employees, they don't drag down the entire cluster with an incorrect network configuration change. Was it really worth all this hassle to save a few bucks by switching to the largest webhost, in retrospect? If this is the cloud (I loathe the term "cloud") and your workstation ran from Amazon EC2 or EDS services, how long would your CIO have a job after this outage? Assuming this is a small business of 5,000 employees, my guess is the CEO would be scouting a new CIO during the outage and reconsidering Amazon's services. Point proven, indeed.
