Amazon's major cloud outage teaches valuable lessons
Amazon outage isn't all doom and gloom for cloud computing
Every time there’s a major cloud-related outage, someone questions whether government agencies should be putting their trust and data in the technology. Those critics got a little louder after an April 21 outage of Amazon Web Services' Elastic Compute Cloud, which brought down the Energy Department's OpenEI.org, a collaboration platform for people who work on clean-energy solutions, among other sites. The site was down for almost two days, along with the popular social networking sites HootSuite.com, Reddit and Quora, and the outage took about five days to resolve completely.
However, an examination of what happened should help quiet those voices, say experts, who are quick to point out that there’s no reason to fear the cloud. In fact, the outage is a perfect illustration of how most customers could have avoided that fate entirely. Amazon’s problem started during a routine network upgrade. A configuration error caused a large number of Elastic Block Store (EBS) volumes to become unable to service read and write operations, according to a report released by the company.
“When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas," according to the company's report. "When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could remirror data.… The free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes ‘stuck’ in a loop, continuously searching the cluster for free space.” That led to what Amazon calls a remirroring storm. A large number of volumes were effectively stuck while the nodes searched the cluster for the storage space they needed for their new replica.
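Amazon’s description of the failure can be reduced to a toy model. The sketch below is illustrative only; the function, numbers and names are invented for this article, not drawn from Amazon’s actual EBS implementation. The key dynamic is that when many volumes lose their replicas at once, they all compete for the cluster’s remaining free space, and whatever cannot be placed is left stuck retrying.

```python
# Toy model of a "remirroring storm" (illustrative only; not Amazon's
# actual EBS implementation). Orphaned volumes race for free replica
# slots; once free capacity runs out, the remainder are stuck in a
# retry loop, which is what Amazon's report describes.

def remirror(orphaned_volumes: int, free_slots: int) -> tuple[int, int]:
    """Return (remirrored, stuck) counts after one storm."""
    remirrored = min(orphaned_volumes, free_slots)
    stuck = orphaned_volumes - remirrored
    return remirrored, stuck

# Routine failures fit comfortably inside the cluster's headroom...
print(remirror(orphaned_volumes=50, free_slots=100))   # (50, 0)
# ...but a correlated failure of most nodes overwhelms it.
print(remirror(orphaned_volumes=900, free_slots=100))  # (100, 800)
```

The point of the model is that spare capacity sized for independent, scattered failures is no defense against a correlated failure that orphans most of a cluster at once.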
Planning for disaster
Diagnosing the outage is important because users who had planned ahead and built their cloud presence so that data resided in multiple regions or in different availability zones, which map to distinct data centers within the same region, were able to avoid an outage or restore their services quickly. “The lessons learned are that cloud — like any other complex system — must also be architected for availability, there is a cost for that high availability, and human error is always a factor,” said Mitchell Ummel, director of the Cutter Consortium’s Government Public Sector Practice.
Ummel said there’s good news that many media outlets glossed over: the majority of Amazon’s customers were unaffected. “We read about big customers such as Netflix that didn’t trust one availability zone and were able to fail over to another availability zone or another private or public cloud instance,” he said. “Yes, there’s an added cost to that, but when a system is crucial, people will pay for it.”
Another lesson that everyone should take away from the Amazon outage is that no system, cloud or otherwise, can guarantee 100 percent uptime. Agencies should only agree to service level agreements when they are comfortable with the anticipated response to a cloud service failure. Questions about data loss must be asked during any SLA discussion because, as some of Amazon’s customers discovered, a widespread outage occasionally leads to data loss. The company reported that 0.07 percent of the data stored in the affected availability zones was not fully recoverable.
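A bit of back-of-the-envelope arithmetic shows why the uptime figures in an SLA deserve scrutiny. The percentages below are common SLA tiers chosen for illustration, not terms from Amazon’s actual agreement.

```python
# SLA arithmetic: how much annual downtime a given uptime guarantee
# still permits. (Illustrative tiers; not Amazon's actual SLA terms.)

HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours of downtime per year an uptime SLA still permits."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.95):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows {hours:.2f} hours/year of downtime")

# A two-day site outage (~48 hours) blows through a 99.9% annual
# budget (about 8.76 hours) many times over.
print(allowed_downtime_hours(99.9) < 48)  # True
```

The arithmetic cuts both ways: an SLA credit is compensation after the fact, not a guarantee the outage will not happen, which is why the article’s experts stress asking about failure response and data loss up front.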
Put in perspective, Amazon’s outage might go down in history as one of the best things to happen to cloud computing strategies, said Deniece Peterson, senior manager of federal industry analysis at research firm Input. “It raises awareness and gets people thinking about control — who has it, how to share it with a vendor, and how to avoid problems in the future.”