Did Human Error Cause The Amazon Cloud Computing Outage?
“Only two things are infinite, the universe and human stupidity, and I’m not sure about the former.”
- Albert Einstein (1879-1955), famous scientist and Time’s Person of the Century.
Amazon would like everyone to believe that quote. By blaming human error for the outage that left online businesses like Reddit and Foursquare hanging, Amazon can deflect flak from its technology. Concerns remain, however, and despite Amazon’s reassurances and its offer of a 10-day credit, there are valuable lessons to be learned here (See: Lessons from the Amazon Cloud Outage).
Eight long days after the incident, considered the biggest failure in the short history of cloud computing, Amazon posted a detailed explanation on its website (See: Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region). While the document answered some questions, it raised other important ones.
The first item up for discussion is Amazon’s allusion to human error as the cause of the outage. While it did not say so explicitly, it did mention “a network configuration change”, which can safely be read as a manual mistake made during a network adjustment, such as choosing the wrong menu option or entering the wrong network name. In other words, a human error, and hence susceptible to human frailties like forgetfulness and lapses in concentration. This gives rise to the obvious question, “What is the guarantee it won’t happen again?”
What happened that day is made clear in the aforementioned explanation. The primary network serving one of the four availability zones in EC2’s US East-1 data center needed more network capacity. The “network configuration change” attempted in order to provide this extra capacity mistakenly routed traffic from the primary network onto a lower-capacity secondary network.
Now, these networks connect the individual node computers that store, manage and back up customer data, combining to form clusters. By the time the mistake was detected and the primary network restored, too much traffic had already accumulated for these nodes to handle. Some nodes were left continuously searching for free storage space, clogging up the system and blocking new traffic. Gradually, websites that depend on Amazon’s cloud storage services began to malfunction.
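That “continuously searching” behavior is, in effect, an unbounded retry loop: every stuck node keeps asking the network for free space, adding to the very congestion that is starving it. Below is a minimal sketch, in Python, of the safer pattern of capped retries with exponential backoff and jitter; the Cluster class, node names and capacities are invented for illustration and are not Amazon’s actual storage code.

```python
import random
import time


class Cluster:
    """Toy stand-in for a storage cluster: maps node names to free space in GB."""

    def __init__(self, free_space_by_node):
        self.free_space_by_node = free_space_by_node

    def node_with_capacity(self, size_gb):
        for node, free in self.free_space_by_node.items():
            if free >= size_gb:
                return node
        return None


def find_free_space(cluster, size_gb, max_attempts=5, base_delay=0.5):
    """Look for a node with enough free space, backing off between attempts
    instead of retrying continuously and adding to the congestion."""
    for attempt in range(max_attempts):
        node = cluster.node_with_capacity(size_gb)
        if node is not None:
            return node
        # Exponential backoff with jitter (0.5s, 1s, 2s, ... plus noise), so
        # thousands of stuck nodes do not all retry at the same instant.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
    # Give up and surface the failure rather than clogging the network forever.
    raise RuntimeError(f"no node with {size_gb} GB free after {max_attempts} attempts")


if __name__ == "__main__":
    cluster = Cluster({"node-a": 50, "node-b": 200})
    print(find_free_space(cluster, 100))  # prints "node-b" on the first attempt
```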
Although Amazon tried to address the issue by disabling new requests for storage space, it was too late to prevent node failures, which led to a domino effect. Over the next few days, Amazon added storage capacity and modified its storage algorithms to bring its systems back up to speed.
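Disabling new requests is essentially a kill switch placed in front of the storage service. A toy sketch of that idea, assuming a simple flag an operator can flip, follows; the class and method names are hypothetical, not Amazon’s API.

```python
class StorageFrontEnd:
    """Toy front end that can shed new allocation requests during an incident."""

    def __init__(self):
        self.accepting_new_volumes = True

    def disable_new_volumes(self):
        # Operator-flipped kill switch: existing volumes keep working,
        # but no new space is handed out while the cluster recovers.
        self.accepting_new_volumes = False

    def create_volume(self, size_gb):
        if not self.accepting_new_volumes:
            raise RuntimeError("new volume creation is temporarily disabled")
        return {"size_gb": size_gb, "status": "creating"}


if __name__ == "__main__":
    front_end = StorageFrontEnd()
    print(front_end.create_volume(10))  # accepted under normal operation
    front_end.disable_new_volumes()
    try:
        front_end.create_volume(10)     # rejected while the switch is off
    except RuntimeError as err:
        print(err)
```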
However, Amazon’s explanation of the failure has raised some eyebrows. Why, experts ask, wasn’t there a check in the software to catch such a simple mistake before it snowballed into such a monumental failure?
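One such check could be as simple as comparing the capacity of the network a change would shift traffic onto against the traffic currently in flight. Here is a hedged sketch of that kind of pre-change guard; the function name, fields and figures are illustrative assumptions, not Amazon’s actual tooling.

```python
def validate_route_change(current_traffic_gbps, target_network):
    """Refuse a routing change if the target network cannot absorb the
    traffic currently flowing over the primary network."""
    if target_network["capacity_gbps"] < current_traffic_gbps:
        raise ValueError(
            f"target '{target_network['name']}' offers "
            f"{target_network['capacity_gbps']} Gbps but "
            f"{current_traffic_gbps} Gbps is in flight; aborting change"
        )
    return True


if __name__ == "__main__":
    secondary = {"name": "secondary-net", "capacity_gbps": 10}
    try:
        validate_route_change(current_traffic_gbps=40, target_network=secondary)
    except ValueError as err:
        print(err)  # the change is blocked instead of being silently applied
```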
The second item up for discussion is Amazon’s traffic management. Some clients, like Sharefile and Bizo, managed to ride out the crisis by shifting their traffic to Amazon’s West Coast data center when they found their servers failing. While Amazon does offer this option to its clients, ideally it should make such shifting of traffic a part of standard operating procedure. This would undoubtedly degrade performance as more traffic competes for fewer resources, but a total breakdown could be avoided.
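In practice, clients like these implement the shift themselves by health-checking endpoints in more than one region and falling back when the primary stops answering. A minimal client-side sketch is below; the endpoint URLs are placeholders, and this is not Amazon’s own failover mechanism.

```python
import urllib.request

# Hypothetical health-check endpoints for the same service in two regions.
REGION_ENDPOINTS = [
    "https://us-east.example.com/health",
    "https://us-west.example.com/health",
]


def pick_healthy_region(endpoints, timeout=2):
    """Return the first endpoint whose health check answers with HTTP 200,
    so traffic can be shifted away from a failing region automatically."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # region unreachable or too slow; try the next one
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    print("routing traffic to:", pick_healthy_region(REGION_ENDPOINTS))
```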
While this incident has undoubtedly shaken cloud computing to the core, it is in no way a death knell for the technology. As long as lessons are learned and vulnerabilities are addressed, Amazon can continue on its tremendous growth trajectory in cloud computing.
By Sourya Biswas