It was more than user input error that caused the Amazon S3 outage
Maybe there was an input error during some daytime troubleshooting, but input error was not what caused the Amazon S3 outage.
It's a nice way to spin it, and it's certainly the way the media reported the mishap to the general population, but a typo wasn't the cause of the disastrous Amazon S3 outage of Feb. 28, 2017. It's understandable how people might get that impression, especially given how Amazon's press release was worded, but rest assured, the Amazon S3 outage wasn't caused by a user input error:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Given the universal desire for simple answers to complex problems, news media picked up on the farcical theme of a well-intentioned system administrator doing something akin to typing in a zero instead of the letter O, and, all of a sudden, breaking the internet. It's a "this could happen to anyone" type of explanation, the type that lends itself to instantaneous forgiveness. "Sheesh, I could totally see that happening to me," is what readers unconsciously say to themselves, quickly turning the complex, highly technical Amazon S3 outage into a sympathetic story of human fallibility.
The true nature of the Amazon S3 outage
But let's be honest: The Amazon S3 outage wasn't caused by human fallibility. That's like saying the historic Northeast blackout of 2003 was caused by a guy in Cleveland who turned on his air conditioner when the grid was already at maximum capacity. An input error may indeed have triggered the cloud outage, but it certainly wasn't the cause.
Reading past the report of a finger that slipped when pounding keys on a keyboard reveals that the real cause of the Amazon S3 outage is the development of a system that has become so large and so interconnected that even the best and brightest engineers in the industry can't accurately predict how it will behave in unexpected situations.
According to Amazon, the team member troubleshooting the billing system entered an incorrect input and removed capacity from two separate S3 subsystems, one of which was the index. "The index subsystem manages the metadata and location information of all S3 objects in the region," the press release says of that subsystem. The second subsystem to get sidelined, the placement subsystem, dealt with storage allocation and was dependent upon the aforementioned index.
One of the most fundamental rules in creating robust systems is to create areas of isolation. If an engineer troubleshooting the billing process can decimate servers in your data center, I'd say the systems need to be a bit more isolated. And if that type of potential does indeed exist, simple prudence would have the troubleshooting tasks occur in off-hours, not as users on the East Coast are coming back from lunch and users on the West Coast are logging on in the morning.
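To make the isolation argument concrete, here is a minimal sketch of the kind of guard a capacity-removal tool could apply before acting on an operator's input. Everything in it is an assumption for illustration: the `Subsystem` fields, the `check_removal` function, and the 5 percent default limit are hypothetical and do not describe Amazon's actual tooling.

```python
from dataclasses import dataclass


class RemovalRejected(Exception):
    """Raised when a removal request would exceed the allowed blast radius."""


@dataclass(frozen=True)
class Subsystem:
    name: str            # hypothetical label, e.g. "index"
    active_servers: int  # servers currently in service
    min_required: int    # assumed floor needed to keep serving requests


def check_removal(subsystem: Subsystem, requested: int,
                  max_fraction: float = 0.05) -> int:
    """Return the approved removal count, or raise if the request is unsafe."""
    if requested <= 0:
        raise RemovalRejected("requested count must be positive")

    # Never let a single command drop the subsystem below its capacity floor.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise RemovalRejected(
            f"{subsystem.name}: {remaining} servers would remain, "
            f"below the minimum of {subsystem.min_required}"
        )

    # Cap how much one command may remove, so a mistyped input stays small.
    if requested > subsystem.active_servers * max_fraction:
        raise RemovalRejected(
            f"{subsystem.name}: removing {requested} servers exceeds "
            f"{max_fraction:.0%} of active capacity in one step"
        )

    return requested
```

With a hypothetical index subsystem of 1,000 servers and a floor of 800, a request to remove 10 servers passes, while a mistyped request for 500 is rejected before anything is touched. The point of the sketch is that the limit lives in the tool, not in the operator's attention.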
The other fundamental rule for creating a stable system is to eliminate single points of failure by providing tested redundancies. The fact that the index subsystem could go down and knock the entire S3 system offline until the Brobdingnagian index was fully restarted should be of grave concern to anyone who has put their faith in the availability of Amazon's cloud.
Unbridled Amazon S3 subsystem growth
Even more concerning, it appears that nobody at Amazon had ever wondered what might happen if the gargantuan index failed. As Amazon's statement puts it: "We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
That whole paragraph creates a vision of a couple of brilliant Amazon engineers sipping coffee, watching this mammoth index subsystem grow and wondering if anyone should be running tests to see what might happen if it fails.
There is no doubt that Amazon's S3 is complicated, but the idea that the index subsystem could take the whole system down is very troubling. What's more troubling is that none of the experts had addressed the possibility of its failure. That makes me wonder how many other ticking time bombs exist within Amazon's cloud. What other subsystems might completely fail the next time someone working on the billing system types something incorrectly into a command window?
Amazon S3 weakness identification and analysis
Even more troubling to me is the fact that the damage done by the failure of the index subsystem doesn't seem like a completely unpredictable problem. In fact, it seems fully predictable. If you have a subsystem that everything depends upon and it is growing massively, maybe there should be a test to see what happens when it fails?
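As a rough illustration of what such a test might look like, the sketch below times a restart of a stand-in component and compares the result with a recovery-time budget. The `restart_drill` function, its parameters, and the idea of a fixed budget in seconds are all assumptions made for this example; a real drill would restart an actual replica and would be run repeatedly as the subsystem grows.

```python
import time
from typing import Callable


def restart_drill(restart: Callable[[], None],
                  budget_seconds: float,
                  clock: Callable[[], float] = time.monotonic) -> dict:
    """Time one restart and report whether it fit the recovery budget.

    `restart` is a hypothetical callable that restarts the component under
    test and returns only once it is serving again. `clock` can be replaced
    in tests so the drill logic is checkable without waiting.
    """
    start = clock()
    restart()
    elapsed = clock() - start
    return {
        "elapsed_seconds": elapsed,
        "within_budget": elapsed <= budget_seconds,
    }
```

Running a drill like this on a schedule, and alerting when `within_budget` turns false, is one way a team could notice that a restart "took longer than expected" before an outage forces the discovery.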
Amazon's public relations team did a great job getting in front of the story as interest in the Amazon S3 outage spread throughout the media. The simple story about the whole failure being caused by the slip of a finger distracted from the true failure of Amazon S3: its subsystems had grown to the point where their behavior outpaced the engineering team's ability to control them. Sure, user input error may have been the trigger, and it works as a believable explanation, but it wasn't the reason for the Amazon S3 outage.
You can follow Cameron McKenzie: @cameronmcnz