The Amazon Simple Storage Service outage was spectacular in both its scope and its magnitude. It seemed as though it affected every internet user on the planet, and it stirred up a firestorm of bad press for the company. But let's face it, cloud outages will happen, which is why it's imperative that software architects know what to do if Amazon is down.
The harsh reality of cloud computing
"There's a reason why you don't see IBM or Google or Microsoft or Oracle trying to capitalize on the S3 outage," said Oracle Vice President Deepak Patil. "It's because we all know that problems can and will occur."
Of course, that doesn't mean software architects should think of their cloud computing platforms as ticking time bombs waiting to go off. "Every day, the cloud becomes more stable and more reliable," Patil said. He asserts that virtualization engineers and cloud architects are constantly fixing problems and writing more robust cloud computing software, building ever more resiliency into their platforms. But there is a reason why cloud computing platforms talk about three nines, four nines and five nines and not a one with two zeroes after it, which is why developers of S3 apps need to know what to do if Amazon is down. Some degree of infrastructure downtime needs to be built into every software strategy, and organizations need a plan for how to handle that 0.01% of the time when their systems might be offline.
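To put the "nines" in concrete terms, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function name is invented for illustration; it does not reflect any vendor's actual service-level agreement.

```python
# Illustrative arithmetic only: translate an availability percentage into
# the downtime it permits over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    targets = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]
    for label, percent in targets:
        hours = allowed_downtime_hours(percent)
        print(f"{label} ({percent}%): {hours:.3f} hours, or {hours * 60:.1f} minutes, per year")
```

Under these assumptions, three nines leaves room for a little under nine hours of downtime a year, four nines for a little under an hour and five nines for roughly five minutes.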
In Amazon's defense, the Feb. 28 outage occurred in one of 16 Amazon Web Services (AWS) regions. It was the Northern Virginia (US-EAST-1) Region that failed for four hours. The other 15 regions remained stable during the outage.
The not-so-expert advice
In the days following the February 2017 Amazon S3 outage, much of the DevOps guidance on what to do if Amazon is down was decidedly flippant. The unhelpful advice many experts espoused amounted to the assertion that all of these problems could have been avoided if an organization had simply leveraged multiple availability zones. Unfortunately, S3 apps don't fail over to Amazon's other regions when Amazon is down.
Other equally flippant advice suggested that Amazon S3 users should be spreading their cloud computing dollars across multiple vendors, so that in the unlikely event that Amazon is down, cloud systems managed by Google or IBM would pick up the slack. Sure, that's a strategy that will work, but few will find it economical.
Furthermore, failing over to other vendors or different availability zones requires configuring a domain name system (DNS) to route traffic between AWS zones or disparate cloud providers. Unfortunately, setting up and configuring a DNS is a difficult task that should not be attempted by the faint of heart. And the whole point of using the cloud was to avoid having to set up peer-to-peer or master-and-slave server replication architectures. Besides, suggesting that the solution to a possible cloud outage is to simply add more cloud is both flippant and disingenuous.
So, what are some real strategies for dealing with that future moment in time when Amazon is down?
Staying up when Amazon is down
The first thing to do is evaluate just how much investment you want to put into adding more resiliency -- over and above the commitment of your cloud provider's service-level agreement -- to your applications.
It might actually be good enough to just apologize to your customers and point your finger at your cloud computing vendor when Amazon is down. If the entire internet breaks, your users might just forgive you if you simply say the whole thing was out of your control. And that's not flippancy. Plenty of services that went down on Feb. 28 did exactly that, and customers accepted the apology.
As pragmatic as that advice is, there may be those who take some issue with it. "Deploying applications to the cloud does not absolve the software development team of responsibility. You can't just deploy an application to the cloud and say to the vendor, 'This is your problem now,'" Patil said. "You own the applications you deploy. How they behave in the cloud is still your concern."
It's a sentiment with which Asanka Jayasuriya, head of engineering of Atlassian's HipChat, would likely agree. "You need to know exactly how your applications will behave under stress," Jayasuriya said. "That means actually triggering failures in your system and seeing how the application responds."
Jayasuriya praised the virtues of a tool named Chaos Monkey, a program whose job is to randomly kill runtime systems and processes. As Chaos Monkey forces systems and subsystems to fail, teams monitor their applications and note how they respond. Does the failure of a data system cause a complete outage? Does the shutdown of a particular node only impact a trivial application feature? When Chaos Monkey is unleashed on a production system, new insight is garnered about where redundancy is weak, and where fail-safes are required.
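The idea behind this kind of failure testing can be sketched in a few lines. This is a toy model, not Chaos Monkey itself or its API: the feature names, service names and functions below are all hypothetical, and the "failure" is simulated in memory rather than inflicted on a running system.

```python
import random

# Hypothetical in-memory model of a system: each feature lists the services
# it depends on. All names here are invented for illustration.
FEATURE_DEPENDENCIES = {
    "login": {"auth-service", "user-db"},
    "send-message": {"auth-service", "message-queue"},
    "emoji-picker": {"asset-store"},
}


def kill_random_service(services, rng):
    """Pick one service at random to 'fail', as a chaos tool would."""
    return rng.choice(sorted(services))


def surviving_features(failed_service):
    """Return the features that do not depend on the failed service."""
    return sorted(
        feature
        for feature, deps in FEATURE_DEPENDENCIES.items()
        if failed_service not in deps
    )


def run_experiment(rounds, seed=None):
    """Run several failure rounds and record which features each failure breaks."""
    rng = random.Random(seed)
    all_services = set().union(*FEATURE_DEPENDENCIES.values())
    report = {}
    for _ in range(rounds):
        victim = kill_random_service(all_services, rng)
        broken = sorted(set(FEATURE_DEPENDENCIES) - set(surviving_features(victim)))
        report[victim] = broken
    return report
```

In this model, losing the hypothetical `asset-store` only takes out a trivial feature, while losing `auth-service` breaks both login and messaging. That is the kind of weak spot such an experiment is meant to surface.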
Thinking about 'graceful degradation'
So, what do you do when you've identified holes in your architecture? The first inclination is to patch them, but that's not necessarily the right answer. Remember, a disruption in cloud service will only be temporary. If your application brokers stocks with a millisecond trade execution promise, patching every hole might be necessary, but if you're streaming music, attention to every little detail might not be.
"Graceful degradation is the key to dealing with service disruptions," Jayasuriya said. Within any application, there are always certain features that are used more often than others, and there are functions that users would prioritize over others. After prioritizing features, systems should be designed to provide the greatest degree of resiliency to the features the user needs most, and lesser-used features can be allowed to go offline completely when Amazon is down.
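A minimal sketch of what that prioritization might look like in code follows. It assumes a music-streaming page with one essential feature and one non-essential feature; the storage function, cache and feature names are hypothetical stand-ins, not a real client library.

```python
class StorageUnavailable(Exception):
    """Raised by the hypothetical storage client when the backing service is down."""


def fetch_from_storage(key, storage_up):
    """Stand-in for a remote object-store read; fails when storage_up is False."""
    if not storage_up:
        raise StorageUnavailable(key)
    return f"fresh:{key}"


# Hypothetical local cache, kept only for the features users need most.
LOCAL_CACHE = {"playlist": "cached:playlist"}


def load_feature(key, essential, storage_up):
    """Load a feature's data, degrading instead of failing the whole request.

    Essential features fall back to a cached copy; non-essential ones are
    switched off (None) until the storage service returns.
    """
    try:
        return fetch_from_storage(key, storage_up)
    except StorageUnavailable:
        if essential:
            return LOCAL_CACHE.get(key)
        return None


def render_page(storage_up):
    """Assemble a page from an essential and a non-essential feature."""
    return {
        "playlist": load_feature("playlist", essential=True, storage_up=storage_up),
        "album_art": load_feature("album_art", essential=False, storage_up=storage_up),
    }
```

With storage down, `render_page(False)` still returns the cached playlist and simply omits the album art, rather than failing the whole page.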
Software strategies for the cloud
This, of course, then leads into a discussion on how to develop cloud applications. It's difficult to segment a monolithic application into high-priority and low-priority pieces. But if software development takes a cloud-native approach by building in a high degree of modularity, perhaps by using containers and microservices, then breaking functions and features into separate zones of availability becomes much easier. With an Agile- and DevOps-based approach to cloud-native systems based on microservices and Docker, some parts of an application can easily be hosted in the cloud, while key, high-priority features could take advantage of cloud bursting, or be dual-hosted in both public and private clouds. That way an acceptable level of redundancy can be achieved without being a configuration nightmare or extremely cost-prohibitive.
Cloud computing platforms will fail. Amazon proved that unequivocally with the Feb. 28 S3 outage, but by no means are the other vendors in this space immune from the prospect of failure. As such, it's the responsibility of application developers to be prepared for instances of disruption, to choose how they want their applications to gracefully degrade when downtime does occur and to code their applications in a cloud-native way so that they are flexible enough to remain running, even if Amazon is down.
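For a high-priority feature that is dual-hosted, the calling code needs some way to try one deployment and then the other. The sketch below shows one simple way that could look; the backend functions and exception are hypothetical placeholders for real service calls, and a production system would also need timeouts and health checks that are left out here.

```python
from typing import Callable, Sequence


class BackendDown(Exception):
    """Raised by a hypothetical backend callable when its host is unreachable."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in order and return the first successful response."""
    last_error = None
    for backend in backends:
        try:
            return backend(request)
        except BackendDown as error:
            last_error = error  # remember the failure and try the next host
    raise BackendDown("all backends failed") from last_error


def public_cloud(request: str) -> str:
    """Stand-in for the public-cloud deployment; always down in this sketch."""
    raise BackendDown("public cloud unreachable")


def private_cloud(request: str) -> str:
    """Stand-in for the private-cloud deployment of the same microservice."""
    return f"private handled {request}"
```

Calling `call_with_failover([public_cloud, private_cloud], "checkout")` falls through to the private deployment. Only the features judged worth the extra hosting cost would be wired up this way.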
You can follow Cameron McKenzie on Twitter: @cameronmcnz