Amazon S3 outage a Fukushima moment for cloud computing

The Amazon S3 outage has turned into the Fukushima moment of cloud computing, as users re-evaluate the cloud's long-term viability.

For five hours, the internet was broken and it was Amazon's fault. The Simple Storage Service (S3) outage not only took out the small and midsize enterprises that rely on Amazon services, but it also pulled the covers off the underpinnings of big vendors like Apple, exposing to the world that the Apple iCloud wasn't an Apple cloud at all. Instead, it's just a branding label slapped on another cloud vendor's clock cycles.

This nightmare wasn't supposed to happen. Cloud services aren't supposed to go down. More to the point, they aren't supposed to go down for five hours. And most perplexing of all, they're not supposed to go down when the tide is highest during the day, starting around lunch hour on North America's east coast and the start of the workday on the west.

Cloud computing's Fukushima moment

When future historians talk about the rise, fall and plateau of the cloud, the Amazon S3 outage will no doubt be seen as a Fukushima moment.

Some call nuclear a cheap and clean source of energy, and if you don't factor in the million-year maintenance fees for monitoring hazardous waste, there's a good argument to be made for harnessing the atom. Nuclear is clean, compact and has a relatively small geographical footprint compared to something like China's Three Gorges Dam. And, of course, nuclear can generate endless amounts of on-demand power. All of these are pretty compelling reasons to drill for uranium.

Of course, there's a fairly compelling reason to abandon nuclear, and it can be summed up in one word: Fukushima. You can argue the advantages of nuclear power until you're blue in the face, but so long as the long-shot possibility exists that a nuclear meltdown will poison the international food supply and turn the town you live in into a radioactive wasteland, people are going to choose windmills over deuterium-cooled reactors.

A ticking time bomb in the data center

It's the same thing with the Amazon S3 outage. Amazon can promise three nines, four nines or all the nines in the world, but so long as the possibility exists that its service will blow up in the middle of the day because someone with super-user rights typed in an incorrect command from a troubleshooting playbook, organizations that need stable and reliable systems will think twice about going whole hog on Amazon S3.
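To put the "nines" in perspective, the arithmetic below converts an availability target into a yearly downtime budget. It is only a rough sketch: it assumes a 365-day year and measures availability over a full year, which is not necessarily how Amazon or any other provider defines its service-level terms. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines.

    Three nines means 99.9% availability, so 0.1% of the year
    (a fraction of 10 ** -3) may be spent offline.
    """
    unavailable_fraction = 10 ** -nines
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for nines in (3, 4, 5):
        hours = allowed_downtime_hours(nines)
        print(f"{nines} nines: {hours:.2f} hours of downtime per year")
```

Under these assumptions, a single five-hour outage fits inside a three-nines yearly budget but uses up several years' worth of a four-nines budget.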

Amazon admitted that the Amazon S3 outage happened because its system index had grown to such a monstrous size that nobody really understood it, and nobody in the organization predicted that restarting it could cause such a disastrous problem. That has left users wondering how many other ticking time bombs exist within the infrastructure. Before the S3 outage, people invested in the Amazon cloud because they were confident in both the technology being used and the manner in which it was managed. With the S3 outage, what was once confidence has been replaced with faith -- and faith isn't a compelling attribute when it comes to IT risk assessment.

Solving the cloud computing problem

Cloud advocates haven't been shy about proselytizing their proposed solution for dealing with future cloud outages. Their answer to the problem is simple: more cloud. After all, the Amazon S3 outage was localized to a single AWS region, US-EAST-1 in Northern Virginia. If customers had leveraged data centers in other regions, there would have been redundancy. Of course, it's a ludicrous assertion, as if software architects are going to build cloud clusters for their applications. It's like saying the best way to deal with the fact that nuclear generators might blow up is to just go and build more nuclear generators.

On par with the promise of cost savings, eliminating the need to set up convoluted environments of horizontally, vertically and geographically clustered servers in order to maintain four nines of uptime was the primary compulsion driving organizations into the cloud. If IT departments can't depend on cloud service providers to solve the availability problem, the value proposition of using something like Amazon S3 dwindles to nothing.
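For reference, the "more cloud" argument rests on a simple probability calculation, sketched below. It assumes that regions fail independently of one another and that an application stays up as long as any one region is up. Both are idealizations, and the 99.9% figure is a hypothetical input rather than a published number.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of a service replicated across several regions.

    Assumes independent failures: the service is down only when
    every region is down at the same time.
    """
    all_down = (1.0 - per_region) ** regions
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical per-region availability of 99.9% (three nines).
    for count in (1, 2, 3):
        print(f"{count} region(s): {combined_availability(0.999, count):.9f}")
```

The calculation says nothing about the engineering effort or the bill for running an application in more than one region, which is the cost this section is weighing.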

Bringing the cloud home

So, what's the future of cloud computing, now that users realize that a full-scale, daylight-hours crash is always a possibility? The move will be for organizations to start bringing more of their systems back into the local data center. Leveraging cloud computing technologies like OpenStack will allow organizations to build their own in-house data systems where the benefits of cloud computing can be realized without handing over full control to a third-party vendor. Not only does it put control back into the hands of the IT department, but other worries like controlling external costs and dealing with security and auditing are no longer a governance headache.

The other big move will be for organizations to leverage cloud bursting technologies, making use of the cloud only when their in-house systems approach capacity. But using the cloud exclusively will become a thing of the past. The Amazon S3 outage was a Fukushima moment for cloud computing, and it will forever taint the way organizations view the cloud.

You can follow Cameron McKenzie: @cameronmcnz

Join the conversation

How has the Amazon S3 outage shaken your confidence in reliability of your cloud computing provider?

I find it very hard to agree with any statement that it will forever taint the cloud or even cause organisations to leverage their own data centres again. Technology doesn't go backwards but instead will fix forward so that if you want your cloud provision to be geographically shared there will be a helpful tick box in your provisioning to enable this.

It is concerning that Amazon themselves had lost control of their system index but this is equally something that could happen in any data centre and to any vendor. The second issue is the lack of transparency of platforms (including Apple) that rely on S3 and this needs to be addressed so that customers of these platforms at least understand the risks that they are taking. BTW personally I am comforted that Apple's cloud offering is at least being managed by an organisation who understand the cloud best notwithstanding the outage.

The power industry does build multiple power stations in case one fails, and the grid is designed to deal with failures. Fukushima did not turn the lights out in Japan. A properly designed, distributed application would not have failed just because one node went down, and putting all your eggs in one in-house system is no better than building on one AWS region.

Having witnessed RAID array failures, UPS failures, failed restores from tapes, index corruptions, deadly embrace problems and router MAC address tables corrupted by people moving equipment around, I think it is only a question of time before a major cloud service provider loses control over its data.

Cloud computing is a risk vs. reward decision:

Main drivers in the decision are:
  • Service Offerings "Time to Market"
  • Corporate IT cost savings (IT outsourcing)
  • Leveraging existing global cloud infrastructure
  • Providing high level interfaces to complex IAAS/SAAS services
  • ETC...
But these gains can come at a significant risk and can be detrimental to a company's technological success in the future. "No one" can predict the future, and the decision to move to the cloud should be made judiciously. But many corporations have proven time and time again to be concerned with only the current bottom line (with limited concern for the potential future pitfalls related to the current savings). CEOs and CIOs are driven to increase profits year over year, and the cloud is an attractive cost-saving innovation. But at what risk?

Cost Savings:
  • This latest AWS outage (now the second outage) proves there are significant drawbacks with moving to the cloud (if cost savings is the primary objective). As explained by many cloud experts: if an AWS consumer had paid for additional redundancy options, they may have been unaffected. But paying for the additional redundancy can very well negate the cost savings associated with moving to the cloud in the first place.
  • Unconditional dependencies on the cloud provider (AWS in this case)
  • The loss of low level control over your intellectual property and potentially the security layers used to protect it
  • The HR implications of dealing with outsourced vendors (both in and out of the country)
  • The complexity to IT management overall (coordinating/orchestrating new solutions with a dispersed set of IT staff, some with skin in the game and some not).
  • ETC...
The "Rise of the Developer":
  • Is a driving force behind the move to the cloud, as developers can deliver solutions through cloud computing to market much faster, but at what risk?
Points of concern:
  • This latest failure is the undeniable proof that cloud computing is only as reliable as the people managing it. 
  • Unfortunately, when you insert enough layers, it becomes difficult to understand the system and to identify who is actually managing (and protecting) your data!
  • Who knew Apple's cloud infrastructure was (at least partially) supported by AWS and rebranded?
  • It's a scary thought to think that most of our banking and financial transactions are now handled by cloud computing services.
  • Every aspect of our lives is becoming more and more dependent on cloud computing.
  • The trend is now to consolidate a traditionally dispersed IT infrastructure and create massive points of failure across the internet.
  • This puts the health and stability of our lives at risk!

I think the move to cloud computing is a risky proposition and that decision should include all aspects of your IT infrastructure, especially how it supports your business (not only from a technology perspective).

Just my take on this...