Picture this. It’s mid-April, 2011. Everything seems bright and sunny for cloud computing firms. A leading technology research group has recently predicted that by the end of 2012, a full 20% of businesses will have no on-site IT infrastructure. Everyone will be using the cloud instead because it’s so cheap, so effortless, so secure, so reliable! Everybody is in love with this shiny new infrastructure. Just when the future seems assured, those clouds of promise turn to lowering thunderheads. Suddenly, lightning strikes, taking down one of the largest cloud providers in the blink of an eye! What followed was a downward spiral that seriously damaged the reputation and short-term growth potential of this infant industry.
You might recognize this scenario as a description of the great Amazon EC2 catastrophe of 2011. It will likely take the cloud industry years to recover from this debacle. Sadly, what devastated the trust of customers was not just what happened but how Amazon decided to communicate (or not communicate) as the problem was being resolved.
So, what did happen?
In a nutshell, someone screwed up what should have been a routine task and all heck broke loose. Of course, Amazon is still not saying exactly who was responsible. However, since they now say that one of the things they are doing to prevent future problems is increasing automation, it’s likely that human error of some kind triggered the original event. You can find a detailed, technical postmortem of the whole thing here.
Amazon needs a course in basic crisis communication
Instead of communicating early and often in as much detail as possible, Amazon communicated late, infrequently, and in vague terms. They didn’t even bother to use Twitter to keep customers posted. Paul McNamara from Buzzblog reported that Amazon still hadn’t issued an apology five days after the outage (although they did eventually apologize). This goes against all the guidelines for crisis communication. What are those guidelines and how did Amazon fail to follow them?
Tell customers what happened. (It took 40 minutes for Amazon to notify customers there was a problem. It took 8 hours for Amazon to give customers even a basic description of the problem.)
Tell customers how the problem occurred. (Amazon claims they didn’t do this during the crisis because they wanted to investigate the incident thoroughly first. Also, they claim that every single person was working on fixing the problem. I guess they don’t have a PR department?)
State who is affected, how, and to what degree. (Amazon didn’t give out precise information about the scope and nature of the problem even while it was still expanding to affect more and more customers. This left customers guessing about whether they needed to take steps to move their servers and volumes to unaffected zones. A warning would have been nice.)
Tell customers what you are doing to fix the problem. (Saying “We’re all working on it really hard” over and over doesn’t cut it.)
Tell customers what you are doing to ensure it doesn’t happen again. (This one is actually OK to do after the fact when all the details are in.)
Express sincere empathy for those affected. (It’s possible to do this without accepting blame, so there was no reason for Amazon to wait until they knew every detail about what happened and why.)
Trust no one?
Cloud providers will always promise the moon. No firm is going to come out and say any of the following (even though these things may be true):
We take security seriously, but let’s face it. A bored 19-year-old with a grudge might invent a novel way to take down the entire system some day and there’s %*X! all we can do to stop that.
If your business goes out of business because of a service failure on our part, good luck scraping together enough money to hire a lawyer and come after us.
Some of our employees are occasionally going to be just as overworked and under-enthusiastic as anyone else. No matter how many layers of redundancy we have in place to catch mistakes, there will always be an error rate that is > 0.
The cloud is now so complex that no one person can grasp the entire concept of our infrastructure. Much of what we handle is too massive to even be realistically testable. We’ve got plenty of specialists; but there’s always a chance we’re overlooking something that will come back to bite us later.
Solar flares, earthquakes, and other disasters happen. We’ve heard the earth’s magnetic poles are about to shift, too. Deal with it.
Just because no one is saying these things doesn’t mean you can’t trust a cloud provider. You just need to assume these risks exist even if they are unspoken.
How to protect yourself
One thing that many people overlook in the discussion of whether to trust the cloud again is that the other options (private on-site data centers) aren’t necessarily safer. Your best bet is to create your own backup plan and make sure it can be carried out even when your major cloud supplier is down. This means more than just deploying in multiple availability zones with a single provider. You may also need to contract with a backup service provider who can step in and take over if your primary bites the dust. Just as important, you need to have your own crisis communication plan in place to talk with your customers about how your site is being affected by a service provider’s outage. You can find additional common-sense advice on what steps to take here and here.
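To make the backup-provider idea a little more concrete, here is a minimal sketch in Python of the kind of check a failover plan might rest on: try your primary first, then fall back to whatever is next on the list. Everything named here (`http_probe`, `pick_healthy_endpoint`, and the example endpoint labels) is hypothetical and made up for illustration. It is not any provider’s API, and a real plan would also need DNS changes, data replication, and regular drills.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Hypothetical health probe: True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def pick_healthy_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint that passes the health check, in priority order.

    Returns None when nothing is healthy, which is the cue to start your
    own crisis communication plan.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that blows up counts as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Simulated statuses so the example runs without touching the network.
    simulated_status = {
        "primary-zone-a": False,   # assumed: primary provider, first zone
        "primary-zone-b": False,   # assumed: primary provider, second zone
        "backup-provider": True,   # assumed: a separate backup provider
    }
    chosen = pick_healthy_endpoint(simulated_status, simulated_status.__getitem__)
    print("Serving from:", chosen)
```

The priority order is the whole point: zones at the same provider come first because they are cheapest to fail over to, and the separate provider comes last because it is the one that still works when the primary has a region-wide bad day.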