When the Amazon S3 outage hit the East Coast last year, it took a large number of Java cloud apps down with it. Others, like Nike’s website, just slowed down. But a few, like Netflix, saw no impact from the outage at all.
Achieving Netflix’s level of resiliency with your Java cloud apps requires a willingness to break things, even in production, to see how the app behaves when failure occurs. It also requires a great deal of upfront investment in thinking through Java cloud architectures and service composition, and in automating recovery and failover procedures in case another Amazon S3 outage occurs. This isn’t necessarily the best investment for enterprises in which a short disruption is embarrassing but doesn’t rise to the level of an unmitigated disaster. In the case of the Amazon S3 outage, no one lost their data, and every Java cloud app that failed had a good excuse for its customers.
To protect against another Amazon S3 outage, here are some best practices to adopt:
Document Java cloud apps and infrastructure
Figure out which infrastructure each application runs on, and identify which applications depend on others. In particular, note keystone applications used by a wide variety of other apps. For instance, in the case of AWS S3, the loss of the metadata management system killed off S3 in the entire region. Greater attention needs to be placed on ensuring the reliability and redundancy of these applications, since their failure could cascade into a larger problem.
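One way to start that documentation is to record the dependencies as data and ask what a single failure would take down. The Java sketch below assumes a hand-maintained map of each app to the services it calls directly. The class name DependencyInventory and any app names fed into it are hypothetical and do not come from a real tool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical inventory: each app maps to the services it calls directly. */
final class DependencyInventory {
    private final Map<String, Set<String>> dependsOn = new HashMap<>();

    /** Records that 'app' calls 'dependency' directly. */
    void addDependency(String app, String dependency) {
        dependsOn.computeIfAbsent(app, k -> new HashSet<>()).add(dependency);
        dependsOn.computeIfAbsent(dependency, k -> new HashSet<>());
    }

    /** Every app affected, directly or transitively, if 'service' fails. */
    Set<String> impactedBy(String service) {
        Set<String> impacted = new TreeSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.add(service);
        while (!toVisit.isEmpty()) {
            String current = toVisit.poll();
            for (Map.Entry<String, Set<String>> entry : dependsOn.entrySet()) {
                // An app is impacted if it calls the failed service and was not seen before.
                if (entry.getValue().contains(current) && impacted.add(entry.getKey())) {
                    toVisit.add(entry.getKey());
                }
            }
        }
        return impacted;
    }
}
```

Services with a large impactedBy set are the keystone applications that deserve the extra redundancy.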
See how things break
One of the key elements of Netflix’s strategy is to constantly break things using Chaos Monkey. Its engineers routinely take production systems, cloud availability zones, databases, and services offline to see the performance impact on other systems. This, of course, requires a tremendous amount of trust in the engineering. But it also makes it easier to identify the specific breaking points and chains of failure that can cascade through the system.
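The same idea can be tried on a small scale long before anything is switched off in production. The following Java sketch wraps a call so that it fails at a configurable rate, which lets a test observe how the callers cope. It is an illustration only and not how Chaos Monkey itself works; FaultInjector, its failure rate, and its seed parameter are made-up names.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical fault injector: makes a wrapped call fail some fraction of the time. */
final class FaultInjector {
    private final Random random;
    private final double failureRate;

    /** failureRate is a probability between 0.0 (never fail) and 1.0 (always fail). */
    FaultInjector(double failureRate, long seed) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be between 0.0 and 1.0");
        }
        this.failureRate = failureRate;
        this.random = new Random(seed); // seeded so a test run can be repeated
    }

    /** Returns a supplier that throws instead of calling the real one at the configured rate. */
    <T> Supplier<T> wrap(Supplier<T> call) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return call.get();
        };
    }
}
```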
It’s much easier to test this out in the cloud, where developers can economically spin up an accurate mockup of their entire infrastructure. Enterprises can also model in-house apps that run on expensive hardware, using techniques like service virtualization, to achieve the same goals.
Creative teams can even have game days to identify novel techniques for breaking systems. Special prizes go to the developer or team that can cause the greatest damage with the smallest change to infrastructure.
It’s also important to determine whether application rendering depends on third-party services for data. Quite a few web apps break when something as simple as an ad insert goes down. A simple experience can be better than none. Take a cue from the Starbucks baristas near my house who served free coffee in 4-ounce cups when their payment server crashed nationwide. The loss of a few gallons of coffee pales in comparison to the loss of goodwill from uncaffeinated customers.
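In code, the free-coffee approach amounts to giving every optional third-party call a time budget and a fallback. The Java sketch below assumes the third-party call can be expressed as a Supplier; OptionalContent, fetchOrFallback, and the fallback value are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: render a fallback when an optional third-party call fails or stalls. */
final class OptionalContent {
    static String fetchOrFallback(Supplier<String> thirdParty, long timeoutMillis, String fallback) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(thirdParty);
        try {
            // Wait only as long as the page can afford.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException | TimeoutException e) {
            // Stop waiting. Note that cancel() does not interrupt the running task.
            future.cancel(true);
            return fallback;
        }
    }
}
```

The page still renders with the placeholder, which is the simple experience that beats none.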
See how Java cloud services bounce back
The process of breaking application infrastructure can also highlight how applications fail over and recover from failure. Mission-critical apps might be directly copied to hot backups in other availability zones or even with other cloud providers. Of course, this requires a lot of investment in data storage, networking, and unused capacity. Less critical apps might simply restart from another instance of the same container.
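On the client side, using a hot backup can be as simple as trying each copy in order of preference. The Java sketch below assumes every replica is reachable through a Supplier; Failover and firstAvailable are hypothetical names, and a real deployment would more likely rely on a load balancer or DNS failover than on application code.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical client-side failover: try each replica in order until one answers. */
final class Failover {
    static <T> T firstAvailable(List<Supplier<T>> replicas) {
        RuntimeException lastFailure = null;
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember why this replica failed, then try the next one
            }
        }
        throw new IllegalStateException("all replicas failed", lastFailure);
    }
}
```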
Tools for automating the provisioning and configuration of new instances can dramatically reduce recovery times. Unfortunately, many software architects don’t know which tools are available or understand how they work. Knowing how to use these tools can also make it much easier to scale up to meet unexpected demand.
Fail Java cloud apps gracefully
When services do die, another good practice is to give users some level of functionality anyway. This is easiest in native mobile apps that maintain their own data store. One of the real tragedies of modern app development is that few developers take advantage of the caching mechanisms of modern browsers.
This is critical for mobile apps, which often depend on sporadic network connections. But caching can also improve the performance and reliability of desktop applications. This requires developers to think through strategies for managing data locally and for automatically synchronizing data in the background.
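A small version of this idea is to keep the last good response and serve it when the backend cannot be reached. The Java sketch below is one possible shape for that, assuming the backend call is a Function from key to value; StaleOnErrorCache is a hypothetical class, and it deliberately leaves out expiry and background synchronization.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache that serves the last good value when the backend call fails. */
final class StaleOnErrorCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    StaleOnErrorCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        try {
            V fresh = loader.apply(key);
            if (fresh != null) {
                lastGood.put(key, fresh); // remember the latest successful answer
            }
            return fresh;
        } catch (RuntimeException e) {
            V stale = lastGood.get(key);
            if (stale == null) {
                throw e; // nothing cached yet, so the failure has to surface
            }
            return stale; // degraded but still useful
        }
    }
}
```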