News: Failure Isolation and Recovery: Learning from High-Scale and Extreme-Scale Computing

  1. While I have been building business-critical enterprise systems for a long time, I haven't worked on high-scale cloud computing or Internet-scale architectures with tens or hundreds of thousands of servers. There are some fascinating, hard problems to solve in engineering systems at high scale, but the most interesting to me are the problems in deployment and operations management, and especially how to deal with failure.

    Read more at: Java Code Geeks: Failure Isolation and Recovery: Learning from High-Scale and Extreme-Scale Computing

  2. Excellent article. It got me thinking.

    I have implemented three mission-critical applications, in banking, insurance and fingerprint matching. None of them comes close to Internet-scale volumes: they are large-scale, but not extreme-scale. The difficulty I found in those organizations is that they have many large legacy applications that are hard to scale and are quite often a single point of failure. Unfortunately, most of my applications so far have had to interact with a couple of them, and sometimes many of them (over 20 in one case). The failure pattern is very hard to predict.

    I totally agree with the article on vertical failure. I call it 'fail gracefully', with delayed retries and timeouts. Whenever dependent systems are down, don't choke the application with new transactions. Apply load control by either slowing them down or gracefully rejecting them at the client side. Don't trash existing transactions; auto-recover them.
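
    A minimal sketch of the 'fail gracefully' idea, under stated assumptions: the class name DependencyGuard, its methods and the fallback value are all hypothetical, and the remote call is assumed to enforce its own timeout and throw when it expires. After a failure the guard rejects new calls until a retry delay has passed, instead of letting work pile up against a dead system. Auto-recovery of in-flight transactions is not shown.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical guard around one dependent system; all names are illustrative. */
class DependencyGuard {
    private final long retryDelayMillis;
    private final LongSupplier clock;   // injected so the delay can be tested
    private long blockedUntil = 0L;     // no call is attempted before this time

    DependencyGuard(long retryDelayMillis, LongSupplier clock) {
        this.retryDelayMillis = retryDelayMillis;
        this.clock = clock;
    }

    /** True while new transactions should be rejected at the client side. */
    synchronized boolean isRejecting() {
        return clock.getAsLong() < blockedUntil;
    }

    /**
     * Runs the call unless the dependency is marked down.
     * Returns the fallback instead of throwing, so callers can reject gracefully.
     * Assumption: remoteCall applies its own timeout and throws when it expires.
     */
    <T> T call(Callable<T> remoteCall, T fallback) {
        if (isRejecting()) {
            return fallback;            // fail fast, do not queue more work
        }
        try {
            return remoteCall.call();
        } catch (Exception e) {
            synchronized (this) {
                // Delay the next attempt rather than retrying immediately.
                blockedUntil = clock.getAsLong() + retryDelayMillis;
            }
            return fallback;
        }
    }
}
```

    In production the clock would typically be System::currentTimeMillis; a fixed retry delay is the simplest choice here, and a growing (backoff) delay would be a natural variation.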

    Regarding horizontal failure, I have not had the chance to partition by transaction type; all transactions enter the same system. I wonder about the cost of partitioning, though. My trick is to scale up with application instances residing on physical servers. Each instance has a Governor component that is integrated with a Load Control component. The Governor stops or slows down the admission of new transactions if the Load Control component reports high load. Isolating different kinds of work is the key to success, much the same idea as SEDA. Most workflow and ESB engines, such as WebLogic Integration, Mule ESB and Oracle SOA Suite, have adopted the same idea.
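
    The Governor and Load Control arrangement can be sketched roughly as follows. This is an illustration only, not the actual components described above: the class shapes, method names and the simple in-flight counter used as the load signal are all assumptions. It shows only the "stop admission" case; slowing admission down would need a delay or queue in front of tryAdmit.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical load tracker: counts in-flight transactions for one instance. */
class LoadControl {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int highWaterMark;   // assumed load signal: max concurrent work

    LoadControl(int highWaterMark) {
        this.highWaterMark = highWaterMark;
    }

    /** Reserves a slot; returns false if that would exceed the high-water mark. */
    boolean tryStart() {
        if (inFlight.incrementAndGet() > highWaterMark) {
            inFlight.decrementAndGet();   // undo the reservation
            return false;
        }
        return true;
    }

    void finished() {
        inFlight.decrementAndGet();
    }

    boolean isHighLoad() {
        return inFlight.get() >= highWaterMark;
    }
}

/** Hypothetical governor: admits work only while LoadControl has capacity. */
class Governor {
    private final LoadControl loadControl;

    Governor(LoadControl loadControl) {
        this.loadControl = loadControl;
    }

    /** Returns true if the transaction ran, false if it was turned away. */
    boolean tryAdmit(Runnable transaction) {
        if (!loadControl.tryStart()) {
            return false;                 // caller can retry later or reject
        }
        try {
            transaction.run();
            return true;
        } finally {
            loadControl.finished();       // always release the slot
        }
    }
}
```

    Giving each kind of work its own Governor and LoadControl pair would be one way to keep a surge in one transaction type from starving the others.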