Excellent article. It got me thinking.
I have implemented three mission-critical applications in banking, insurance and fingerprint matching. None of them comes close to Internet-scale volume, though. They are large scale, but not extreme scale. The difficulty I found in those organizations is that they have many large legacy applications that are hard to scale and are quite often single points of failure. Unfortunately, most of my applications so far have had to interact with a couple of them, and sometimes many of them (over 20 in one case). The failure pattern is very hard to predict.
I totally agree with the article on vertical failure. I call it 'fail gracefully', with delayed retries and timeouts. Whenever dependent systems are down, don't choke the application with new transactions. Apply load control to them, either by slowing them down or by gracefully rejecting them at the client side. Don't trash existing transactions; auto-recover them.
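To make 'fail gracefully' a bit more concrete, here is a minimal Python sketch of the idea, not the code from any of my systems. All names (GracefulCaller, DependencyDown, Rejected) and all numbers are made up for illustration. It assumes the wrapped operation enforces its own timeout and raises DependencyDown when the dependent system does not answer in time.

```python
import time


class DependencyDown(Exception):
    """Raised by an operation when the dependent system fails or times out."""


class Rejected(Exception):
    """Raised to the client when a transaction is refused gracefully."""


class GracefulCaller:
    """Hypothetical wrapper: delayed retries, then a cooldown with fast rejection."""

    def __init__(self, max_attempts=3, base_delay=0.5, cooldown=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.cooldown = cooldown
        self.clock = clock      # injectable so the behaviour can be tested
        self.sleep = sleep
        self.down_until = 0.0   # time until which new work is rejected

    def submit(self, operation):
        # While the dependency is believed to be down, reject new
        # transactions immediately instead of piling them up.
        if self.clock() < self.down_until:
            raise Rejected("dependency unavailable, please retry later")

        delay = self.base_delay
        for attempt in range(1, self.max_attempts + 1):
            try:
                return operation()
            except DependencyDown:
                if attempt == self.max_attempts:
                    # Give up on this transaction and start the cooldown.
                    self.down_until = self.clock() + self.cooldown
                    raise Rejected("dependency did not recover in time")
                # Delayed retry: wait, and wait longer each time.
                self.sleep(delay)
                delay *= 2
```

A transaction that is already in progress gets a few delayed retries before it is given up, while new transactions are turned away quickly during the cooldown. Real recovery of in-flight transactions would also need them persisted somewhere, which this sketch leaves out.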
Regarding horizontal failure, I have not had the chance to partition by transaction type. All transactions enter the same system. I do wonder about the cost of partitioning, though. My trick is to scale with application instances residing on physical servers. Each instance has a Governor component which is integrated with the Load Control component. The Governor stops or slows down the admission of new transactions if the Load Control component reports high load. Isolating different kinds of work is the key to success, much the same idea as SEDA. Most workflow and ESB engines, like WebLogic Integration, Mule ESB and Oracle SOA Suite, have adopted the same idea.
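A rough Python sketch of how a Governor and a Load Control component could fit together is below. It is only an illustration under my own assumptions: load is measured as in-flight transactions divided by a fixed capacity, and the thresholds (0.7 to slow down, 0.9 to stop) are invented numbers, not values from a real system.

```python
class LoadControl:
    """Hypothetical load monitor: tracks in-flight transactions per instance."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def started(self):
        self.in_flight += 1

    def finished(self):
        self.in_flight -= 1

    def load(self):
        # Load as a fraction of capacity (assumed metric for this sketch).
        return self.in_flight / self.capacity


class Governor:
    """Hypothetical admission gate driven by what Load Control reports."""

    ADMIT, SLOW, STOP = "admit", "slow", "stop"

    def __init__(self, load_control, slow_at=0.7, stop_at=0.9):
        self.load_control = load_control
        self.slow_at = slow_at
        self.stop_at = stop_at

    def decide(self):
        load = self.load_control.load()
        if load >= self.stop_at:
            return self.STOP   # reject new transactions
        if load >= self.slow_at:
            return self.SLOW   # admit, but only after a delay
        return self.ADMIT

    def try_admit(self):
        """Return the decision and register the transaction unless stopped."""
        decision = self.decide()
        if decision != self.STOP:
            self.load_control.started()
        return decision
```

The caller would check the decision before starting a transaction, add a delay on "slow", and call finished() on the Load Control when the transaction completes.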
Tao