An eCommerce site that crashes seven times during the Christmas season, and stays down for up to five hours each time, loses a lot of money and suffers reputation damage. This happened to one of our customers before we started working with them. They shared their story and what they learned at our annual perform conference earlier this month. Of the several causes behind these crashes, I want to go into detail on one that I also see frequently on other websites: load balancers configured for Round-Robin instead of Least-Busy can easily lead to app server crashes caused by heap memory exhaustion. Let's dig into the details and see how to identify these problems and how to avoid them.
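To make the Round-Robin versus Least-Busy difference concrete before looking at the real data, here is a minimal toy simulation. It is only a sketch under stated assumptions: the `simulate` function, the six-server layout, the arrival rate, and the per-server capacities are all hypothetical numbers chosen for illustration, not measurements from the customer's system. It assumes one instance is temporarily slower than the others and counts how many requests pile up on each server. In a real Tomcat, every queued request holds on to memory, so an ever-growing backlog is what eventually exhausts the heap.

```python
from itertools import cycle


def simulate(policy, ticks=1000, arrivals_per_tick=12,
             capacities=(3, 3, 3, 3, 3, 1)):
    """Return the peak backlog (queued requests) seen on each server.

    capacities[i] is how many requests server i can finish per tick.
    All numbers are hypothetical; the last server is the slow one.
    """
    queues = [0] * len(capacities)
    peaks = [0] * len(capacities)
    next_server = cycle(range(len(capacities)))

    for _ in range(ticks):
        # Distribute this tick's incoming requests.
        for _ in range(arrivals_per_tick):
            if policy == "round_robin":
                # Ignore current load, just take the next server in line.
                target = next(next_server)
            elif policy == "least_busy":
                # Pick the server with the fewest requests in flight.
                target = min(range(len(queues)), key=queues.__getitem__)
            else:
                raise ValueError(f"unknown policy: {policy}")
            queues[target] += 1

        # Record the backlog, then let each server work off what it can.
        for i, capacity in enumerate(capacities):
            peaks[i] = max(peaks[i], queues[i])
            queues[i] = max(0, queues[i] - capacity)

    return peaks


if __name__ == "__main__":
    print("round robin peaks:", simulate("round_robin"))
    print("least busy peaks: ", simulate("least_busy"))
```

With these assumed numbers the cluster has enough total capacity for the load, yet under Round-Robin the slow instance keeps receiving the same share as everyone else and its backlog grows with every tick. Under Least-Busy the extra requests go to the instances that have room, and every queue stays at a handful of requests.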
The Symptom: Crashing Tomcat Instances
The website is deployed on six Tomcat instances with three front-end Apache web servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory exceptions, which also brought down the rest of the site because the remaining servers could no longer handle the load. The following image shows the actual flow of transactions through the system, highlighting unevenly distributed response times across the application servers and functional errors reported on all tiers (red server icons):