Don’t let your load balancers ruin your holiday business



  1. An eCommerce site that crashes seven times during the Christmas season, staying down for up to five hours each time, loses a lot of money and suffers reputation damage. This happened to one of our customers before we started working with them. They shared their story and what they learned at our annual Perform conference earlier this month. Among the several causes of these crashes, I want to share more detail on one that I also see frequently on other websites: load balancers configured for Round-Robin instead of Least-Busy can easily lead to application server crashes caused by heap memory exhaustion. Let’s dig into the details and see how to identify and avoid these problems.

    The Symptom: Crashing Tomcat Instances

    The website is deployed on six Tomcat instances behind three frontend Apache web servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory errors, and that also brought down the rest of the site, as the remaining servers could no longer handle the load. The following image shows the actual flow of transactions through the system, highlighting unevenly distributed response times in the application servers and functional errors being reported on all tiers (red colored server icon):

    Continue reading the full blog ...
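    Since the frontend here is Apache, one concrete way to move off round-robin is mod_proxy_balancer's lbmethod setting. A minimal sketch (Apache 2.4; the hostnames, ports, and paths are illustrative, not from the actual deployment):

```apache
# mod_proxy_balancer defaults to lbmethod=byrequests (round-robin by
# request count). lbmethod=bybusyness instead picks the worker with the
# fewest active requests - closest to "least busy". On Apache 2.4 this
# needs mod_proxy, mod_proxy_ajp (or mod_proxy_http), mod_proxy_balancer
# and mod_lbmethod_bybusyness loaded.
<Proxy "balancer://tomcats">
    BalancerMember "ajp://tomcat1.example.com:8009"
    BalancerMember "ajp://tomcat2.example.com:8009"
    BalancerMember "ajp://tomcat3.example.com:8009"
    ProxySet lbmethod=bybusyness
</Proxy>
ProxyPass        "/shop" "balancer://tomcats/shop"
ProxyPassReverse "/shop" "balancer://tomcats/shop"
```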

  2. You can never entirely eliminate such issues with "monitoring", because the performance of the database and the volume of the request workload are not under the application's control. This is where "adaptive control in execution (ACE)" comes in: to profile, police, protect, prioritize, predict, and even provision.

    With regard to the load balancer, the issue here is that both "round robin" and "least busy" are not adaptive and do not predict when the current work in progress across all nodes will complete. We recently did a very interesting project for a big site on truly adaptive load balancing using our Simz technology, which simulates the entire work-in-progress queue in real time in a single JVM that the load balancer then uses for selection purposes.

  3. No need to be a fortune teller to know that William will comment on my post with a link to his own studies and products :-)

    So - besides linking to your product - do you also have a link to the case study that you are referring to?

  4. The probability of me not replying to yet another "have you seen our '90s path trace tree view screenshots" posting is inversely proportional to the probability that you and dynaCopy will ever have an original thought or invention beyond plagiarizing prior work or writings ;-)

  5. Touché

    This coming from a guy who was "JX(Wily)Inspired"!!

  6. you seem confused

    In fact it was Wily that copied from us... remember SQL Agent? Well, that was released after we released JDBInsight 1.0/1.5. Even the Wily sales engineers admitted they tracked our work. They never managed to get anywhere close to what JDBInsight delivered in terms of SQL transaction analysis... let me walk you down memory lane: (2003)

    Who was the first to do distributed CORBA/JMS path tracing? Well, I suspect it would be the guy who previously worked on this technology at Borland... I wonder who that guy was... hmmmm

  7. I like the first part of the article, where the problem is pinpointed to the data access layer and JDBC. And that's all GUI, no ad-hoc log analysis.

    And the conclusion seems obvious - the connection pool is under-provisioned for peak load. Very nice so far!

    But then, out of nowhere:

    "Now, it was not only the size of the pool that was the problem – but – several very inefficient database statements that took a long time to execute for some of the application’s business transactions."

    Ouch. This leaves me wondering how exactly the true root cause was identified and why no screenshots illustrate that.

    And in the end, how relevant was the load balancer to the issue?

    I'd prefer the article to elaborate on both points.

  8. Thanks for the feedback - let me try to add a little more color to this: the misconfigured load balancers caused unevenly distributed load. On top of that came a poorly sized connection pool and the inefficient execution of SQL statements (some very long running, some executed far too many times). The worker threads held connections for too long, which caused a resource shortage and ultimately led to the crash.

    As with any complex application, there are many things that contribute to performance and scalability. A system with inefficient code may still run fine until one of the architectural components reaches a point where it amplifies the inefficiencies of the other components. That's what happened in this case. So, while changing the load balancer to least-busy would probably have prevented the crashes, it would not have solved the problems in their app. Eventually, with enough load on the system, it would still have caused the site to crash.
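    To make that tipping point concrete, here is a toy queueing sketch (the numbers are invented, not from the customer's system): as long as each server receives fewer requests per second than it can complete, backlogs stay at zero; once a skewed balancer pushes any one server past its capacity, its queue - and the heap that backs it - grows without bound even though total cluster capacity looks sufficient.

```python
# Toy model (invented numbers): each app server completes CAPACITY
# requests per second; requests that cannot be served queue up, and every
# queued request pins memory until it is processed.
CAPACITY = 6.0   # req/s one server can finish (assumption)
SECONDS = 300    # five minutes of peak traffic

def queue_growth(arrival_rate, seconds=SECONDS):
    """Backlog after `seconds` on a server receiving `arrival_rate` req/s."""
    backlog = 0.0
    for _ in range(seconds):
        backlog = max(0.0, backlog + arrival_rate - CAPACITY)
    return backlog

# 30 req/s across 6 servers: evenly spread vs. skewed by a bad LB config.
even   = [5.0] * 6                          # everyone below capacity
skewed = [8.0, 8.0, 5.0, 3.0, 3.0, 3.0]     # same total, badly spread

print([round(queue_growth(r)) for r in even])    # [0, 0, 0, 0, 0, 0]
print([round(queue_growth(r)) for r in skewed])  # [600, 600, 0, 0, 0, 0]
```

    Same aggregate load in both cases; only the distribution differs, yet the skewed case leaves two servers with an ever-growing backlog.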

    Hope this makes a little more sense - if it does, I will make sure to update the original article as well.


  9. You don't need to dig too hard, or set up too complex a test, to diagnose inefficient SQL calls. Pretty much 90% of all performance issues with web (or other) systems that make heavy use of databases stem from such queries. The other 10% are usually memory leaks.

    Firstly, you don't need to look at the pool for this. Databases like Oracle will tell you how many inefficient queries there are. Very easy to spot.

    Apart from obviously writing efficient queries, the trick to limiting the impact on the rest of the app is:

    1) Know how many connections you need in total and make sure they are planned for and available.

    2) When using a pool (like c3p0), set some sort of "connection request timeout". Catch the resulting exception when the database becomes unavailable to one app server and serve a controlled error page. This also makes it obvious when the database is your issue.

    3) Don't just blindly use the database for everything. Cache in local memory where you can. Avoid storing too much in the session.

    4) Be very careful with indexes. I have seen apps written by reputable consulting companies - blue and red, western and eastern - that have had 1000% performance increases applied within five minutes of looking at their use of queries and indexes. In production ;)
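    Point 2 can be sketched language-agnostically; in c3p0 the corresponding knob is the checkoutTimeout property. The following Python sketch (class and names are hypothetical, for illustration only) bounds checkouts with a semaphore so that a saturated pool produces a catchable error instead of a hung request:

```python
import threading

class PoolTimeoutError(Exception):
    """Raised when no connection frees up within the timeout."""

class BoundedPool:
    """Toy connection-pool facade: at most `size` checkouts at once.

    Illustrates point 2: a bounded pool plus a checkout timeout turns
    "every thread hangs on the database" into a catchable error that can
    be mapped to a controlled error page.
    """
    def __init__(self, size, timeout):
        self._slots = threading.Semaphore(size)
        self._timeout = timeout

    def checkout(self):
        # Semaphore.acquire returns False when the timeout elapses.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolTimeoutError("no connection within %.1fs" % self._timeout)
        return object()  # stand-in for a real DB connection

    def checkin(self, conn):
        self._slots.release()

pool = BoundedPool(size=2, timeout=0.1)
a, b = pool.checkout(), pool.checkout()   # pool now exhausted
try:
    pool.checkout()                       # waits 0.1s, then times out...
except PoolTimeoutError:
    print("serve controlled error page")  # ...instead of hanging forever
```

    A real pool would of course also reuse and validate connections; the point here is only the bounded wait.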

    So... the proper title of this story is "don't let lazy programmers ruin your holiday business with crappy queries".

    Better still: don't let any developer implement a query unless it has been approved by an experienced DBA or has been subjected to a simple barrage test.

    Just sayin.



  10. I agree with all your database-related points. There are some additional ones, though:

    a) too many SQL statements executed per transaction, e.g. executing hundreds of SQL statements where the same work could be done with joins or more efficiently in a stored procedure

    b) requesting too much data, e.g. querying all the data in a table and filtering it in the application logic

    In my reply to another comment I tried to explain why - even though the inefficient DB access was the ultimate root cause - it took the load balancers' incorrect configuration to bring this application's architecture to its tipping point and crash it. Telling your devs to write proper SQL is one part of the story - but not all of it.
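    Pattern a) is the classic N+1 problem. A minimal sqlite3 sketch (the schema and data are invented purely for illustration) shows both access styles returning the same result with very different statement counts:

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items  (order_id INTEGER, product TEXT);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO items VALUES (1,'book'), (1,'pen'), (2,'mug'), (3,'lamp');
""")

def items_n_plus_one():
    """Pattern a): one statement per order -> 1 + N round trips at scale."""
    result = {}
    for (order_id,) in db.execute("SELECT id FROM orders"):
        rows = db.execute(
            "SELECT product FROM items WHERE order_id = ? ORDER BY rowid",
            (order_id,))
        result[order_id] = [product for (product,) in rows]
    return result

def items_joined():
    """The same answer from a single statement using a join."""
    result = {}
    for order_id, product in db.execute(
            "SELECT o.id, i.product FROM orders o "
            "JOIN items i ON i.order_id = o.id ORDER BY o.id, i.rowid"):
        result.setdefault(order_id, []).append(product)
    return result

assert items_n_plus_one() == items_joined()  # same data: 4 statements vs. 1
```

    With three orders the difference is four statements versus one; with hundreds of orders per transaction it becomes exactly the "hundreds of SQL statements" pattern described above.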