How do you update your app in production?


  1. How do you update your app in production? (12 messages)

    Hi guys! My name is Jevgeni Kabanov and I’m a computer scientist/hackerpreneur. Recently I’ve been trying to figure out what is happening in the world of Java EE production deployment, and frankly it seems pretty scary. After speaking to over 100 people, these are my hypotheses:

    1. Nobody uses redeployment in production (as in the actual button that does an in-server update). It just isn’t reliable enough, due to OutOfMemoryErrors and other failures.
    2. The common way to update an application is to:
      1. Take all servers down at 2am and hope no one is using it.
      2. Take servers down one at a time, upgrade them and either drop or migrate the user sessions.
      3. Use weird hacks like copying one file at a time.

    I’m also trying to find out how the update process happens, how hard it is, and what it costs in human measure (hours) and in soulless business measure (dollars).

    I ask you to help me out and provide me with some semi-solid data I can use to better understand what’s going on in reality. Hopefully you’ll prove me wrong. Afterwards, I’ll make all the data I’ve collected publicly available.

    The survey is completely anonymous, and the email will only be used to send you the results. Please fill it in here. Share it with your friends/colleagues if it's not too much trouble.

    Threaded Messages (12)

  2. Hello

    If you are not confident in "redeploy" features, you can script the stop/undeploy/deploy/start sequence with a Python script and the specific vendor API (Oracle WebLogic / IBM WebSphere).

    The process then becomes more reliable and time-efficient, even if full downtime is still required.
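    The same idea can be sketched in Java: run each admin step in order and abort on the first failure, so a broken deploy never leaves a server half-started. The `admin-cli` command and its subcommands below are hypothetical placeholders; substitute your vendor's real tooling (WLST for WebLogic, wsadmin for WebSphere).

```java
import java.util.List;

// Sketch of scripting the stop/undeploy/deploy/start cycle.
// "admin-cli" and its subcommands are hypothetical placeholders for
// your vendor's real admin commands.
public class RedeployScript {

    // Build the ordered command sequence for one server.
    static List<List<String>> commands(String server, String appArchive) {
        return List.of(
            List.of("admin-cli", "stop-server", server),
            List.of("admin-cli", "undeploy", "--server", server, "myapp"),
            List.of("admin-cli", "deploy", "--server", server, appArchive),
            List.of("admin-cli", "start-server", server));
    }

    // Execute each step, aborting on the first non-zero exit code so a
    // failed undeploy is never followed by a deploy onto a dirty server.
    static void run(String server, String appArchive) throws Exception {
        for (List<String> cmd : commands(server, appArchive)) {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("step failed: " + cmd);
            }
        }
    }

    public static void main(String[] args) {
        commands("server1", "myapp-2.0.ear")
            .forEach(c -> System.out.println(String.join(" ", c)));
    }
}
```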

  3. Jevgeni,

    OOME in repeated redeployment is typically due to people not shutting down services in their code during a redeploy, using an ApplicationLifecycleListener. A common culprit we see in the field is Hibernate SessionFactories. These cause both PermGen leaks and general memory leaks. There used to be a JDK bug in the 1.4.x JVMs which would also cause PermGen leakage, but I think those are all solved now. Coders have to be very careful to remove everything they set up during app initialisation when the app is redeployed, including MBeans etc. This will prevent OOME.
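    The MBean case above can be sketched as follows. The servlet API is left out so the example runs standalone; in a real app, `init()` corresponds to `contextInitialized()` and `destroy()` to `contextDestroyed()` in a listener. The `AppCache` MBean and its ObjectName are illustrative.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Sketch of the cleanup a lifecycle listener should perform: everything
// registered at startup is undone at shutdown, otherwise the old
// deployment's classloader stays reachable after a redeploy.
public class AppLifecycle {

    public interface AppCacheMBean { int getSize(); }
    public static class AppCache implements AppCacheMBean {
        public int getSize() { return 0; }
    }

    private ObjectName name;

    public void init() throws Exception {
        name = new ObjectName("example.app:type=AppCache");
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(
            new StandardMBean(new AppCache(), AppCacheMBean.class), name);
    }

    public void destroy() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        if (name != null && server.isRegistered(name)) {
            // Without this, the MBeanServer pins our classes forever.
            server.unregisterMBean(name);
        }
        // Likewise here: sessionFactory.close(), scheduler.shutdown(),
        // JDBC driver deregistration, stopping app-started threads, ...
    }
}
```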

    The danger with the rolling restart option you mentioned is that all objects in the session need to be serialisation-compatible across versions, otherwise you can get serialisation errors while replicating session objects due to incompatible class definitions. Alternatively you drop the user sessions, which IMHO is a cop-out: why give users pain when you don't need to?
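    A minimal sketch of that compatibility issue, using a hypothetical `Cart` session attribute: pinning `serialVersionUID` and making only compatible changes (such as adding a field) lets the new version deserialise sessions written by the old one, whereas relying on the JVM's computed UID makes replication fail with `InvalidClassException` as soon as the class changes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical session attribute illustrating serialisation
// compatibility across a rolling upgrade.
public class Cart implements Serializable {
    private static final long serialVersionUID = 1L;  // pinned across versions

    private String userId;
    // Imagine this field was added in "v2": deserialising a v1 stream
    // simply leaves it at the default value instead of failing.
    private int itemCount;

    public Cart(String userId, int itemCount) {
        this.userId = userId;
        this.itemCount = itemCount;
    }
    public String userId() { return userId; }
    public int itemCount() { return itemCount; }

    // Round-trip helper standing in for session replication.
    public static Cart copyViaSerialization(Cart c) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(c);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Cart) in.readObject();
        }
    }
}
```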

    If you design your application server clusters in a certain way it is possible to do a more sophisticated rolling restart that does not cause migration of sessions.

    You design your cluster so that you can split it in half when you need to upgrade the application; then:

    1. Change your load balancer to redirect all new sessions to the remaining half of the cluster.
    2. Drain the sessions on the disabled half.
    3. Shut down the disabled half and upgrade the application.
    4. Restart the disabled half and re-enable it for new sessions in the load balancer.
    5. Repeat for the remaining half of the cluster.

    Depending on your cluster design this does not have to have a big impact on performance.
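    The drain step is the only part that needs a little code: poll until the disabled half has no active sessions, or give up after a timeout and migrate or drop the stragglers. How you obtain the session count is deployment-specific (JMX, a load-balancer API), so it is abstracted here as an `IntSupplier`; this is a sketch, not any vendor's API.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

// Sketch of the "drain" step of a rolling upgrade.
public class Drain {
    // Returns true if the half emptied in time, false on timeout.
    static boolean drain(IntSupplier activeSessions, Duration timeout,
                         Duration pollInterval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (activeSessions.getAsInt() == 0) {
                return true;   // half is empty, safe to shut down
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return false;          // timed out with sessions remaining
    }
}
```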

    Another alternative is to use side-by-side deployment, which is supported by WebLogic.

    Steve Millidge

  4. OOME in repeated redeployment is typically due to people not shutting down services in their code during a redeploy using an ApplicationLifecycleListener. 

    Thanks for the awesome answer, Steve, but I'm not buying this part. I worked as the R&D lead in a large consultancy for several years, and traced OOME in different apps to a variety of issues, including JDK, server, framework and application causes. In the end I understood that it's the result of a fundamental limitation of classloaders, which do not provide a proper isolation model. I wrote it up in the article "How do class loader leaks happen?", as well as in the talk "Do you really get class loaders?", available as video on the same page. Unfortunately it is almost never possible to fix all those issues, and even if you do, they can come back at any time.
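    One concrete instance of such a leak, which no amount of listener cleanup in application code reliably catches, is a value parked in a `ThreadLocal` of a long-lived, server-owned pool thread. A minimal standalone sketch (a `byte[]` standing in for an object whose class, in a real container, would belong to the webapp classloader, so the whole classloader would be pinned with it):

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of one classloader-leak mechanism: a value stored in a
// ThreadLocal by "application code" running on a server thread stays
// strongly reachable after the application drops its own references.
public class ThreadLocalLeak {
    static final ThreadLocal<Object> PER_THREAD_CACHE = new ThreadLocal<>();

    public static boolean demonstrate() throws Exception {
        ExecutorService serverPool = Executors.newFixedThreadPool(1);
        final Object[] box = { new byte[1024] };  // stand-in for an app object
        WeakReference<Object> ref = new WeakReference<>(box[0]);

        // "Application code" on a server thread parks the object.
        serverPool.submit(() -> PER_THREAD_CACHE.set(box[0])).get();

        box[0] = null;   // the app is "undeployed": it forgets the object
        System.gc();
        // ...but the pool thread's ThreadLocal map still holds it.
        boolean stillReachable = ref.get() != null;

        serverPool.shutdownNow();  // only killing the thread releases it
        return stillReachable;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pinned after undeploy: " + demonstrate());
        // prints: pinned after undeploy: true
    }
}
```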

  5. way cheaper

    Yes, and that's why it's usually done with rolling upgrades instead of redeployment. It's very expensive to develop your apps and continuously try to hunt down possible leaks, if that's even possible. It's much easier and cheaper to simply write some scripts that do a small-impact rolling upgrade. Even large companies have budgets these days...


  6. Hot Redeploy

    People don't use [hot|re]deployment in production not because of memory issues, but because operations has important best practices to apply.

    Even if you can isolate classes from each other, you cannot do the same for the external side effects (of multiple & different versions). This last item seems foreign to many OSGi disciples, who apparently do not understand one of the main reasons why operations stop and restart processes when applying a managed change: to be sure the process can indeed be restarted when something goes terribly amiss in production outside of the container (hardware). It is much easier to roll back at the time the change is applied.

  7. Hot Redeploy

    People don't use [hot|re]deployment in production not because of memory issues, but because operations has important best practices to apply.

    A better point, and valid for some apps, some orgs and some updates, but not across the board. E.g. if I know for sure that an update only touches Java code, why should I do the whole rolling upgrade? It's quite possible to differentiate assets and updates and choose the strategy accordingly, with a cold restart always available as a backup option.

  8. Hot Redeploy

    The point is that operations need to be sure that the starting (or restarting) of a change has been tested in production. Imagine finding out, in the middle of a panic situation, that you can't start an application after 10 or more independent changes (code or not), and being forced to try rolling back each one. I don't know any organization that is willing to accept this, other than reckless or inexperienced startups who probably had a wake-up call last week.

    A hot deploy can't be realistically tested unless one keeps a running instance of the application (and all its instances) in a test environment that is kept completely in sync with every change that is rolled out in production, including restarts. When you have managed to solve this problem, then maybe hot deploy (even for the smallest of changes, which btw is not necessarily indicative of its impact) might be considered.

    I am not saying hot deploy is not done in the wild - just not at mature, well-managed IT departments with some experience of change management as well as incident & problem management, which sadly go hand in hand.

  9. Live reload in production

    You may find it informative to look through the discussion that's going on right now on the Tapestry list:

  10. we use redeploy

    We have been using Java/J2EE on Unix and WebSphere since 2002, as the main core-banking platform in a bank with 500+ branches and 10,000+ users.

    We have an in-house framework which includes custom classloaders, so all business-component deploys are done via hot deployment; server restarts only occur when the framework itself is redeployed.

    Restarts do not introduce downtime [clustered architecture], so we do not wait until 2am.

    No downtime at end-of-day processing either. A true 7x24 system.
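    The core trick behind a hot-deployment framework like the one described is giving each deployment a fresh, throwaway classloader, so new bytecode is picked up without restarting the JVM. A standalone sketch of just that mechanism, with nothing from the poster's actual framework: runtime compilation here merely stands in for a new build landing in the deploy directory, and the `Greeter` class is illustrative.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.ToolProvider;

// Sketch of hot deployment via throwaway classloaders: each "deploy"
// gets a fresh URLClassLoader over the classes directory, so the next
// invocation sees the new bytecode.
public class HotDeploy {

    // Simulate a deploy: write and compile a new version of Greeter.
    static void deploy(Path dir, String version) throws Exception {
        Path src = dir.resolve("Greeter.java");
        Files.writeString(src, "public class Greeter {"
                + " public String version() { return \"" + version + "\"; } }");
        int rc = ToolProvider.getSystemJavaCompiler()
                .run(null, null, null, src.toString());
        if (rc != 0) throw new IllegalStateException("compile failed");
    }

    // Load Greeter through a brand-new loader and call it reflectively.
    static String invoke(Path classesDir) throws Exception {
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { classesDir.toUri().toURL() },
                ClassLoader.getSystemClassLoader())) {
            Class<?> c = loader.loadClass("Greeter");
            Object service = c.getDeclaredConstructor().newInstance();
            return (String) c.getMethod("version").invoke(service);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("hotdeploy");
        deploy(dir, "v1");
        System.out.println(invoke(dir));  // v1
        deploy(dir, "v2");                // redeploy: new bytes on disk...
        System.out.println(invoke(dir));  // ...and a new loader sees v2
    }
}
```

Discarding the old loader after each deploy is exactly where the classloader-leak discussion above bites: if anything still references the old loader's classes, it cannot be collected.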


  11. we use redeploy

    Users (real or virtual) observe and measure downtime. You can start, stop and restart JVMs willy-nilly as long as it is not observable. By your logic, memory-to-file swapping as done by the OS would be treated as downtime in some form. It's best to start rethinking this, because in the cloud, VM/JVM instances will come and go. Eventually the concept of a process instance will become redundant in terms of service mgmt & capacity planning.

  12. One reason for not relying on hot deployment, or even a "warm" rollout based on separated strings or clusters, is dependencies on simultaneous external changes, e.g. database schemas, stored procedures etc. In this case you need to be sure that database access happens from, and only from, the updated code, since the previous version will err after the database is updated.

  13. My real-world experience (in a medium-sized production environment with +/- 20 WebSphere instances) is that the majority of the J2EE applications we have actually cause class loader leaks. Investigation has shown that this is caused by a mix of different issues:

    • Incomplete cleanup in application code (as mentioned by Steve Millidge).
    • Bugs in the JRE or the application server. E.g. our WebSphere environment is (or has been) affected by at least the following APARs: IZ67457, PK83186, PM04639, PM18729 and PM21638. In addition, we currently have 3 open PMRs about other class loader leaks caused by bugs in the application server.
    • Bugs in third-party libraries. E.g. we have several applications that still use Axis 1.4. All of them are affected by the bug described in AXIS-2674.

    On the other hand, that doesn't mean that we actually see OOM errors in production. The reason is that the number of redeployments in a production environment is not high enough, and that in many cases, a redeployment of an application comes with a configuration change that requires a server restart anyway. However, we do see OOM errors presumably caused by these class loader leaks in our development environments and sometimes in acceptance.

    One of the difficulties with class loader leaks caused by application redeployments (or restarts) is the lack of tools to detect them early. Most people only discover that their applications (or application servers) have this issue when they actually start seeing OOM errors, and it's not easy to determine where they come from. The only exception is Apache Tomcat, which has a built-in feature that detects some (but not all) class loader leak issues. To have this capability in our environment, I started writing a tool [1] that uses the same approach as Tomcat, but detects a larger set of class loader leaks and runs on application servers such as WebSphere.
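    The detection approach described can be sketched in plain JDK code: keep a `WeakReference` to each stopped deployment's classloader, request a GC, and report any loader that is still reachable as leaked. This is a sketch of the general technique, not of the poster's tool or Tomcat's implementation; `URLClassLoader` stands in for a container's webapp classloader, and `System.gc()` is only a hint, hence the retries.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Sketch of a classloader-leak detector: weakly track stopped
// deployments' loaders and see which ones survive garbage collection.
public class LeakDetector {
    private final List<WeakReference<ClassLoader>> stopped = new ArrayList<>();

    public void onUndeploy(ClassLoader webappLoader) {
        stopped.add(new WeakReference<>(webappLoader));
    }

    // Best effort: System.gc() is only a hint, so retry a few times.
    public int leakedCount() throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(50);
        }
        int leaked = 0;
        for (WeakReference<ClassLoader> ref : stopped) {
            if (ref.get() != null) {
                leaked++;   // still reachable after GC: something pins it
            }
        }
        return leaked;
    }

    public static void main(String[] args) throws Exception {
        LeakDetector detector = new LeakDetector();
        ClassLoader app = new URLClassLoader(new URL[0]);
        List<Object> containerRegistry = new ArrayList<>();
        containerRegistry.add(app);   // the leak: a strong ref survives undeploy
        detector.onUndeploy(app);
        app = null;
        System.out.println("leaked loaders: " + detector.leakedCount());
    }
}
```

A real tool must go further and report *what* pins the loader (static fields, threads, ThreadLocals, JDBC drivers), which is the hard part.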