Detecting hung threads in Java with call execution stack marking and tagging


  1. The JVM (and probably many other languages and runtimes as well) sorely lacks three very useful serviceability features with regard to call execution stacks:

    • ids: a stack frame is given a unique (within the thread context) identifier prior to being pushed onto the call execution stack
    • marking: recording the number of mark operations performed since a frame was pushed onto the call execution stack
    • tagging: a stack frame takes a tag value, set at either the thread level or the process level, prior to being pushed onto the call execution stack

    With identifiers, marking and tagging, the job of detecting hung or delayed threads (deadlocked, busy or spinning) is made far easier, because it is possible to determine whether a call stack trace that repeats across multiple dumps is indeed the same execution call path instance or an entirely new execution (for the same thread). Without them, detection is made all the more problematic by thread pools and workers (code blocks) that follow the same flow but operate on different data types that are not discernible from the code itself.

    Fortunately it is possible to enrich a Java runtime with these capabilities while we await JDK 99 ;-)
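As a rough illustration of the three features, here is a minimal sketch of a per-thread "shadow" stack maintained by instrumented code. Everything in it is an assumption made for illustration: `ShadowStack`, `FrameInfo`, `push`, `mark` and `dump` are hypothetical names, not JDK APIs, and a real agent would inject the push/pop calls through bytecode instrumentation rather than by hand. The idea is that whatever takes the dumps calls `mark()` after each one, so a frame whose mark count is above zero in a later dump has survived at least one earlier dump.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical shadow call stack for one thread (not a JDK class). */
public class ShadowStack {

    /** What a dump reports for a single frame. */
    public static final class FrameInfo {
        public final long id;      // unique within this thread's stack
        public final String name;  // label of the instrumented call
        public final String tag;   // tag in effect when the frame was pushed
        public final long marks;   // mark operations seen since the push

        FrameInfo(long id, String name, String tag, long marks) {
            this.id = id;
            this.name = name;
            this.tag = tag;
            this.marks = marks;
        }
    }

    /** Internal frame record; stores the mark counter value at push time. */
    private static final class Frame {
        final long id;
        final String name;
        final String tag;
        final long markAtPush;

        Frame(long id, String name, String tag, long markAtPush) {
            this.id = id;
            this.name = name;
            this.tag = tag;
            this.markAtPush = markAtPush;
        }
    }

    private final Deque<Frame> frames = new ArrayDeque<>();
    private long nextId = 1;
    private long markCount = 0;
    private String currentTag = "";

    /** Tagging: frames pushed after this call carry the given tag. */
    public synchronized void setTag(String tag) {
        this.currentTag = tag;
    }

    /** Ids: assign a fresh identifier before the frame goes on the stack. */
    public synchronized long push(String name) {
        long id = nextId++;
        frames.push(new Frame(id, name, currentTag, markCount));
        return id;
    }

    public synchronized void pop() {
        frames.pop();
    }

    /** Marking: bump the counter, typically right after taking a dump. */
    public synchronized void mark() {
        markCount++;
    }

    /** Returns the frames from the top of the stack down. */
    public synchronized List<FrameInfo> dump() {
        List<FrameInfo> out = new ArrayList<>();
        for (Frame f : frames) {
            out.add(new FrameInfo(f.id, f.name, f.tag, markCount - f.markAtPush));
        }
        return out;
    }

    /** Two dump entries describe the same execution only if the ids match. */
    public static boolean sameExecution(FrameInfo a, FrameInfo b) {
        return a.id == b.id;
    }
}
```

With this in place, two dumps that show the same method at the top of the stack can be told apart: matching ids (and a non-zero mark count in the second dump) point to one long-running or stuck call, while differing ids mean the thread finished one call and started another that merely looks the same. The tag, set for example to a request or work item identifier, then tells you which unit of work the frame belongs to.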

  2. But how do you think it is possible to change the JVM options in production, and how do we convince the client to install a third-party tool in production?

    Any ideas or thoughts?

  3. I could give you hundreds of reasons why a customer should have a monitoring agent, especially one that intelligently manages measurement overhead and automatically determines what is truly relevant, but at the end of the day most customers will not listen until they have a service outage (availability) or a performance degradation and then realize they are completely clueless as to how and why it happened and when it is likely to happen again. Organizations, just like humans, have to fail first before they perceive the risk in not taking appropriate measures and actions.

    Leaving monitoring aside, I think it is becoming critically important that all software have built-in self-observation, self-learning and self-regulation. Many aspects of the Java HotSpot runtime must now move up into the application stack to better protect and police resource usage in the context of the user, app and workflow. The OpenCore metering runtime is a realization of this future direction for many apps, runtimes, platforms and stacks.

    Anyway, Java should have this built in, along with many other research items I am working on, including signaling as an alternative approach to exception management (throwing, catching, logging, ...).