Two general techniques are used to perform anomaly detection in software systems. The first is based on time series analysis of sampled measures (metrics) and is generally done offline (or online, but over data sufficiently far in the past). The second is event based: it compares one or more event-specific measurements (clock, CPU, …) with predefined or dynamic thresholds, and is generally performed at the point of the event's occurrence (in time and space).
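As a rough illustration of the event-based technique, the following Python sketch measures a single operation and compares its duration with a threshold at the moment the operation completes. The names `ThresholdDetector` and `run_checked` are hypothetical. The "mean plus k standard deviations over a recent window" rule is only one assumed way of making the threshold dynamic, and it is not taken from any particular product.

```python
import statistics
import time
from collections import deque


class ThresholdDetector:
    """Flags an event whose measured duration exceeds a threshold (hypothetical sketch)."""

    def __init__(self, fixed_threshold_s, window=50, k=3.0, min_samples=10):
        self.fixed_threshold_s = fixed_threshold_s  # predefined threshold
        self.recent = deque(maxlen=window)          # most recent durations
        self.k = k
        self.min_samples = min_samples

    def threshold(self):
        # Assumed dynamic rule: mean + k * standard deviation of recent
        # durations, used only once enough history has been collected.
        if len(self.recent) >= self.min_samples:
            mean = statistics.mean(self.recent)
            return mean + self.k * statistics.pstdev(self.recent)
        return self.fixed_threshold_s

    def observe(self, duration_s):
        # Compare against the current threshold, then record the duration.
        anomalous = duration_s > self.threshold()
        self.recent.append(duration_s)
        return anomalous


def run_checked(detector, operation):
    # The check happens where and when the event occurs: right after the call.
    start = time.perf_counter()
    result = operation()
    return result, detector.observe(time.perf_counter() - start)
```

A caller would wrap each request or operation in `run_checked` and react (log, sample, alert) when the second return value is true.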

In the context of event-based analysis, a number of approaches have been used to move on from detection through to root cause analysis. One approach, used by solutions that measure code performance largely through call stack sampling, is to have each thread register itself for observation with a supervisory thread at the beginning of a request. This supervisory thread then checks on the progress of the registered threads every so often (at an interval measured in milliseconds). When the supervisory thread detects that a thread has passed the time threshold (which may be predefined or dynamic) for a particular request or operation, it begins sampling the call stack of the request thread at regular fixed intervals (again in milliseconds) until the thread eventually completes and unregisters itself from further observation until the next request.
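The supervisory-thread approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any specific tool. The `Supervisor` class, its method names and the `slow_request` function are hypothetical. For brevity the sketch uses a single interval for both the progress check and the stack sampling. It relies on CPython's `sys._current_frames()` to read another thread's call stack.

```python
import sys
import threading
import time
import traceback


class Supervisor:
    """Watches registered request threads and samples the slow ones (sketch)."""

    def __init__(self, threshold_s=0.05, check_interval_s=0.01):
        self.threshold_s = threshold_s            # predefined time threshold
        self.check_interval_s = check_interval_s  # check and sampling interval
        self._lock = threading.Lock()
        self._started = {}   # thread id -> request start time
        self.samples = {}    # thread id -> list of sampled call stacks
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def register(self):
        # Called by a request thread when it begins a request.
        with self._lock:
            self._started[threading.get_ident()] = time.monotonic()

    def unregister(self):
        # Called by the same thread when the request completes.
        with self._lock:
            self._started.pop(threading.get_ident(), None)

    def _run(self):
        # Wake up at each interval until stop() is called.
        while not self._stop.wait(self.check_interval_s):
            now = time.monotonic()
            frames = sys._current_frames()  # CPython: thread id -> top frame
            with self._lock:
                for tid, began in self._started.items():
                    frame = frames.get(tid)
                    if frame is not None and now - began > self.threshold_s:
                        # Past the threshold: record the current call stack
                        # as a list of function names, outermost first.
                        stack = [f.name for f in traceback.extract_stack(frame)]
                        self.samples.setdefault(tid, []).append(stack)


def slow_request(supervisor):
    supervisor.register()
    try:
        time.sleep(0.3)  # stands in for slow request processing
    finally:
        supervisor.unregister()
```

A request that finishes before the threshold is never sampled, so the cost of stack collection is paid only for the requests that turn out to be slow. The collected stacks for a slow thread can then be aggregated to see where the time was spent.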

http://www.jinspired.com/site/from-anomaly-detection-to-root-cause-analysis-via-self-observation