Is sampling data as good as capturing all transactions in detail? At first glance, having all transactions is better than having just some of them – but let’s dig a little deeper.

If we’re not capturing all transactions, how do we select those we do capture? Or, to be more precise, how do we select the transactions to be captured with full details vs. those with just high-level information?

Sampling

One way of selecting the transactions you follow in depth is to sample, meaning you capture only a certain percentage of transactions, chosen either randomly or at a fixed interval. For example, you could choose to follow only every 50th transaction, resulting in a sample rate of 2%. While this reduces application overhead and the load on your monitoring solution, how can you be sure the sample accurately represents your system? What if the slow request or error a user complained about is in the 98% you didn’t monitor?
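To make the 2% example concrete, here is a minimal sketch of both flavors of sampling. The function names and the use of plain integers as stand-ins for transactions are assumptions made for illustration, not part of any particular APM product.

```python
import random


def sample_every_nth(transactions, n=50):
    """Fixed-interval sampling: keep every n-th transaction (rate = 1/n)."""
    return [tx for i, tx in enumerate(transactions, start=1) if i % n == 0]


def sample_randomly(transactions, rate=0.02, seed=None):
    """Random sampling: keep each transaction with probability `rate`."""
    rng = random.Random(seed)
    return [tx for tx in transactions if rng.random() < rate]


# Hypothetical data: transaction IDs 1..1000 stand in for real transactions.
transactions = list(range(1, 1001))
captured = sample_every_nth(transactions, n=50)

print(len(captured) / len(transactions))  # 0.02

# Suppose a user complained about transaction 137: it was never captured.
print(137 in captured)  # False
```

Either variant keeps roughly 2% of the traffic, and neither has any way of knowing which transactions a user will later ask about.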

Errors

So what if we also capture those transactions that had errors? For web applications we can use HTTP error codes to determine failures (e.g. 404 or 500 errors). For other types of applications, warning/error log messages and exceptions are good indicators. If your APM solution includes end-user monitoring, you could even add client-side JavaScript errors. But this still doesn’t ensure we get details for the slow transactions – we only have random sampling and errors.
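As a rough sketch, such an error check could look like the following. The `Transaction` record and its fields are hypothetical; a real agent would expose this information in its own way.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical record of what an agent might know about a transaction."""
    status_code: int = 200          # HTTP status, for web applications
    log_levels: tuple = ()          # levels of log messages written, e.g. ("INFO", "ERROR")
    raised_exception: bool = False  # whether an exception was observed


def has_error(tx):
    """Return True if the transaction shows any of the error indicators."""
    if tx.status_code >= 400:       # 4xx and 5xx, e.g. 404 or 500
        return True
    if tx.raised_exception:
        return True
    return any(level in ("WARNING", "ERROR") for level in tx.log_levels)


print(has_error(Transaction(status_code=404)))            # True
print(has_error(Transaction(log_levels=("INFO",))))       # False
print(has_error(Transaction(log_levels=("ERROR",))))      # True
```

Note that a transaction that takes far too long but returns a 200 with no warnings passes this check untouched.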

Slow Transactions

Since application performance is a key driver for any APM solution, picking those transactions with bad performance – meaning they are slow – is important. But what threshold do you use for “slow”? You could pick an arbitrary value, but which one? Especially for newly deployed applications or applications with varying load patterns, selecting such a value can be tricky.
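Put together, the three criteria so far could be combined roughly as in the sketch below. The function name, its parameters, and especially the 2000 ms default are assumptions chosen for illustration; the default is exactly the kind of arbitrary value in question.

```python
def should_capture(index, status_code, duration_ms,
                   sample_every=50, slow_threshold_ms=2000.0):
    """Decide whether a transaction gets full detail.

    index             -- running count of the transaction (1, 2, 3, ...)
    status_code       -- HTTP status code of the response
    duration_ms       -- response time in milliseconds
    sample_every      -- capture every n-th transaction regardless of outcome
    slow_threshold_ms -- arbitrary, hand-picked cut-off for "slow"
    """
    sampled = index % sample_every == 0
    failed = status_code >= 400
    slow = duration_ms >= slow_threshold_ms
    return sampled or failed or slow


print(should_capture(index=50, status_code=200, duration_ms=120))   # True  (sampled)
print(should_capture(index=51, status_code=500, duration_ms=120))   # True  (error)
print(should_capture(index=52, status_code=200, duration_ms=3500))  # True  (slow)
print(should_capture(index=53, status_code=200, duration_ms=1900))  # False (just under the cut-off)
```

The last call shows the weakness of a fixed value: whether 1900 ms is acceptable depends entirely on the application and its current load.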
You could use statistical measures to have the system baseline the slow threshold – e.g. using standard deviation. Those measures work best if the data follows a Gaussian distribution (also known as a normal distribution) – which response times rarely do. Below is a snapshot of production traffic (around 6000 individual requests) showing a more commonly seen distribution pattern:
