Is sampled data as good as capturing all transactions in detail? At first glance, having all transactions seems better than having just some of them – but let’s dig a little deeper.
If we’re not capturing all transactions, how do we select those we do capture? Or, to be more precise, how do we decide which transactions are captured with full details and which with just high-level information?
One way of selecting the transactions you follow in depth is to sample, meaning you randomly select a certain percentage of transactions. For example, you could choose to follow only every 50th transaction, resulting in a sample rate of 2%. While this reduces application overhead and the load on your monitoring solution, how can you be sure the sample accurately represents your system? What if the slow request or error a user complained about is in the 98% you didn’t monitor?
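As a minimal sketch of what such a sampling decision looks like (the 2% rate and the names here are illustrative assumptions, not taken from any particular APM agent):

```python
import random

SAMPLE_RATE = 0.02  # roughly every 50th transaction gets full-detail capture


def capture_in_detail(transaction) -> bool:
    """Randomly decide whether this transaction is captured with full
    details; all other transactions only get high-level information."""
    # Note: the decision is independent of whether the transaction is
    # slow or failing - a problematic request has the same 2% chance of
    # being captured as a perfectly healthy one.
    return random.random() < SAMPLE_RATE
```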
Since application performance is a key driver for any APM solution, picking the transactions with bad performance – meaning the slow ones – is important. But what threshold do you use for “slow”? You could pick an arbitrary value, but which one? Especially for newly deployed applications, or applications with varying load patterns, selecting such a value can be tricky.
You could use statistical measures to have the system baseline the “slow” threshold – e.g. using the standard deviation. Those measures work best if the data follows a Gaussian distribution (also known as a normal distribution), which response times rarely do.
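As a rough sketch of what such a baseline could look like (the mean-plus-two-standard-deviations rule and the function name are illustrative assumptions, not the formula of any specific monitoring product):

```python
import statistics


def baseline_slow_threshold(response_times_ms, num_stddevs=2):
    """Derive a 'slow' threshold as mean + k * standard deviation.

    This characterizes the data well only if response times are roughly
    normally distributed. With a long right tail - a handful of very slow
    outliers - both the mean and the standard deviation get inflated, and
    the resulting threshold no longer reflects typical behavior.
    """
    mean = statistics.mean(response_times_ms)
    stddev = statistics.stdev(response_times_ms)
    return mean + num_stddevs * stddev
```

Below is a snapshot of production traffic (around 6000 individual requests) showing a more commonly seen distribution pattern: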