Java Development News:

Troubleshooting slow response times when activity, data volumes spike

By Maxine Giza

25 Sep 2013 | TheServerSide.com

Effectively measuring and managing computer system latency reduces cost and time for organizations. Unfortunately, troubleshooting slow response times when activity and data volumes spike is tricky, requiring careful analysis of critical latency metrics, according to Gil Tene, chief technology officer and co-founder of computer appliance manufacturer Azul Systems and speaker at Oracle OpenWorld 2013. During his session, "How not to measure latency," Tene will give advice on response time testing and building latency behavior requirements.

Simply put, as load increases, so does computer system latency. If a system is given more work than it can handle, it will slow down. Failing to use appropriate metrics is just one common pitfall Tene said he's seen people succumb to. Fortunately, there are steps people can take, such as creating a clear set of requirements, to contain risk and enable good application behavior.

Developers must use care when viewing data derived under test conditions from load generators. For example, a single response-time measurement can suggest a very different conclusion than the full set of response times recorded over a test period. All too often, people assume that response time is simply a function of load, according to Tene.

Not reporting critical latency measurements, such as max latencies, is another mistake Tene has seen software professionals make in handling spikes in response times. This causes problems because software latencies are usually strongly multimodal. Software latencies "cannot be modeled or summarized with a mean and standard deviation without losing sight of the most critical latency behaviors: the bad ones," he said.
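
The article makes this point in prose only; to make it concrete, here is a small, self-contained Java sketch using synthetic timings (the numbers and the two-mode distribution are invented for illustration, not taken from Tene's talk). It summarizes the same sample set two ways and shows how a mean and standard deviation hide the slow mode that the maximum and high percentiles expose.

    import java.util.Arrays;
    import java.util.Random;

    // Illustrative only: most requests take ~2 ms, but roughly 1 in 100
    // stalls for ~200 ms (think of a garbage-collection pause).
    public class LatencySummary {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            long[] latencyMicros = new long[100_000];
            for (int i = 0; i < latencyMicros.length; i++) {
                latencyMicros[i] = rnd.nextInt(100) == 0
                        ? 200_000 + rnd.nextInt(20_000)   // slow mode, ~200 ms
                        : 2_000 + rnd.nextInt(500);       // fast mode, ~2 ms
            }

            double mean = Arrays.stream(latencyMicros).average().orElse(0);
            double variance = Arrays.stream(latencyMicros)
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);

            long[] sorted = latencyMicros.clone();
            Arrays.sort(sorted);
            long p999 = sorted[(int) (sorted.length * 0.999) - 1];
            long max = sorted[sorted.length - 1];

            // The mean suggests "a few milliseconds"; the tail metrics reveal
            // the ~200 ms behavior that dominates the worst user experience.
            System.out.printf("mean=%.0f us, stddev=%.0f us%n", mean, stdDev);
            System.out.printf("p99.9=%d us, max=%d us%n", p999, max);
        }
    }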

Latency does not live in a vacuum, and neither does throughput.

Gil Tene,
chief technology officer, Azul Systems

What Tene refers to as coordinated omission, a type of measurement error, is another common misstep, he said. One way this happens is by running a load test and generating charts from the results log; that approach only works, Tene said, if response times stay short enough that a slow response never delays the next request.

"Nearly all commonly used load generators will exhibit this behavior in the real world if test threads are asked to issue requests more frequently than once every 20-plus seconds, which means the way most people use load generators leads to significant coordinated omission in results and to bad reporting," Tene said.

A simple way to avoid mistakes is to create a clear set of requirements for latency behavior. "When you don't know what your requirements are, you probably don't know what you should be measuring and how those measurements should be done," Tene said. "Establishing latency behavior requirements and a pass/fail criteria ahead of spending time and effort on measurement will usually help focus testing in the right direction."
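
As a concrete (and hypothetical) illustration of what such requirements might look like, the short Java sketch below expresses a few latency limits as explicit percentile thresholds and turns a measured run into a pass/fail report. The limit values and the sample data are invented for the example.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: check measured latencies against explicit,
    // agreed-upon percentile limits instead of a single average.
    public class LatencyRequirements {
        public static void main(String[] args) {
            // Hypothetical requirements, in milliseconds.
            Map<Double, Long> limitMs = new LinkedHashMap<>();
            limitMs.put(50.0, 10L);    // median under 10 ms
            limitMs.put(99.0, 100L);   // 99th percentile under 100 ms
            limitMs.put(100.0, 500L);  // worst case under 500 ms

            // Placeholder sample; in practice this comes from the load test log.
            long[] measuredMs = { 2, 3, 3, 4, 4, 5, 6, 7, 9, 12, 18, 25, 40, 90, 450 };
            Arrays.sort(measuredMs);

            boolean pass = true;
            for (Map.Entry<Double, Long> e : limitMs.entrySet()) {
                long observed = percentile(measuredMs, e.getKey());
                boolean ok = observed <= e.getValue();
                pass &= ok;
                System.out.printf("p%-5s <= %3d ms : observed %3d ms  %s%n",
                        e.getKey(), e.getValue(), observed, ok ? "PASS" : "FAIL");
            }
            System.out.println(pass ? "Overall: PASS" : "Overall: FAIL");
        }

        static long percentile(long[] sorted, double pct) {
            int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
            return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
        }
    }

With limits like these agreed on before testing begins, a run either meets the required latency behavior or it doesn't, which is the kind of focus Tene recommends.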

Measuring throughput without latency is, generally speaking, useless. "This usually results in the poor practice of projecting latency behavior from saturated system tests or throughput capabilities from tests that don't measure latencies," he said. "Latency does not live in a vacuum, and neither does throughput."