New distributed tracing API completes the feedback loop

As apps move to the cloud and software no longer has a fixed address, tracing and troubleshooting become harder. However, a new distributed tracing API promises to simplify things.

Modern distributed systems are built on top of a rich set of microservices and cloud APIs in order to break up complex applications into smaller chunks of code. While each service is easier to debug on its own, it can be challenging to find problems that arise from the interaction of these services. A critical part of closing the DevOps feedback loop lies in figuring out how these apps perform in a way that makes it easier to fix bugs and performance bottlenecks.

OpenTracing is an emerging specification from the Cloud Native Computing Foundation (CNCF) that promises to make it easier for developers to find and fix these problems. Dan Kohn, executive director of the CNCF, said: "OpenTracing essentially provides an API that makes it easier to integrate data from various logging tools into distributed tracing tools."

These distributed tracing tools and services include offerings from LightStep and Datadog, as well as open source tools, like Appdash, Zipkin and Uber's Jaeger. OpenTracing libraries are available in nine languages: Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, C++ and C#.

Sleuthing for the root cause

Distributed tracing makes it easier to optimize end-user latency, do root cause analysis for errors and make sense of how distributed systems are connected. Before OpenTracing, this data was often locked into various silos around proprietary logging and application performance management (APM) tools. OpenTracing allows an enterprise to ingest performance data from a wide variety of tools, which can be pulled into the distributed tracing tool of choice.

OpenTracing is already being used on large-scale distributed applications run by Lyft, Twilio and Yext. Twilio reduced incident resolution times by 92%. Twilio SVP of Platform Jason Hudak said its billing transactions team found an issue in the first hour whose resolution reduced latency by 70%. Lyft's systems generate more than 100 billion microservice calls per day. Distributed tracing allows Lyft's teams to identify the root cause of performance problems, all the way from the mobile front end to the bottom of its distribution stack.

Kohn said the OpenTracing community was largely championed by Ben Sigelman, who helped develop Google's Dapper technology for analyzing Google's complex app infrastructure. Over the last couple of years, Sigelman played a key role in working with APM and log analysis tools to create a common API for sharing data. Sigelman also recently launched a new company called LightStep that has developed a distributed tracing tool with a minimal footprint and a high level of precision.

Combining statistics and events

I think of distributed tracing as table stakes for modern app infrastructure.
Ben Sigelman, CEO and Co-founder, LightStep

Historically, tracing meant tracking event flow on a single kernel. As organizations adopt DevOps, they invariably decouple their architectures into many smaller services, which makes it harder to track event flow across boundaries. Distributed tracing extends the concept of traditional kernel tracing to modern distributed architectures. For example, at Lyft, it is normal for traces to touch thousands of services before delivering a response to a user within a few hundred milliseconds. This makes it difficult to make sense of interactions between services.
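To make the idea of following a request across service boundaries concrete, here is a minimal Python sketch. It is not the OpenTracing API; the header names, the service names and the helper functions are all hypothetical, chosen only to show how a shared trace ID and a parent span ID could travel with each call so that the pieces can be stitched back together later.

```python
import uuid

# Hypothetical header names for this sketch; real tracers define their own.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def new_id():
    """Return a random 16-character hex identifier."""
    return uuid.uuid4().hex[:16]


def start_span(name, headers=None, collected=None):
    """Start a span, continuing the caller's trace if the headers carry one."""
    headers = headers or {}
    span = {
        "name": name,
        # Reuse the incoming trace ID, or begin a new trace at the entry point.
        "trace_id": headers.get(TRACE_HEADER) or new_id(),
        "span_id": new_id(),
        "parent_id": headers.get(PARENT_HEADER),
    }
    if collected is not None:
        collected.append(span)
    return span


def inject(span):
    """Build the headers an outgoing request would carry to the next service."""
    return {TRACE_HEADER: span["trace_id"], PARENT_HEADER: span["span_id"]}


def simulate_request():
    """Simulate one request passing through three hypothetical services."""
    spans = []
    frontend = start_span("mobile-frontend", collected=spans)
    rides = start_span("ride-service", inject(frontend), spans)
    start_span("pricing-service", inject(rides), spans)
    return spans
```

Calling `simulate_request()` returns three spans that share one trace ID, with each downstream span pointing at the span that called it. A tracing back end could use that parent chain to rebuild the path a request took.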

"There are really only two types of monitoring data: events or statistics," Sigelman explained.

Event data is gathered by logging tools, while statistics look at events and CPU usage to create a big picture. Tracing and logging are similar, but tracing adds the ability to correlate logs with specific transactions and application code execution. With large systems, the apps typically use concurrency and parallelism to make things faster, but this can make it harder to identify the specific call that is throttling aggregate app performance.
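As a rough illustration of correlating log events with a specific transaction, the following Python sketch groups timing records by a trace ID and picks out the longest call in each transaction. The records, service names and field layout are invented for the example, and a real tracing tool would do considerably more, but it shows why a shared ID helps when several calls run in parallel.

```python
from collections import defaultdict

# Hypothetical log records: (trace_id, service, start_ms, end_ms).
# In each trace, "pricing" and "maps" start at the same time, i.e. in parallel.
LOGS = [
    ("t1", "auth", 0, 12),
    ("t1", "pricing", 12, 95),
    ("t1", "maps", 12, 40),
    ("t2", "auth", 0, 10),
    ("t2", "pricing", 10, 30),
    ("t2", "maps", 10, 33),
]


def group_by_trace(logs):
    """Correlate individual log records with the transaction they belong to."""
    traces = defaultdict(list)
    for trace_id, service, start, end in logs:
        traces[trace_id].append((service, start, end))
    return dict(traces)


def slowest_call(calls):
    """Return (service, duration_ms) for the longest call in one transaction."""
    service, start, end = max(calls, key=lambda call: call[2] - call[1])
    return service, end - start
```

With these sample records, `slowest_call(group_by_trace(LOGS)["t1"])` points at the hypothetical `pricing` call, which would be hard to spot from interleaved log lines that carry no transaction ID.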

Statistics and event analysis complement each other.

"Doing one or the other is good, but [when] you can combine them at the lowest level, you get the best of both worlds," Sigelman said. "It is the connection of statistics to transactions that is technically challenging and the core of what we are solving at LightStep."

Statistical analysis makes it possible to detect anomalies that need investigation. Event analysis can help direct a developer to code that might be responsible for a problem.
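The division of labor between the two kinds of data can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not LightStep's method: the sample traces are made up, the anomaly rule (total latency more than three times the median) is an arbitrary choice, and summing span durations treats the calls as if they ran one after another.

```python
import statistics

# Hypothetical traces: trace ID -> list of (service, duration_ms).
TRACES = {
    "t1": [("auth", 10), ("pricing", 20), ("maps", 25)],
    "t2": [("auth", 11), ("pricing", 22), ("maps", 24)],
    "t3": [("auth", 9), ("pricing", 21), ("maps", 26)],
    "t4": [("auth", 10), ("pricing", 310), ("maps", 25)],
    "t5": [("auth", 12), ("pricing", 19), ("maps", 27)],
}


def total_latency(spans):
    """Sum span durations (a simplification that assumes sequential calls)."""
    return sum(duration for _, duration in spans)


def find_anomalies(traces, factor=3.0):
    """Statistics step: flag traces whose latency far exceeds the median."""
    totals = {tid: total_latency(spans) for tid, spans in traces.items()}
    median = statistics.median(totals.values())
    return [tid for tid, total in totals.items() if total > factor * median]


def suspect_service(spans):
    """Event step: name the service with the longest span in one trace."""
    return max(spans, key=lambda span: span[1])[0]
```

Here the statistical pass flags only trace `t4`, and the event-level pass over that single trace points at the hypothetical `pricing` service as the place for a developer to start looking.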

Going forward, distributed tracing is likely to become a base-level requirement for new app architectures.

"I think of distributed tracing as table stakes for modern app infrastructure," Sigelman said. "Enterprises building software need distributed tracing just to explore how it works in practice. It is just the way of getting that transaction data in front of a human."
