Incorporate telemetry to keep a DevOps tools list in check

Don't let a long list of DevOps tools burden your enterprise. Incorporate telemetry to simplify your tooling and prevent developers from working redundantly on similar problems.

As more enterprises adopt DevOps, forward-thinking engineers want to make it easier for developers to integrate telemetry with application development. But, because developers use a multitude of Java DevOps tools to customize applications and track metrics, it can be difficult to organize everything under one roof.

This became a real problem at mortgage software giant Ellie Mae, which ended up with a long DevOps tools list and a wide assortment of telemetry tooling, said Anthony Johnson, principal engineer at Ellie Mae, while speaking at Elastic{ON} in San Francisco.

Ellie Mae developers focus on the use of telemetry and technology to mitigate application challenges, but this eventually led to a kind of telemetry sprawl as their DevOps tools list grew. "As soon as someone had a problem with a tool, they would get another tool that solved that problem," Johnson said. "As a result, we have everything under the sun."

Tools like Splunk, SignalFx and Apache Solr pull data from streams, and AppDynamics manages application performance. The company has about 40 engineering teams composed of over 1,000 engineers. Tool customizations and data analysis routines were poorly coordinated, so different members and teams ended up doing the same work.

"More than one tool is expensive," Johnson said. "Everyone loves their tools and telemetry systems, but often good intentions don't lead to good results." Ellie Mae, he said, ended up with a lot of data but couldn't make sense of it in a coherent way.

Bring tooling to telemetry

Johnson's team needed to better organize the DevOps tools list. Initially, he and his colleagues found they could get results from a new tool built on top of Elasticsearch code, with Kibana for visualization. Still, this method involved meticulous ticketing queues, and it relied heavily on application and infrastructure administration and maintenance.

A better approach was to find a way to use microservices and build APIs that would make it easier to support and customize the tooling via a self-service model.

The team started with Python and Ruby to write the code and used Ansible and SaltStack for deployment. This combination made it easier to write new telemetry modules with a declarative programming model, which provides better abstraction layers to adjust the telemetry infrastructure compared with an imperative model, Johnson said. The declarative model also enables developers to work with standardized components and deliver more consistent results.
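
To illustrate the declarative idea in general terms, here is a minimal Python sketch. It is not Ellie Mae's code: the `Dashboard` class, the `plan` function and the example values are all hypothetical. The point is that a developer only describes the desired telemetry state, and a shared routine works out which steps are needed to get there.

```python
from dataclasses import dataclass


# Hypothetical description of one telemetry component (not from the article).
@dataclass(frozen=True)
class Dashboard:
    name: str
    index_pattern: str


def plan(desired, current):
    """Compare desired and current state and return the actions to reconcile them."""
    desired_by_name = {d.name: d for d in desired}
    current_by_name = {d.name: d for d in current}
    actions = []
    for name, dash in sorted(desired_by_name.items()):
        if name not in current_by_name:
            actions.append(("create", name))
        elif current_by_name[name] != dash:
            actions.append(("update", name))
    for name in sorted(current_by_name):
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


# Developers declare what should exist; they do not script each step.
desired = [Dashboard("latency", "app-*"), Dashboard("errors", "app-*")]
current = [Dashboard("latency", "logs-*"), Dashboard("old", "tmp-*")]
print(plan(desired, current))
```

In an imperative version, each team would write its own create, update and delete steps, which is where inconsistent results tend to creep in.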

Another important consideration was to use a specification-driven approach to describe performance expectations. When performance drifts, this approach makes it easier to automate adjustments to bring the app back into conformance.
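
A specification of this kind could be as simple as a table of bounds checked against observed metrics. The sketch below is an assumption for illustration only: the metric names, thresholds and the `find_drift` function are hypothetical and do not come from Johnson's talk.

```python
# Hypothetical performance specification: each metric has a max or min bound.
SPEC = {
    "p95_latency_ms": {"max": 250},
    "error_rate": {"max": 0.01},
    "throughput_rps": {"min": 100},
}


def find_drift(spec, observed):
    """Return a dict of metrics that are missing or outside their specified bounds."""
    violations = {}
    for metric, bounds in spec.items():
        value = observed.get(metric)
        if value is None:
            violations[metric] = "missing"
        elif "max" in bounds and value > bounds["max"]:
            violations[metric] = f"{value} exceeds max {bounds['max']}"
        elif "min" in bounds and value < bounds["min"]:
            violations[metric] = f"{value} below min {bounds['min']}"
    return violations


observed = {"p95_latency_ms": 300, "error_rate": 0.005, "throughput_rps": 120}
print(find_drift(SPEC, observed))
```

An automation layer could then act on the returned violations, for example by scaling a service or raising an alert. How to respond is a separate design decision that this sketch leaves open.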

It was also important to implement proper version control for the code that controls this automation. A change history ensures that everyone on the team can track changes, and prior versions are available to make it possible to roll things back in the event that something goes awry.

Make it continuous

It was also important for Ellie Mae staff to think about how to continuously deploy the telemetry so it could be refreshed or used by other applications as required. The team also wanted Elasticsearch instances to be easy to use as they were needed, without overloading the Elastic server. Johnson used an API in Kibana to spin up new Elastic instances, along with the Kibana tooling that works with them, as code.

The team uses Terraform to automate infrastructure provisioning specified in the code, but Terraform didn't have a plugin to support Elastic and Kibana. Johnson solved this by creating and publishing a plugin, available on GitHub, to help others and solicit feedback.

One of the benefits of this automated infrastructure provisioning approach is that it also makes it easier for developers to spin up instances on a local server so they can test their apps during development. Before the team implemented telemetry as code, it took a lot of time and effort to spin up the telemetry that went along with an app.

Johnson said this approach gives developers a way to create dashboards locally and then use a Python script to export the data to source control. The idea is to treat a dashboard as source code in a way that makes it easier to test. As a result, errors in a dashboard show up as bugs, making them easier for application teams to fix.
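
The article does not show the export script, so the following Python sketch only illustrates the general idea under stated assumptions. The dashboard structure, the `to_source` and `lint_dashboard` functions and the field names are hypothetical, and no real Kibana API is called. A dashboard is serialized in a stable form so it diffs cleanly in source control, and a simple check turns a misconfigured panel into a reportable error.

```python
import json


def to_source(dashboard):
    """Serialize a dashboard with sorted keys so version-control diffs stay small."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


def lint_dashboard(dashboard, known_fields):
    """Return a list of problems found in a (hypothetical) dashboard definition."""
    errors = []
    if not dashboard.get("title"):
        errors.append("dashboard has no title")
    for panel in dashboard.get("panels", []):
        field = panel.get("field")
        if field not in known_fields:
            errors.append(f"panel {panel.get('id')!r} uses unknown field {field!r}")
    return errors


# Hypothetical dashboard with a typo in the second panel's field name.
dashboard = {
    "title": "Latency",
    "panels": [
        {"id": "p1", "field": "latency_ms"},
        {"id": "p2", "field": "latncy_ms"},
    ],
}
print(to_source(dashboard))
print(lint_dashboard(dashboard, known_fields={"latency_ms"}))
```

A check like this could run in a test suite alongside application tests, so a broken panel fails a build instead of surfacing later on a wall monitor.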
