How Instacart works around buggy Elasticsearch queries

Enterprises that use Elasticsearch to find dynamic information in other apps are struggling to identify errant code that stalls those apps. In theory, application performance monitoring tools should help. But those tools weren’t enough for Instacart to identify the queries that consistently created problems for its consumers and shoppers, said John Meagher, senior software engineer, search infrastructure at Instacart.

Simply scaling up the company’s Elasticsearch instance didn’t solve the problem. So Meagher set out to find a better way to pinpoint what was responsible for the performance issues. As it turned out, a small number of poorly coded Elasticsearch queries were responsible for most of the problems. Once the team found a better way to monitor queries, those errors were reduced by 90% and a lot of other problems went away too.

Building a digital grocer

A key element of Instacart’s business was the creation of the world’s largest, constantly updated digital catalog of grocery items. Consumers can access the catalog through mobile and web apps when they order food from one or more stores. It’s also used to guide shoppers, who purchase food on behalf of consumers, through store aisles. The app needs to present a different view of the information to customers and shoppers.

Elasticsearch sits at the core of this whole process. It makes it easy to surface a dynamic view of the available food and presents different options for consumers and shoppers. Instacart standardized its catalog management on top of Elasticsearch because it’s highly scalable in a way that makes it easy to update items and related information. For example, the company wanted a platform that allows one team to update the nutritional information and description of a product while the store updates its inventory.

Elasticsearch makes it easy for developers to code logic that dynamically aggregates and generates information in response to complex queries on the fly. However, when Elasticsearch goes down, everyone else’s services do too. The catalog in Elasticsearch features almost 600 million items that are updated about 750 times per second. Its distributed architecture makes it easier to spread queries across clusters. As a result, each cluster only has to handle about 500 queries per second, while the entire infrastructure handles about 15,000 queries per second.

The pain of buggy Elasticsearch queries

Instacart’s main application includes some outdated code from its founding along with code from new developers. As a result, it’s hard to find the buggy code when a problem emerges. “It feels like we are trying to find a needle in the haystack when looking for what is causing a problem in a cluster, only it’s worse. It’s more like trying to find one needle in a pile of other needles,” Meagher said.

In early 2018, Instacart would see tens of thousands of timeout errors per day. Many components of the Instacart app wouldn’t wait for Elasticsearch queries to come back, and they’d time out early. Some of the particularly bad queries would see as low as a 10% success rate, and a few had a 0% success rate. The site would often go down on the weekends during peak shopping periods, causing major issues for the app.

The biggest problem with these code issues was a lack of visibility. Instacart developers could see bulk aggregated errors and latency, but couldn’t get proper visibility into the code that caused the problem. In most cases, Instacart staff would just get reports that Elasticsearch was slow, but they wouldn’t be able to show whether their Elasticsearch infrastructure was working or not. It was also challenging to see whether specific queries or APIs were behind the problems.

Create a bigger picture

Instacart had a variety of tools that provided some part of the big code problem picture, but not the whole thing. They used Kibana, Java Mission Control, New Relic, JDK Flight Recorder and other APM tools to track app performance and error reporting tools to look at raw logs. For example, when a bad query hit, it would jam up the queue and all the other queries would slow down. It was hard to find the one at the root of the problem.

Meagher led the development of a new type of Elasticsearch monitoring tool, called ESHero, to make it easier to diagnose which queries would create bottlenecks. The tool’s main insight was to provide a way to aggregate information across server applications that were Elasticsearch cluster clients.

They used a collection of Ruby applications that ran on each application server, pulled the data into a central repository and then used machine learning to make sense of it. The tool provided a way to instrument all the calls to the Elasticsearch cluster, and the collected data could be further explored via Elasticsearch queries.
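The article does not show how ESHero instruments calls, so the following is only a minimal sketch of the general idea, written in Python rather than the Ruby the team used. The names `CallRecord` and `instrumented` are hypothetical, and the in-memory list stands in for whatever central repository a real deployment would ship records to.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CallRecord:
    label: str         # hypothetical name identifying the call site
    elapsed_ms: float  # wall-clock time spent inside the call
    ok: bool           # False when the call raised an exception


def instrumented(label: str, sink: List[CallRecord],
                 fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap fn so that every call appends one CallRecord to sink."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        ok = False
        try:
            result = fn(*args, **kwargs)
            ok = True
            return result
        finally:
            # Runs on success and on failure, so failed calls are recorded too.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            sink.append(CallRecord(label, elapsed_ms, ok))
    return wrapper
```

Wrapping the client's search function once, at the point where the application creates it, would capture every call without touching individual call sites. Exceptions still propagate to the caller; the wrapper only observes them.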

An important element of ESHero was a way to identify particular queries. The challenge was that each query’s payload was slightly different. Meagher’s team found a way to strip out the dynamic information and tag what remained with a query ID associated with a specific application call. They also added other data, such as collection time and where in the code a query was called from.
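One common way to give structurally identical queries the same ID is to blank out the values in the query body and hash what is left. The sketch below assumes that approach; the article does not say how ESHero derives its IDs, and the function names `strip_dynamic` and `query_id` are hypothetical.

```python
import hashlib
import json
from typing import Any


def strip_dynamic(node: Any) -> Any:
    """Keep keys and nesting, but replace every leaf value with "?"."""
    if isinstance(node, dict):
        return {key: strip_dynamic(value) for key, value in node.items()}
    if isinstance(node, list):
        # A list of plain values (for example, a list of IDs) collapses to
        # one placeholder so that its length does not change the ID.
        if all(not isinstance(item, (dict, list)) for item in node):
            return ["?"]
        return [strip_dynamic(item) for item in node]
    return "?"


def query_id(body: dict) -> str:
    """Return a short, stable ID for the shape of a query body."""
    canonical = json.dumps(strip_dynamic(body), sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]
```

With this sketch, `{"query": {"term": {"store_id": 12}}, "size": 10}` and `{"size": 25, "query": {"term": {"store_id": 98}}}` get the same ID because they differ only in values and key order, while a `match` query on a different field gets a different one. The collection time and calling code location mentioned above would be stored alongside the ID rather than folded into it.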

Once they finished the first iteration, Meagher was surprised to find that the Elasticsearch clusters were basically healthy. The problems, however, were mostly caused by the spillover impact of poorly coded queries.

These insights gave them a way to prioritize development on the worst-performing queries, and to think about ways to retry good ones. For example, a small number of queries dominate shopping patterns, and when these stall, so does the user experience. So the team decided to focus on aggressively retrying the stalled queries, but they found that an arbitrary number of retries is extremely dangerous. If a well-formed query experiences a 10% success rate, further retries can create more problems.
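To see why unbounded retries are risky, consider a query that succeeds 10% of the time, independently on each attempt: it needs 1 / 0.1 = 10 attempts on average, so retrying until success multiplies the load from that query roughly tenfold. A hard cap on attempts keeps the worst case bounded. The sketch below is one conventional way to do that and is not taken from Instacart's code; `call_with_retries` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.05,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TimeoutError at most max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            # Exponential backoff with random jitter, so that many stalled
            # clients do not all retry at the same instant.
            sleep(random.uniform(0.0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

With `max_attempts=3`, a stalled query adds at most three times its normal load, however low its success rate falls.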

After Meagher’s team identified and fixed the worst code, Instacart went from about 60,000 timeouts per day to about 2,000.

Before they started this work, the site went down almost every weekend. “Now when my partner went on paternity leave for three months, I was happy to be on call,” said Meagher. Instacart hasn’t open-sourced ESHero yet, but Meagher said he would be happy to work with others interested in deploying similar tools in their own organizations.