“Why did our revenue decline last month? Is something wrong with our systems?” This is the last thing you want to hear from your boss. But even worse than hearing those words is not being able to deliver any answers, largely because your application logs are scattered across countless individual servers. Things break, but if you're managing your log files properly, not only will you be more likely to anticipate an impending outage, but when a problem does occur, you'll know where to look and can quickly come up with a solution.
Log mining to the rescue
In the single-server days, it was easy to simply grep through a log file or two. But in today’s distributed computing environment, it can be very difficult to track down which server you need to search, or to log into multiple servers to follow the trail of events that caused a particular failure. The days of isolating a software bug with a simple grep command are over. To diagnose a runtime problem, especially in a modern, enterprise-wide system, you really need to dig into logs and search through them thoroughly. This has become increasingly true as developers and operators transition into a single DevOps role, where the developer is responsible not only for building code, but also for its maintenance and runtime operations.
Thankfully, there are many services available specifically for log aggregation. The idea behind log aggregation is not only to get all of your logs in one place, but to turn those logs into more than just text. To be useful, logs have to be something you can search on, report on, and even get statistics from. Graphs and charts are much easier to read than millions of lines of logs. Over the last few years, I’ve experimented with several different log aggregation systems, some focused on the system level and some on the application level. In the end, they all serve one primary purpose: helping you find, alert on, and diagnose problems.
Papertrail is a very simplistic system. It takes in logs from syslog or rsyslog and spits them out again. It even has a simple command-line client so you can easily tail log files as they come into the system. It provides easy access to any recent logs. It’s a true replacement for grep across multiple servers, but forget about digging much deeper than the past million records.
The further back you have to search, the longer it takes; some searches took us 15 minutes or more to complete. Want to grab logs from last week? Forget about it.
They do have a competitive pricing model, and they offer a lot of field auto-discovery and automatic highlighting, but nothing very advanced. Papertrail is essentially like running tail -f | grep on all your servers at once. The command-line client can do exactly that as well, which is nice to run on the fly, but this is no doubt a technical tool developed by people who do not specialize in dealing with non-technical people.
Simple, but perhaps too simple for any hard-core use.
Pros:
- Simple interface for tailing log files
- Fast to get logs into the system and read them
- Easy to diagnose issues as they are happening

Cons:
- VERY slow at reading older logs
- Not useful for diagnosing issues that have already happened
- No advanced graphing
- Not friendly for non-technical staff
- No dashboards
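If the tail -f | grep behavior is all you need, it helps to see how little that really is. Here is a minimal sketch of the same idea in Python, assuming plain substring matching; the names follow() and grep() are hypothetical helpers, not part of any Papertrail tooling.

```python
import time
from typing import Iterable, Iterator, TextIO


def follow(f: TextIO, poll_interval: float = 0.5) -> Iterator[str]:
    """Yield lines appended to an open file, like `tail -f`."""
    f.seek(0, 2)  # jump to the end; we only care about new lines
    while True:
        line = f.readline()
        if line:
            yield line
        else:
            time.sleep(poll_interval)


def grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Keep only lines containing `pattern`, like piping through grep."""
    return (line for line in lines if pattern in line)


# Example: stream only 5xx responses out of an access log (path illustrative):
# for hit in grep(follow(open("/var/log/nginx/access.log")), " 500 "):
#     print(hit, end="")
```

Anything beyond this — retention, cross-server search, reporting — is what you are actually paying an aggregation service for.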
Logentries was actually recommended to me after I wrote a blog post about our search for log services. After checking it out for a little while, I realized pretty quickly that the service was not for us. While Logentries does application-level logging pretty well, it does not do system-level logging. To really take advantage of Logentries, you have to specifically code your applications to push logs to it.
That means once you’re committed to Logentries, it’s much more difficult to switch to something else. While that’s good for them, it’s not good for you or your developers. We wanted a system that would work with our existing log formats and our existing logging methods, and just hook into that. Logentries was not that solution.
Pros:
- Simple interface

Cons:
- Only supports application-level logging
- Vendor lock-in
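One way to blunt that kind of lock-in is to route everything through your language's standard logging layer and confine the vendor-specific part to a single handler. A sketch using Python's stdlib logging; build_logger is a hypothetical helper, not a Logentries API.

```python
import logging


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Create a logger whose destination is decided in one place.

    Application code only ever calls log.info(...) / log.error(...);
    swapping the handler (syslog, a file, or a vendor SDK wrapper)
    requires no changes anywhere else.
    """
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    return log
```

With this shape, moving from one backend to another means changing the handler passed in at startup, e.g. logging.handlers.SysLogHandler versus a vendor-specific handler, rather than rewriting every call site.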
Logstash, developed by my college friend Jordan Sissel, is a simple tool for parsing arbitrary lines of text and turning them into meaningful data. Despite its name, Logstash is not restricted to log aggregation; it can handle anything that has both a time component and text associated with it. For example, some users have used Logstash to parse Twitter, and to search and graph the results.
Unfortunately, with Logstash you must maintain your own servers; there is no SaaS offering. Additionally, the best interface to it so far is Kibana, which is severely lacking in both features and color. It’s very archaic-looking, and it does not support post-processing of logs on the fly: you must define your log formats before sending data to Logstash. If you send a log that matches an existing format but deviates slightly in one way, the entire entry might be dropped.
We had several bad experiences with logs being dropped without warning, simply because we slightly changed our logging format, or because one log defined err as a string while another sent it as an object with a code and a name. Either way, the moment a problem happens is a really bad time to find out that your logs don’t work.
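That err mismatch is exactly the kind of thing you can guard against by normalizing records before they ever reach the aggregator. A minimal sketch, assuming JSON-style dict records; the function name and field shapes are illustrative, not part of Logstash itself.

```python
def normalize_err(record: dict) -> dict:
    """Coerce the `err` field into one shape (an object with a code
    and a name) so every record indexes under the same mapping,
    instead of one log sending a string and another an object."""
    err = record.get("err")
    if err is None:
        return record
    if isinstance(err, str):
        # A bare string becomes an object with a generic code.
        record["err"] = {"code": "UNKNOWN", "name": err}
    elif isinstance(err, dict):
        record["err"] = {
            "code": err.get("code", "UNKNOWN"),
            "name": err.get("name", "unknown"),
        }
    return record
```

Running every record through a gate like this costs almost nothing and removes an entire class of silently dropped logs.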
Being an open-source product, Logstash is very easy to modify into what you want or need, even down to swapping out the front-end interface, but that’s not what I wanted to do. If you’re looking for an existing system that’s ready for you to start sending logs to, Logstash isn’t it.
Pros:
- Cheap; in fact, it's free
- Very extensible
- Offers an easy API
- Can buffer log sending through Redis, so a downed logger doesn't lose messages

Cons:
- Elasticsearch dies a lot
- Prone to issues when upgrading
- Changes in schema or data format break everything
- No built-in alerting or reporting
Graylog2 is very similar to Logstash. Graylog2 simply adds a nicer interface than Kibana, plus some saner defaults that help with parsing basic logs. It’s also much easier to install than Logstash, and slightly less prone to errors, since it bundles a customized version of Elasticsearch with the platform. Still, if your disk fills up, it’ll happily keep trying to write data until the entire system falls over.
Pros:
- Cheap (free!)
- Sane default searching
- Simple dashboards and interface

Cons:
- Still uses Elasticsearch, so prone to errors
- Can’t use the latest Elasticsearch, since theirs is always highly customized
- Requires your own servers and custom setup
- Still no advanced searches or graphing
- No built-in alerting or reporting
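For reference, Graylog2 ingests logs in GELF, a small JSON format whose required fields are version, host, and short_message, typically sent to a UDP input on port 12201, with custom fields prefixed by an underscore. A sketch of building and sending such a payload; the hostnames and field values are illustrative.

```python
import json
import socket
import time


def gelf_payload(host: str, message: str, level: int = 6, **extra) -> bytes:
    """Build a GELF 1.1 message; extra fields get the required
    underscore prefix for custom attributes."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational
    }
    for key, value in extra.items():
        record["_" + key] = value
    return json.dumps(record).encode("utf-8")


def send_gelf(payload: bytes, server: str = "localhost", port: int = 12201) -> None:
    """Fire the payload at Graylog's default GELF UDP input."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (server, port))
```

Because GELF is structured from the start, fields like _service arrive already parsed, which is what makes Graylog2's default searching saner than grepping raw text.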
Read more in Part 2, where we discuss CloudWatch Logs, Loggly, and Splunk.
Which log aggregator and analysis tool do you use? Let us know.