Have you ever deployed a change to production and thought “All went well, systems are operating as expected!” only to hear from users that they keep running into errors?
We recently moved some of our systems between two of our data centers, and even moved some components to the public cloud. Everything was well prepared, system monitoring was set up and everyone gave the thumbs up to execute the move. Immediately after the move, our Operations dashboards continued to show green. Soon thereafter I received a complaint from a colleague who reported that he could no longer use one of the migrated services (our free dynaTrace AJAX Edition) because the authentication web service seemed to fail. The questions we asked ourselves were:
- Impact: Was this a problem related to his account only or did it impact more users?
- Root Cause: What is the root cause and how was this problem introduced?
- Alerting: Why don’t our Ops monitoring dashboards show any failed web service calls?
It turned out that the problem:
- Was caused by the deployment of an outdated configuration file
- Impacted only employees whose accounts were handled by a different authentication backend service
- Didn’t show up in Ops dashboards because the SOAP framework in use always returns HTTP 200 and carries any success/failure information in the response body, which doesn’t show up in any web server log file
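To illustrate the last point, here is a minimal sketch of a check that looks inside the response body instead of trusting the HTTP status code. It assumes the failure is reported as a standard SOAP Fault element; our service may have encoded it differently, and the function name `is_failed_call` and the sample responses are hypothetical.

```python
import xml.etree.ElementTree as ET

# Envelope namespaces defined by the SOAP specifications.
SOAP_NAMESPACES = (
    "http://schemas.xmlsoap.org/soap/envelope/",  # SOAP 1.1
    "http://www.w3.org/2003/05/soap-envelope",    # SOAP 1.2
)


def is_failed_call(http_status: int, body: str) -> bool:
    """Return True if the HTTP status is an error OR the body carries a SOAP Fault."""
    if http_status >= 400:
        return True
    try:
        envelope = ET.fromstring(body)
    except ET.ParseError:
        # A body that is not well-formed XML is counted as a failure here.
        return True
    for ns in SOAP_NAMESPACES:
        # Look for <Envelope>/<Body>/<Fault> in this SOAP version's namespace.
        if envelope.find(f"{{{ns}}}Body/{{{ns}}}Fault") is not None:
            return True
    return False


# Hypothetical response: HTTP 200 on the wire, failure hidden in the body.
fault_response = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><soap:Fault>"
    "<faultcode>soap:Server</faultcode>"
    "<faultstring>Authentication failed</faultstring>"
    "</soap:Fault></soap:Body></soap:Envelope>"
)

print(is_failed_call(200, fault_response))
```

A web server access log records only the 200 for this call, so a dashboard built on status codes stays green. A check like this has to run where the response body is visible, for example in the client or in a monitoring probe.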
In this blog post I give you a little more insight into how we triaged the problem and share some best practices we derived from the incident to level up technical implementations and production monitoring. Only if you monitor all your system components and correlate the results with deployment tasks will you be able to deploy with more confidence without disrupting your business.
Bad Monitoring: When Your End Users Become Your Alerting System
When I got a note from a colleague that he could no longer use dynaTrace AJAX Edition to analyze the performance of a particular web site, I launched my own copy to verify the behavior. It also failed with my credentials, which proved that it was not a local problem on my colleague’s machine: