Ever since reading ‘The DevOps Handbook’ a few months ago (highly recommended for anyone involved in software development) I’ve developed a bit of an obsession with environment monitoring and with making our environments visible to absolutely everyone in the company.
The benefits of this are well documented – but let me cover a couple of them.
As software companies move to more frequent production deployments, the need for fast feedback from production increases, and the use of Application Performance Management (APM) solutions has grown with it. These solutions let us monitor the performance of our production environments and quickly identify any slowdown that may be negatively impacting the user experience. They can also be useful in identifying the root cause of any errors in the application.
In an ideal world the APM solution can highlight any issues (slow performance, exceptions, services not responding etc.) so that engineering can be proactive in fixing these issues. In my mind, this means problems can be fixed before the end user even has to report a bug.
In order to achieve this ideal scenario, our production environments need to be visible to everyone within the engineering department.
Soon after learning about environment monitoring, I came to the opinion that our production environment was not as visible as it should be. We had a tool, created by an Architect, which pulled all logging information for a specified time period down from our APM solution (Application Insights) and placed it into a spreadsheet.
This tool was really useful, and I was an avid user, but getting people to view the logs on a consistent basis proved to be a challenge.
My First Step to a Solution
I wasn’t convinced that this tool was being used widely enough and therefore decided to try and come up with the beginnings of a more permanent solution. We needed something which would be visible to everyone in the office.
We have multiple TV screens dotted around the office which usually display client bug counts. The idea came to me that we should use some of these screens to display a live representation of our application’s production performance and any errors which may be getting thrown. We could also easily integrate the dashboard into our SharePoint launch page so that it’s the first page everyone sees as they log in for the day.
My thinking was that if we displayed a count of slow API response times in a chart on a dashboard, and people walked past and saw this number was really high, then they would be inclined to dig deeper into the logs.
By making any problems visible at a glance on a TV screen or intranet launch page, we, as engineering, would have instant visibility of our production environment. This would make us more proactive in dealing with any production issues.
Creating the Dashboards
I required support from colleagues when it came to creating the dashboards. I had an idea – but creating a decent looking graph was going to require assistance from architecture (who could help me write Application Insights Analytics queries and pointed me in the direction of the PowerBI tool) and Business Intelligence (BI – to help me present the data in a suitable way).
Application Insights Analytics is a powerful search tool which allows you to return specific logging information without having to manually click through the Azure Portal UI. You can search through the entirety of your logging output (from up to 90 days ago) with a relatively simple query. The queries are written in a SQL-like language (AIQL); an example, run against the standard requests table, is below:
requests
| where timestamp > ago(30d)
| summarize ClientCount = dcount(client_IP) by bin(timestamp, 1h), resultCode
I was lucky enough that I didn’t have to create my queries from scratch – I could use the queries already created for our current logging tool and edit them slightly. This actually turned out to be a lot easier than I first anticipated. I did need a bit of a push to jump into it at first but once I understood what the existing queries were doing, manipulating them to meet my needs turned out to be relatively painless.
I would like to store the queries I use to pull back the production metrics in a central location. This would then allow anyone who wishes to investigate the data displayed on the dashboard to run the query on Application Insights Analytics and dig a bit deeper into the root cause of any problems.
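As an illustration of the kind of query that could live in that central location, here is a sketch of one that would back an exceptions dashboard. The table and column names are assumptions based on the standard Application Insights schema (the exceptions table and the cloud_RoleName field identifying the application), not our actual stored queries:

```kusto
// Exception counts per application over the last 7 days,
// bucketed by day - suitable for a stacked bar chart.
exceptions
| where timestamp > ago(7d)
| summarize ExceptionCount = count() by bin(timestamp, 1d), cloud_RoleName
```

Anyone seeing a spike on the dashboard could paste this into Application Insights Analytics, narrow the time window, and drill into the individual exceptions.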
Now that I had the relevant metrics, I began discussing my dashboards with a developer who has BI knowledge. We decided it would be best to display the last seven days' worth of data on each dashboard to give the metrics some context. I'd like to highlight the reason with an example:
If there were 50 exceptions on one API after deployment on a Friday, but there were 1000 the day before the release, would you be concerned?
How about if those 50 exceptions appeared on the Friday but there were 0 the day before? That could well change how you interpret the same figure.
I initially created three dashboards using Microsoft’s PowerBI tool – one displaying the last seven days' worth of performance metrics, another displaying exceptions on WebJobs and APIs over seven days, and a third displaying a combination of both over a 24-hour period.
PowerBI plugs in seamlessly to the Analytics queries described above and automatic data refresh times can be specified. Once the data refreshes, the dashboards automatically display the updated data.
The tool was actually pretty simple to get to grips with – the skill lies in being able to present the data in a sensible way. I really wanted to highlight spikes in performance issues and application errors at a glance.
I had my three dashboards reviewed by a BI developer, and after some discussion we narrowed my suggestions down to displaying a combination of the first two dashboards on a single page. He believed there was an overlap in the data shown on the first two dashboards, and that the third dashboard was only really valuable if I could display a live stream of data. This idea has been set aside for now, but it will be the second dashboard I create.
So what did I decide to include on my dashboard? In summary, there are four graphs, all displaying data from our production environment over the past 7 days:
- A line graph – with a constant line highlighting a non-functional requirement (NFR) – measuring the daily throughput of one of our main WebJobs.
- A horizontal stacked bar graph – displaying all exceptions from the multiple different applications in our software.
- Another horizontal stacked bar graph – highlighting all API responses taking longer than three seconds.
- A vertical stacked bar graph – displaying the 90th percentile of API response times, for values greater than one second.
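The two response-time graphs could be backed by queries along these lines. This is a sketch against the standard Application Insights requests schema (where duration is in milliseconds), not our production queries; the three-second and one-second thresholds come from the graphs above:

```kusto
// API calls slower than three seconds, per API per day
requests
| where timestamp > ago(7d) and duration > 3000
| summarize SlowCalls = count() by bin(timestamp, 1d), name

// 90th percentile of response time per day, keeping only days over one second
requests
| where timestamp > ago(7d)
| summarize P90 = percentile(duration, 90) by bin(timestamp, 1d)
| where P90 > 1000
```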
I believe that this dashboard is a great proof of concept, using the tools we already had at our disposal. I can already think of multiple possibilities for future dashboards. Of course, dashboards are only one way of monitoring production – smoke tests and sensible alerts are next on my list to tackle.
At the time of writing, the dashboard is being reviewed at upper management level; licensing and the logistics of displaying it on the screens and SharePoint are being discussed.
I’m happy with what I’ve achieved up to this point – I’ve proven that production monitoring doesn’t have to be overly complicated. If I can come up with a basic dashboard, then anyone can. Now I need to build on this so that we can have a constant stream of feedback from our production environment.