As with most ops related checks, many of ours are rooted in previous issues we’ve encountered. We rely heavily on DNS for comms between our bird and consoles, and interruptions in DNS are something we want to know about early. Likewise, we want to ensure each customer console (plus other web properties) are accessible.
There are tools available for performing DNS and HTTP checks, but our needs around DNS are somewhat unusual. We want to be able to quickly confirm whether each of our consoles is responding correctly, across multiple third party DNS providers (e.g. Google, Cloudflare, OpenDNS). For a handful of domains that’s scriptable with host, but for many hundreds of domains this becomes an issue if you want to be able to detect issues fairly quickly (i.e. within tens of seconds of the failure).
To plug this gap we built Rapsheet, to give us a “list of crimes” of our systems against intermediary network service providers. In this post I’ll provide a quick run through of why we built it.
Goal: “zero measurements”
Much of the thinking was expounded on by this awesome Eric Brandwine AWS reinvent talk:
The key takeaway for this post is that alarms and dashboards will often lead to Engineers and other technicians developing what is known as “alarm deafness”. Primarily studied in the medical world, it describes the situation where operators rely on a raft of metrics with tunable parameters. If the parameters are too tight, the operators learn to ignore the alarm. If they’re too loose, bad things happen without anyone being the wiser. Alarm deafness grows when a specific check or metric is constantly in an “alerting” state. Eric points out that the best way to alleviate alert deafness, is to constantly be striving for “a zero measurement”, because as soon as you see anything other than a zero measurement, you know that action is required.
If you can find those zero measurement metrics then they provide clear tasks for the operations folks, for whom a key objective is to keep all panels in a non alerting state (i.e. zero measurement). With our current setup, I will drop most things I am busy with whenever I see one of the panels in any colour other than green.
A counter example of an actionable measurement is almost anything which provides a count or tally of an expected conditions (even if it’s an error.) For example, we rely on DNS for a range of services and are able to track DNS failures across multiple providers. A poor metric would be a graph displaying how many DNS queries have failed within the last few hours against any of the DNS servers. This graph may look interesting but nothing about the graph would lead to an exact action that engineers could take. Transient network issues lead to queries failing every so often and our engineers aren’t expected or authorised to fix the backbone network between our hosting site and the 3rd party DNS servers.
Each time a new panel is added to the dashboard it has gone through a number of iterations on the backend in order to ensure that it keeps to this mindset.
In order to “play well with others” and use the third party service fairly, Rapsheet has rate limits (especially since asyncio can push out lots of traffic). Most third party services let you know what these rate limits are so they can be implemented correctly. Interestingly, Google’s DNS service doesn’t. Only by practically testing the limits did we figure out what type of rate we should limit our queries against their DNS service.
Location, location, location
Almost all of Canary services run inside AWS. It therefore made sense to move the monitoring out of AWS. We didn’t want to fall victim to a situation where everything looked good to tests inside AWS, but due to AWS issues were silent to the outside world. So we run Rapsheet at a wholly independent provider.
So far, it’s proved quite adept at picking up issues in this hosting provider’s network…
The dashboard is being projected onto a wall in the operations office which is the main thoroughfare from our desks to the office facilities, so anyone working in the office has to see it when leaving their desks.
Rapsheet is a work in progress, but helps us make sure that things are ticking along as expected.