Introducing Rapsheet

We've got hundreds of servers and thousands of Canaries deployed in the world. Keeping them healthy is a large part of what we do, and why customers sign up for Canary. Monitoring plays a big role in supporting our flocks and keeping the infrastructure humming along. A pretty common sight in operations are dashboards covered with graphs, charts, widgets, and gizmos, all designed to give you insight into the status of your systems. We are generally against doing things “just because everyone does it” and have avoided plastering the office with “pew-pew maps” or vanity graphs.

(although the odd bird-migration graph does slip through)

As with most ops related checks, many of ours are rooted in previous issues we've encountered. We rely heavily on DNS for comms between our bird and consoles, and interruptions in DNS are something we want to know about early. Likewise, we want to ensure each customer console (plus other web properties) are accessible.

There are tools available for performing DNS and HTTP checks, but our needs around DNS are somewhat unusual. We want to be able to quickly confirm whether each of our consoles is responding correctly, across multiple third party DNS providers (e.g. Google, Cloudflare, OpenDNS). For a handful of domains that's scriptable with host, but for many hundreds of domains this becomes an issue if you want to be able to detect issues fairly quickly (i.e. within tens of seconds of the failure).
To plug this gap we built Rapsheet, to give us a “list of crimes” of our systems against intermediary network service providers. In this post I'll provide a quick run through of why we built it.

Goal: "zero measurements"

In this post, I am going to dive a little deeper into the thinking behind Rapsheet and why the dashboard behaves in the way it does. Why we aim for zero measurements to be the goal and what this actually means. (Hint: it doesn't mean "take no measurements".)
Much of the thinking was expounded on by this awesome Eric Brandwine AWS reinvent talk:


If you have not yet watched it, you really should. Eric is whip smart and is an excellent presenter.
The key takeaway for this post is that alarms and dashboards will often lead to Engineers and other technicians developing what is known as “alarm deafness”. Primarily studied in the medical world, it describes the situation where operators rely on a raft of metrics with tunable parameters. If the parameters are too tight, the operators learn to ignore the alarm. If they’re too loose, bad things happen without anyone being the wiser. Alarm deafness grows when a specific check or metric is constantly in an “alerting” state. Eric points out that the best way to alleviate alert deafness, is to constantly be striving for “a zero measurement”, because as soon as you see anything other than a zero measurement, you know that action is required.

If you can find those zero measurement metrics then they provide clear tasks for the operations folks, for whom a key objective is to keep all panels in a non alerting state (i.e. zero measurement). With our current setup, I will drop most things I am busy with whenever I see one of the panels in any colour other than green.

A counter example of an actionable measurement is almost anything which provides a count or tally of an expected conditions (even if it’s an error.) For example, we rely on DNS for a range of services and are able to track DNS failures across multiple providers. A poor metric would be a graph displaying how many DNS queries have failed within the last few hours against any of the DNS servers. This graph may look interesting but nothing about the graph would lead to an exact action that engineers could take. Transient network issues lead to queries failing every so often and our engineers aren't expected or authorised to fix the backbone network between our hosting site and the 3rd party DNS servers.



Instead we only track which of our servers are not currently responding correctly to our various DNS queries. With this metric, it becomes a lot easier for us to determine patterns, and therefore understand the root cause of the problems that are responsible for the failing DNS queries. For example, we rely on Rapsheet to tell when a particular DNS provider is experiencing issues, or if a particular server is experiencing DNS issues, or whether one of our domains has been flagged as unsafe. This then leads to the issues being mitigated and resolved more timeously.

Architecture

At it's core it's straight forward. A long running processes uses asyncio to periodically perform a whole bunch of network tests (DNS, HTTP), pumps the results into InfluxDB and then we build a simple UI in Grafana.

The tests are tied together based on a customer, so we can quickly determine if an issue is customer-specific or more general.

Rapsheet has modules to implement each of its tests. Current modules include DNS checks against four different DNS providers. Different HTTP health checks including reachability, served content and endpoint checks against site blacklists such as Google’s SafeBrowsing. All endpoints are asynchronously checked and the results collated before posting the metrics into Grafana.  

Each time a new panel is added to the dashboard it has gone through a number of iterations on the backend in order to ensure that it keeps to this mindset.


It’s built to be extensible, so we can quickly add new zero-based measurements we want to track. 

A recent example was exception counting. We have as a design goal the fixing of unhandled exceptions. The dashboard has an “exception monitoring” panel that tracks the number of unhandled exceptions across all of our deployed code (on customer consoles and our own servers). Coming back to the notion of focus, it becomes a very clear goal for me: handle the exceptions that are cropping up. Most of the time it involves a deep dive into what is causing the exception and how we should best mitigate those cases. When introduced, the panel would get tripped up with noisy but harmless exceptions (and those panels would just burn red). After a flurry of root cause fixes, and thanks to the goal of driving them to zero, we only get a couple every other day across hundreds of production systems (and work is under way to alleviate those too).

Rate-limiting

Some of the panels involve performing requests or queries against third party services. For example, we want to know if someone using the Chrome browser is able to login to their console.
In order to "play well with others" and use the third party service fairly, Rapsheet has rate limits (especially since asyncio can push out lots of traffic). Most third party services let you know what these rate limits are so they can be implemented correctly. Interestingly, Google's DNS service doesn’t. Only by practically testing the limits did we figure out what type of rate we should limit our queries against their DNS service.

Location, location, location

Almost all of Canary services run inside AWS. It therefore made sense to move the monitoring out of AWS. We didn’t want to fall victim to a situation where everything looked good to tests inside AWS, but due to AWS issues were silent to the outside world. So we run Rapsheet at a wholly independent provider.

So far, it's proved quite adept at picking up issues in this hosting provider's network...

Wrap-up

The dashboard is being projected onto a wall in the operations office which is the main thoroughfare from our desks to the office facilities, so anyone working in the office has to see it when leaving their desks.


(Of course, it’s online for remote workers who are able to keep it open on their desktops if needed.)

Rapsheet is a work in progress, but helps us make sure that things are ticking along as expected.

3 comments :

  1. This comment has been removed by the author.

    ReplyDelete
  2. The developed diagrams and graphs give a complete picture of the operation of each system. The failures will also be visible for the future fixes.

    ReplyDelete