If I run your software, can you hack me?

In our previous post (Are Canaries Secure?) we showed (some of) the steps we’ve taken to harden Canary and limit the blast radius from a potential Canary compromise. Colloquially, that post aimed to answer the question: “are Canaries Secure?”

This post aims at another question that pops up periodically: “If I run your Canaries on my network, can you use them to hack me?”

This answer is a little more complicated than the first, as there is some nuance. (Because my brutally honest answer is: “yeah… probably”.)

But this isn’t because Canary gives us special access; it’s true because most of your other vendors could do the same. If you run software with an auto-update facility (and face it, auto-updates are the gold standard these days), then the main thing stopping that vendor from using the software to gain a foothold on your network is a combination of that vendor's imagination, ethics, and discomfort with the size of jail cells. It may not be a comfortable fact, but the fact remains true with no apparent appreciation for our comfort levels.

Over a decade ago we gave two talks on tunneling data in and out of networks through all sorts of weird channels (the pinnacle was a remote timing-based SQL injection to carry TCP packets to internal RDP machines). [“It’s all about the timing”, “Pushing the Camel through the Eye of the Needle”] 


The point is that with a tiny foothold we could expand pretty ridiculously. Sending actual code down to my software that’s already running inside an organization is like shooting fish in a barrel. This doesn’t just affect appliances or devices on your network, it extends to any software.

Consider VLC, the popular video player. Let’s assume it’s installed on your typical corporate desktop. Even if you reversed the hell out of the software to be reasonably sure that the posted binaries aren’t backdoored (which you didn’t), you have no idea what last night’s auto-update brought down with it. 

You don’t allow auto-updates? Congratulations, you now have hundreds of vulnerable video players waiting to be exploited by a random video of cats playing pianos. 

This ignores the fact that even if the video player doesn't download malicious code, it could simply download vulnerable code, which then leaves the software open to exploitation.

It’s turtles all the way down.

So what does this mean? Fundamentally it means that if you run software from a 3rd party vendor which accepts auto-updates (and you do) you are accepting the fact that this 3rd party vendor (or anyone who compromises them) probably can pivot from the internet to a position on your network.

Chrome has successfully popularised the concept of silent auto-updates and it’s a good thing, but it’s worth keeping in mind what we give up in exchange for the convenience. (NB. We’re not arguing against auto-updates at all; in fact we think you’d be remiss not to enable them.)
You could mitigate this in general by disabling updates, but that opens you up to a new class of problems. What’s left is only a handful of solutions:
  • A new model of computation – You could mitigate this by moving to Chromebooks or really limited end-user devices. But remember, no third-party Chrome apps or extensions, or you fall into the same trap.
  • You can be more circumspect about whose software you run. Ultimately the threat of legal action is what provides the boundaries for contracts and business relationships, which goes a long way in building trust in third parties. If you have a mechanism to recover damages from or lay charges against a vendor for harmful actions, you’ll be more likely to give their software a try. But this still ignores the risk of a vendor being compromised by an unrelated attacker.
  • You can hope to detect when the software you don’t trust does something you don't expect.
For the second solution, software purchasers can demand explanations from their vendors for how code enters the update pipeline, and how its integrity is maintained. We’ve discussed our approach in a previous post. (It’s also why we believe that customers should be more demanding of their vendors!)

The last solution is interesting. We’re obviously huge fans of detection, and previous posts even mention how we detect if our Consoles start behaving unusually. On corporate networks, where the malicious software could be your office phones or your monitors or the lightbulbs, pretty much your only hope is having some way of telling when your kettle is poking around the network.

Ages ago, Marcus Ranum suggested that a quick diagnostic when inheriting a network would be to implement internal network chokepoints (and to then investigate connections that get choked). We (obviously) think that dropping Canaries is a quick, painless way to achieve the same thing.

It's trite, but still true. Until there are fundamental changes to our methods of computation, our only hope is to “trust, but verify” and on that note, we try hard to be part of the solution instead of the problem.


Are Canaries Secure?

What a question. In an industry frequently criticised for confusing security software with secure software, and where security software is ranked poorly against other software segments, it's no surprise we periodically hear this question when talking to potential customers. We figured we'd write a quick blog post with our thoughts on it.

We absolutely love the thought of this question coming up. Far too many people have been far too trusting of security products, which is how we end up with products so insecure that FX said you'd be "better off defending your networks with Microsoft Word".

In fact, it's one of the things we actively pushed for in our 2019 talk on "the Products we Deserve":

So, how do we think about security when building Canary?

Most of our founding team have a long history in offense and we've worked really hard to avoid building the devices we've taken advantage of for years. From base architectural choices, to individual feature implementations, defensive thinking has been baked into Canary at multiple layers.

We're acutely aware that customers are trusting our code in their networks. We go to great lengths to ensure that a Canary does not introduce additional risk to our customers. The obvious solution here is to make it "more secure" (i.e. a harder target to compromise than other hosts on the network). But that's not sufficient: a harder target is not an impossible target, given enough time.

So the second part of "not introducing additional risk" is to ensure that there's nothing of value on the Canaries themselves that attackers might want.

tl;dr: Canaries should be harder to compromise than other targets and should leave an attacker no better off for compromising them.

What follows are some examples of our thinking. We've left out some bits (where prudent), but we (strongly) feel that customers should be asking vendors how they reduce their threat profile, and figure we should demonstrate it ourselves.

Implementation

All the important services on the Canary are written in memory-safe languages and are then sandboxed. The Canary itself holds no secrets of importance to your network. Choosing memory-safe languages comes with a performance tradeoff, but it's one we're happy to make. With that architectural decision, the only potential memory corruption bugs are in the underlying interpreter, which is well-tested (and harder to reach) at this point.

Network spanning

We also don't allow Canaries to be dual-homed or span VLANs. That's because it would violate the principle of not having anything valued by an attacker on the Canaries. Compromising a dual-homed Canary would allow an attacker to jump across networks, and we won't let this happen on our watch.

Cryptographic underpinnings

During their initial setup, Canaries create and exchange crypto keys with your console. From that point on, all communication between the Canary and your console is encrypted using these keys.

The underlying symmetric encryption library used is NaCl, which provides the Salsa20 stream cipher for encryption and Poly1305 MAC for authentication. Again, we could have chosen slightly more space-efficient cryptographic constructs, but we followed the best practice of selecting a cryptographic library which doesn't permit choices and removes all footguns.
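To make that concrete, here is a minimal sketch (not our production code) using PyNaCl, the Python binding for NaCl. SecretBox is NaCl's secretbox construction, pairing the XSalsa20 stream cipher with a Poly1305 authenticator:

# Minimal, illustrative PyNaCl example -- not our production code.
import nacl.secret
import nacl.utils

# Both ends hold the same 32-byte symmetric key, exchanged during initial setup.
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
box = nacl.secret.SecretBox(key)

# encrypt() generates a random nonce and prepends it to the ciphertext.
ciphertext = box.encrypt(b'{"event": "example alert"}')

# decrypt() checks the Poly1305 authenticator first; tampering raises CryptoError.
assert box.decrypt(ciphertext) == b'{"event": "example alert"}'

The "no footguns" point is visible in the sketch: there is no cipher, mode or nonce handling left for us to get wrong.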

Updates

Our birds are remotely updated to make sure they stay current, and that's a common subject of questions from potential customers. To maintain the integrity of our updates, your Canary will only accept an update that's been signed by our offline signing infrastructure. Furthermore, each update file is further signed (and encrypted) by your Console so your bird won't accept an update from another Console (even if it's a legitimate one). Lastly, the update is delivered via our custom DNS transport overlay which is also encrypted. An attacker wishing to push code to your Canary would need to compromise both your cloud Console, as well as the physical offline update-signing infrastructure.
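As a rough illustration of the layering (and ignoring the encryption layer for brevity), here is a hedged PyNaCl sketch in which a bird only accepts an update carrying both a valid Console signature and a valid offline-infrastructure signature. The key handling and update format here are invented for the example; our real pipeline differs in detail:

# A hedged sketch of two-layer signature checking with PyNaCl's Ed25519 API.
# Keys and the update format are invented for illustration.
import nacl.signing
import nacl.exceptions

def verify_update(blob, console_verify_key, offline_verify_key):
    """Peel off the Console signature, then the offline signing key's signature."""
    try:
        inner = console_verify_key.verify(blob)      # outer layer: this Console only
        return offline_verify_key.verify(inner)      # inner layer: offline infrastructure
    except nacl.exceptions.BadSignatureError:
        return None                                  # refuse the update

# Stand-in keys; real keys live on the offline infrastructure and your Console.
offline_key = nacl.signing.SigningKey.generate()
console_key = nacl.signing.SigningKey.generate()
update_blob = console_key.sign(offline_key.sign(b"new bird software"))

assert verify_update(update_blob, console_key.verify_key,
                     offline_key.verify_key) == b"new bird software"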

Console monitoring

Your Console is a dedicated instance running on EC2. This simple architectural decision means that even if one customer-console was breached, there's no other customer data present. This single-tenant model also removes the risk of web-app bugs yielding data from other customers.

Aside from the usual hardening, we've taken other steps to further minimise "surprises". All syscalls across our fleet are monitored, and any server doing anything "new" quickly raises alarms. (We also make sure that the server only serves content we've expressly permitted).
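The underlying idea is simple: keep a baseline of what each server has been seen to do, and treat anything outside that baseline as worth an alarm. A toy sketch of that idea (not our production tooling, and the event names are invented):

# A toy version of "alert on anything new" -- not our production tooling.
import json

def new_events(host, observed, baseline_path="baseline.json"):
    """Return events never seen on this host before, and persist the updated baseline."""
    try:
        with open(baseline_path) as f:
            baseline = json.load(f)
    except FileNotFoundError:
        baseline = {}
    known = set(baseline.get(host, []))
    unseen = set(observed) - known
    baseline[host] = sorted(known | unseen)
    with open(baseline_path, "w") as f:
        json.dump(baseline, f, indent=2)
    return unseen

# Anything returned here is worth a closer look; an empty set is the happy case.
print(new_events("console-ab123456", ["execve:/usr/bin/curl"]))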

By default, your Console won't hold any special data from your network. Alerts come through with information related only to a detected attack, and even though we support masking in the alert to make sure that you won't have an attacker-supplied password lying in your inbox, it's probably a good idea to cycle a password that an attacker has made use of. :)

(Password "masked" in email alert)


Customer-Support access

On the back-end, selected Thinkst staff need to jump through several hoops and jump-points before gaining access to your console. At every jump, they are required to MFA, and access is both logged and generates an alert. (Once more this means that such access can't happen under the radar).

(CS access to a Canary Console)
In addition to this, some customers request that no Thinkst staff access their console. These customers have the back-end authentication/MFA link broken. This means that Thinkst staff cannot authenticate to the customer console at all.

Third-party assessments

We've also had a crystal-box assessment performed of both the Canaries and the Console by one of the leading app-sec teams in the business. A copy of their report is available on request, but their pertinent, summarising snippet is:

"The device platform and its software stack (outside of the base OS) has been designed and implemented by a team at Thinkst with a history in code product assessments and penetration testing (a worthy opponent one might argue), and this shows in the positive results from our evaluation.
Overall, Thinkst have done a good job and shown they are invested in producing not only a security product but also a secure product."

Wrapping up

So, is Canary an impossible target? Of course not; that's why we wrote "safer designs" above, not "safe designs".

But we have put a lot of thought into making sure we don't introduce vulnerabilities to a customer network. We've put tons of effort into making sure that we limit the blast radius of any problem that does show up. And if a bird can get off just one warning before it's owned, it's totally lived up to its namesake and earned its keep...

HackWeek 2019

Last week team Thinkst downed tools again for our bi-annual HackWeek. The rules of HackWeek are straightforward:
  • Make Stuff;
  • Learn;
  • Have fun.
We discussed HackWeek briefly last year:
Our HackWeek parameters are simple: We down tools on all but the most essential work (primarily anything customer-facing) and instead scope and build something. The project absolutely does not have to be work-related, and people can work individually or in teams. The key deadline is a 10-minute demo on the Friday afternoon. The demos are in front of the rest of the team, and results count more than intentions.
We pride ourselves on being a "learning organization" and HackWeek is one of the things that help make that happen. It's always awesome seeing a software developer solder their first board or seeing someone non-technical write their first lines of Python.

Project highlights this year: 

Az used the SimH simulator to run an obscure Soviet Mainframe (the BESM-6):


Eventually, he had the mainframe pushing the keys on a Pokemon game running in a simulator using Fortran (because, of course!). Along the way he had to deal with Russian manuals and, uh, learning Fortran.


Mike built "Incubator" to manage our stock of Canary raw materials:


Riaan threw in a physical hack to make sure fewer cars were scratched when parking in the basement, and built a physical status monitor for our support queues:


Keagan decided to combine ModSecurity hackery & testing to add extra protection to our new flocks consoles:


Haroon took a crack at some d3 fiddling to create art (and inspectable graphs) with our customer logos but sadly this can't be shown :)


Quinton used an Arduino and some jury-rigged hardware to keep better track of scores for the indoor cricket games held in the Jhb office:


Jay used the incredible work by the openDrop people to create a fake AirDrop service on our Canaries.

First, configure it through your Canary Console:


Once the bird loads, it becomes visible to people in its vicinity using AirDrop on their Macs or iPhones:


After an attacker submits a file, the Canary alerts as usual:


Donovan flirted with Flask and Python to make another interface to download Canarytokens.

Danielle dived into Verilog to get her Quartus II FPGA to voice-print individuals:



Marco embedded draw.io into our Phabricator setup to allow us phriction-phree-phlowcharting:


Max broke out Unity to build a game for the Oculus:



Matt wrote a game for his Nintendo switch:


Bradley attempted to give Apple designers aneurysms by affixing a travel LCD to his laptop for a MacGyver'd screen extender:


Nick and Anna paired up to create a hardware/software combo. They used Raspberry Pis, a pack of blank credit cards, stepper motors and toothpicks(?) to create a 9-digit split-flap display for the Cape Town office.

 

(I would have totally given it the prize for "most soothing sound made by any HackWeek project, ever".)

Adrian combined the Canary API and his nostalgia for CLI interfaces to make a lo-fi Canary Console:


Yusuf built an app/bot that could be summoned on Twitter to compile tweet-storms to blog posts (and learned the harsh lessons of unforgiving HackWeek deadlines.)


"A fun time was had by all" (tm)

Canary Alerts, Part 2 - Bonus Flavours

Canaries and Canarytokens are tripwires that can alert you to intrusions. When alerts trigger, we want to make sure you get them where you need them. While our Slack integration is cool, you might prefer to send alerts through your SIEM. Or to a security automation tool. Maybe you want to leverage our API to integrate Canary alerts into a custom SOC tool. Want to turn a smart light bulb red and play the Imperial March? You could do that too.
(IFTTT applet that blinks a light when a Canary alert is received)

Your way or the highway

We often puzzle at products that require customers to totally revamp how they do things. We never presume to be the most important tool in your toolbox, which is why our product is designed to be installed, configured, and (somewhat) forgotten, in minutes. We’d rather disappear into your existing workflow, only becoming visible again when you need us most.

Our customers dictate where and how they see our alerts. To enable this, we provide a wide variety of flexible options for sending and consuming alerts.

By default, you’ll get alerts on your console...

In your email…


...and as a text message.

And that’s not all…

For those of you wondering where the SIEM love is at, don’t worry. We can send syslog where you need it, as secure as you need it. A quick email to support@canary.tools with the details for your syslog endpoint will get the logs flowing in no time.


For Splunk fans, we have a Splunk app that works with both Splunk Enterprise and Splunk Cloud. Details on installing and configuring the Splunk app can be found in our help documentation.

Email can also be an easy way to integrate Canary alerts with other tools. For example, most task and ticket management systems support creating tickets or tasks from an email. ServiceNow and BMC Remedy are common in large enterprises, but what about something simpler, with a free plan? Something you could set up in minutes, like a Canary?

Build a SOC dashboard in 5 minutes, for free

We’re going to use Trello as an example of how flexible email can be for alert integration.

It turns out, Trello aligns well with the spirit of simple, fast and ‘just works’. Finding the custom email address that allows new card creation takes just a few clicks. Then, paste it in the email notifications list in your console settings and you’re good to go. Canary alerts will start showing up in Trello on the board and list you chose to attach the Trello email to.


A simple three-list configuration should work for basic alert triage: new alerts, acknowledged (being worked) and completed.


Any Canaries or Canarytokens triggered will result in a new card dropping into the New Alerts column immediately. Drag the card over to the Ack column, assign it to someone, and Trello can notify them (based on your Trello configuration). Each card contains the full content of the alert and supports comments and attachments.

Once the investigation is complete, the card can be dragged over to the final column.

And, of course, an API

Anything you can do or view in the Canary console can be done via our fully documented API. It’s possible to control Canaries, create Canarytokens, view alerts, manage alerts and much more. Following is a simple bash script demonstrating how to grab a week’s worth of alerts and dump them into a spreadsheet-friendly format (CSV). Also available as a gist.

#!/bin/bash
# Create a CSV with the last week's worth of alerts from your Canary console
# Requires curl and jq to be in the path

# Set this variable to your API token
export token=deadbeef12345678

# Customize this variable to match your console URL
export console=ab123456.canary.tools

# One week ago, in the format the API expects (note: BSD/macOS date syntax;
# on GNU/Linux use: date -d '1 week ago' "+%Y-%m-%d-%H:%M:%S")
export dateformat=`date -v-1w "+%Y-%m-%d-%H:%M:%S"`

# Filename date (right now)
export filedate=`date "+%Y%m%d%H%M%S"`

# Complete Filename
export filename=$filedate-$console-1week-alert-export.csv

# Base URL
export baseurl="https://$console/api/v1/incidents/all?auth_token=$token&shrink=true&newer_than"

# Run the jewels
echo Datetime,Alert Description,Target,Target Port,Attacker,Attacker RevDNS > $filename
curl "$baseurl=$dateformat" | jq -r '.incidents[] | [.description | .created_std, .description, .dst_host, .dst_port, .src_host, .src_host_reverse | tostring] | @csv' >> $filename


Taking Flight

Like everything else Canary-related, alerts should be dead simple and easy to work with. Though alert volumes from Canaries are incredibly low (customers with dozens of Canaries report just a handful of alerts per year) we include a bunch of options to cover everything from common requests to esoteric requirements.

If you have any clever ideas on integrating alerts or consuming them, we’d love to hear them! Drop us a message on Twitter @ThinkstCanary or via email, support at canary dot tools.

Alerts Come in Many Flavours

If you force people to jump through hoops to handle alerts, they’ll soon stop doing it 🤯
Canary optimizes for fewer alerts, but we also ensure that you can handle alerts easily without us. It takes just 4 minutes to set up a Canary, and far less to pull our alerts into Slack.

By default, your console will send you alerts via email or SMS, but there are a few other tricks up its sleeve. It is trivial to also get alerts via webhooks, syslog or our API.
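"Webhook-friendly" really does mean minimal effort on the receiving side. Here is a sketch of a generic receiver, assuming a small Flask app; the endpoint path and what you do with the JSON payload are entirely up to you:

# A minimal generic-webhook receiver sketch, assuming Flask is installed.
from flask import Flask, request

app = Flask(__name__)

@app.route("/canary-alert", methods=["POST"])
def canary_alert():
    alert = request.get_json(force=True, silent=True) or {}
    # Hand the alert to whatever you actually live in: chat, tickets, a pager...
    print("Canary alert received:", alert)
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)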

This post will show you how to get alerts into your Slack. The process is similar for Microsoft Teams and other messaging apps that use webhooks for integration. It’s quick, painless and super useful.

(This post is unfortunately now also bound to be anti-climactic - it’s going to take you longer to read this than to do the integration).


Did you know how easy this can be?
The Canary Console can integrate with Microsoft Teams and Slack in seconds and with a few more steps, can integrate with any other webhook-friendly platform. The process is similar for most platforms, but here’s how it looks for Slack.

  1. Enable Webhooks in your Canary Console settings.
  2. Click Add to Slack, choose the channel to drop alerts into and click Allow
  3. That’s it! You now have Canary alerts showing up in Slack. Elapsed setup time? About 30 seconds.


Now that you’ve got Canary alerts integrated into Slack, you can actually interact with them. When an alert shows up in Slack, you’re given an option to mark it as ‘seen’, which removes it from the queue of unacknowledged alerts.

You can even permanently delete it from inside Slack - no need to even log into the console. Here’s a peek at what the process looks like.

Why we’re so keen to get alerts out of the console

You’ve got enough consoles already. Heck, you may even have multiple "single panes of glass". We’re not interested in adding our console to the already long list of security tools to check on a daily or hourly basis. We realise and deeply understand that it’s not about us, it’s about you. That’s why we make it so easy to pull Canary alerts into your existing workflows.

Live in Slack? We’ll alert you there.
Live on your phone? We’ll text you.
Live in Outlook? We’ll drop you an email.
Want all-of-the-above, just in case? We can do that too.

I'm Running Canaries, but...

...what if someone finds out?

Do attackers care if there are canaries in my network?

People wonder if they need to hide the defensive tech used on their networks. Like all interesting dilemmas, the answer is nuanced.


In defense of obscurity

In any discussion about obscurity you will almost certainly have someone shout about “security through obscurity” being bad. As a security strategy, obscurity is a terrible plan. As an opportunity to slow down or confuse attackers, it’s an easy win. Every bit of information an attacker has to gather during a campaign gains the defender time.

This is very much a race against time. No breach happens the moment a shell is popped or SQL injection is discovered. Attackers are flying blind and must explore the environments they’ve broken into to find their target. Defenders can seize the opportunity to stop an incident before it becomes a breach.

It is often true that attackers typically operate with a fuller view of the chessboard than defenders. However, when environments are running with defaults, they meet attackers' expectations. Defenders who are able to introduce unexpected defenses or tripwires to this chessboard can turn this asymmetry to their advantage.


What are defenders so afraid of?

Defenders tend to be concerned that their security products:
1. could, themselves, be insecure
2. may not work as expected when attacked
3. could possibly be evaded if attackers are aware of them
4. will simply eat labor without producing much value

Pardon the pun, but this isn’t a very defensible position to be in.

We know very well from Tavis Ormandy, Joxean Koret, Veracode, and others that security software and products are notoriously insecure. According to Veracode, in fact, they come in next-to-last place.


If that’s not discouraging enough, the average security product is difficult to configure, challenging to use and requires significant resources to run and maintain. There is no shortage of reasons for wanting to hide the details of security products in use.

The Importance of Resilience

Let’s consider the flipside for a moment: offensive tools and capabilities. There’s a solid argument for keeping offensive capabilities secret. For example, the zero-day vulnerabilities used by Stuxnet wouldn’t have been as effective if they had been previously reported and patched. For some time, military aircraft have had advantages because details of their capabilities or even their very existence were closely guarded secrets.

Defenses are a very different case, however. These must stand the test of time. They are often visible to outsiders and similar to defenses used by other organizations. Vendors, after all, will advertise their products in order to sell them. Defenses need to hold up under close scrutiny and be robust enough to last for years without needing to be replaced. Keeping them secret could perhaps slow an attacker down, but not by an appreciable amount.

Ultimately, defenses need to work regardless of whether attackers are aware of their presence.

Attackers Discover Your Secret: Canaries

It’s okay - we’ve planned for this moment. We spent significant effort ensuring Canaries are unlikely to ever be the ‘low hanging fruit’ on any network. We’ve also made architecture choices that minimize blast radius should a Canary ever be exploited (e.g. we won’t span VLANs, ever). In short, compromising a Canary would be very difficult and will never improve an attacker’s position.

With a direct attack against a Canary unlikely to prove useful, let’s look at the attacker’s remaining options.

Scenario 1: The attacker has no idea you’ve deployed Canaries and Canarytokens. Since they’re not expecting honeypots, they’re less concerned with being noisy. They’re likely to trip alerts all over the place, as they run scans and attempt to log into interesting-looking devices.

Scenario 2: The attacker knows you use Canaries, but they’re flying blind. Even though they know honeypots are in use, they don’t know which are real and which are fake. This presents them with a dilemma - being sneaky is a lot more work, but they still need some way of exploring the network without triggering alerts. It’s likely to be in the attacker’s best interest to find a different target.


An unexpected bonus we never planned for is that Canaries are super scalable. Many customers start with five or ten and grow to dozens or hundreds. Stepping back into the attacker’s shoes - are you on a network with five or five hundred? Has this organization deployed a hundred Canarytokens or a million?

Conclusion

The underlying principle is a shift in thinking. Defeatist phrases like, “it’s not a matter of if, but when you get breached” have discouraged defenders. The reality is that the attacker is typically coming in blind, while the defender has control over the environment. By setting traps and tripwires, the defender can tip the outcome in their favor.

We think it’s a very positive and empowering change for the defender mindset. It’s your network - own it and rig the game.



Introducing Rapsheet

We've got hundreds of servers and thousands of Canaries deployed in the world. Keeping them healthy is a large part of what we do, and why customers sign up for Canary. Monitoring plays a big role in supporting our flocks and keeping the infrastructure humming along. A pretty common sight in operations is dashboards covered with graphs, charts, widgets, and gizmos, all designed to give you insight into the status of your systems. We are generally against doing things “just because everyone does it” and have avoided plastering the office with “pew-pew maps” or vanity graphs.

(although the odd bird-migration graph does slip through)

As with most ops-related checks, many of ours are rooted in previous issues we've encountered. We rely heavily on DNS for comms between our birds and consoles, and interruptions in DNS are something we want to know about early. Likewise, we want to ensure each customer console (plus our other web properties) is accessible.

There are tools available for performing DNS and HTTP checks, but our needs around DNS are somewhat unusual. We want to be able to quickly confirm whether each of our consoles is responding correctly, across multiple third party DNS providers (e.g. Google, Cloudflare, OpenDNS). For a handful of domains that's scriptable with host, but for many hundreds of domains this becomes an issue if you want to be able to detect issues fairly quickly (i.e. within tens of seconds of the failure).
To plug this gap we built Rapsheet, to give us a “list of crimes” of our systems against intermediary network service providers. In this post I'll provide a quick run-through of why we built it.

Goal: "zero measurements"

In this post, I am going to dive a little deeper into the thinking behind Rapsheet and why the dashboard behaves the way it does: why we aim for zero measurements, and what that actually means. (Hint: it doesn't mean "take no measurements".)
Much of the thinking was expounded in this awesome Eric Brandwine AWS re:Invent talk:


If you have not yet watched it, you really should. Eric is whip-smart and an excellent presenter.
The key takeaway for this post is that alarms and dashboards often lead to engineers and other technicians developing what is known as “alarm deafness”. Primarily studied in the medical world, it describes the situation where operators rely on a raft of metrics with tunable parameters. If the parameters are too tight, the operators learn to ignore the alarm. If they’re too loose, bad things happen without anyone being the wiser. Alarm deafness grows when a specific check or metric is constantly in an “alerting” state. Eric points out that the best way to alleviate alarm deafness is to constantly strive for “a zero measurement”, because as soon as you see anything other than a zero measurement, you know that action is required.

If you can find those zero-measurement metrics then they provide clear tasks for the operations folks, for whom a key objective is to keep all panels in a non-alerting state (i.e. a zero measurement). With our current setup, I will drop most things I am busy with whenever I see one of the panels in any colour other than green.

A counterexample to an actionable measurement is almost anything which provides a count or tally of expected conditions (even if the condition is an error). For example, we rely on DNS for a range of services and are able to track DNS failures across multiple providers. A poor metric would be a graph displaying how many DNS queries have failed within the last few hours against any of the DNS servers. This graph may look interesting, but nothing about it would lead to an exact action that engineers could take. Transient network issues lead to queries failing every so often, and our engineers aren't expected or authorised to fix the backbone network between our hosting site and the 3rd-party DNS servers.



Instead we only track which of our servers are not currently responding correctly to our various DNS queries. With this metric, it becomes a lot easier for us to spot patterns, and therefore understand the root cause of the failing DNS queries. For example, we rely on Rapsheet to tell us when a particular DNS provider is experiencing issues, or a particular server is experiencing DNS issues, or one of our domains has been flagged as unsafe. This leads to issues being mitigated and resolved more timeously.

Architecture

At its core it's straightforward. A long-running process uses asyncio to periodically perform a whole bunch of network tests (DNS, HTTP) and pumps the results into InfluxDB, and we build a simple UI on top with Grafana.

The tests are tied together based on a customer, so we can quickly determine if an issue is customer-specific or more general.

Rapsheet has modules to implement each of its tests. Current modules include DNS checks against four different DNS providers, and a variety of HTTP health checks covering reachability, served content, and endpoint checks against site blacklists such as Google's Safe Browsing. All endpoints are checked asynchronously and the results collated before the metrics are posted for display in Grafana. A simplified sketch of the DNS side follows.
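Here's that simplified sketch (not Rapsheet itself), assuming dnspython's asyncio resolver. Each console hostname is queried against several public resolvers, and the only number we keep is the count of failing checks, which should be zero:

# A simplified sketch of the DNS checks -- not Rapsheet itself.
# Assumes dnspython >= 2.0 for dns.asyncresolver; console names are invented.
import asyncio
import dns.asyncresolver

RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1", "opendns": "208.67.222.222"}
CONSOLES = ["ab123456.canary.tools", "cd789012.canary.tools"]

async def check(console, provider, nameserver):
    resolver = dns.asyncresolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = 5  # seconds before we call the query a failure
    try:
        await resolver.resolve(console, "A")
        return None
    except Exception:
        return (console, provider)

async def main():
    tasks = [check(c, p, ns) for c in CONSOLES for p, ns in RESOLVERS.items()]
    failures = [f for f in await asyncio.gather(*tasks) if f]
    # The only number we care about: it should be zero.
    print(f"{len(failures)} failing checks: {failures}")

asyncio.run(main())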

Each new panel added to the dashboard goes through a number of iterations on the backend to ensure that it keeps to this mindset.


It’s built to be extensible, so we can quickly add new zero-based measurements we want to track. 

A recent example was exception counting. We have a design goal of fixing unhandled exceptions. The dashboard has an “exception monitoring” panel that tracks the number of unhandled exceptions across all of our deployed code (on customer consoles and our own servers). Coming back to the notion of focus, it becomes a very clear goal for me: handle the exceptions that are cropping up. Most of the time this involves a deep dive into what is causing the exception and how we should best mitigate those cases. When introduced, the panel would get tripped up by noisy but harmless exceptions (and would just burn red). After a flurry of root-cause fixes, and thanks to the goal of driving them to zero, we only see a couple every other day across hundreds of production systems (and work is under way to alleviate those too). A toy sketch of what the reporting side could look like follows.
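This is that toy sketch: an excepthook that ships each unhandled exception off to a metrics endpoint so a panel can count them. The URL is hypothetical and our real plumbing differs, but the shape is the same:

# A toy illustration of counting unhandled exceptions fleet-wide.
# The metrics endpoint is hypothetical; real reporting plumbing will differ.
import sys
import traceback
import urllib.request

METRICS_URL = "https://metrics.example.internal/exceptions"  # hypothetical endpoint

def report_unhandled(exc_type, exc, tb):
    payload = "".join(traceback.format_exception(exc_type, exc, tb)).encode()
    try:
        urllib.request.urlopen(urllib.request.Request(METRICS_URL, data=payload), timeout=2)
    except Exception:
        pass  # reporting must never take the process down
    sys.__excepthook__(exc_type, exc, tb)  # still print the traceback as usual

sys.excepthook = report_unhandled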

Rate-limiting

Some of the panels involve performing requests or queries against third-party services. For example, we want to know if someone using the Chrome browser is able to log in to their console.
In order to "play well with others" and use those third-party services fairly, Rapsheet has rate limits (especially since asyncio can push out lots of traffic). Most third-party services tell you what these rate limits are so they can be implemented correctly. Interestingly, Google's DNS service doesn't; only by testing in practice did we figure out what rate to limit our queries to. A hedged sketch of one way to do that with asyncio follows.
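Here's that sketch: a semaphore bounds the number of in-flight requests against a provider, and a short hold on each slot flattens bursts. The numbers are illustrative, not anyone's published limits:

# A hedged sketch of pacing queries to one provider with asyncio.
import asyncio

MAX_IN_FLIGHT = 10   # concurrent queries allowed against a single provider
HOLD = 0.1           # seconds a slot stays occupied after its query returns

sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def rate_limited(coro):
    """Run a query coroutine while holding one of the limited slots."""
    async with sem:
        result = await coro
        # Holding the slot briefly caps throughput at roughly
        # MAX_IN_FLIGHT / HOLD queries per second, even when answers are instant.
        await asyncio.sleep(HOLD)
        return result

# Usage, reusing the check() coroutine from the DNS sketch above:
#   results = await asyncio.gather(
#       *(rate_limited(check(c, "google", "8.8.8.8")) for c in CONSOLES))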

Location, location, location

Almost all of Canary's services run inside AWS. It therefore made sense to put the monitoring outside AWS. We didn’t want to fall victim to a situation where everything looked good to tests inside AWS but, due to AWS issues, was silent to the outside world. So we run Rapsheet at a wholly independent provider.

So far, it's proved quite adept at picking up issues in this hosting provider's network...

Wrap-up

The dashboard is projected onto a wall in the operations office, which is the main thoroughfare from our desks to the office facilities, so anyone working in the office has to see it when leaving their desk.


(Of course, it’s online for remote workers who are able to keep it open on their desktops if needed.)

Rapsheet is a work in progress, but helps us make sure that things are ticking along as expected.