I'm Running Canaries, but...

...what if someone finds out?

Do attackers care if there are canaries in my network?

People wonder if they need to hide the defensive tech used on their networks. As with all interesting dilemmas, the answer is nuanced.


In defense of obscurity

In any discussion about obscurity you will almost certainly have someone shout about “security through obscurity” being bad. As a security strategy, obscurity is a terrible plan. As an opportunity to slow down or confuse attackers, it’s an easy win. Every bit of information an attacker has to gather during a campaign gains the defender time.

This is very much a race against time. No breach happens the moment a shell is popped or SQL injection is discovered. Attackers are flying blind and must explore the environments they’ve broken into to find their target. Defenders can seize the opportunity to stop an incident before it becomes a breach.

It is often true that attackers operate with a fuller view of the chessboard than defenders. However, when environments are running with defaults, they meet attackers' expectations. Defenders who are able to introduce unexpected defenses or tripwires to this chessboard can turn this asymmetry to their advantage.


What are defenders so afraid of?

Defenders tend to be concerned that their security products:
1. could, themselves, be insecure
2. may not work as expected when attacked
3. could possibly be evaded if attackers are aware of them
4. will simply eat labor without producing much value

Pardon the pun, but this isn’t a very defensible position to be in.

We know very well from Tavis Ormandy, Joxean Koret, Veracode, and others that security software and products are notoriously insecure. According to Veracode, in fact, they come in next-to-last place.


If that’s not discouraging enough, the average security product is difficult to configure, challenging to use and requires significant resources to run and maintain. There is no shortage of reasons for wanting to hide the details of security products in use.

The Importance of Resilience

Let’s consider the flipside for a moment: offensive tools and capabilities. There’s a solid argument for keeping offensive capabilities secret. For example, the zero-day vulnerabilities used by Stuxnet wouldn’t have been as effective if they had been previously reported and patched. For some time, military aircraft have had advantages because details of their capabilities or even their very existence were closely guarded secrets.

Defenses are a very different case, however. These must stand the test of time. They are often visible to outsiders and similar to defenses used by other organizations. Vendors, after all, will advertise their products in order to sell them. Defenses need to hold up under close scrutiny and be robust enough to last for years without needing to be replaced. Keeping them secret could perhaps slow down an attacker, but not by an appreciable amount.

Ultimately, defenses need to work regardless of whether attackers are aware of their presence.

Attackers Discover Your Secret: Canaries

It’s okay - we’ve planned for this moment. We spent significant effort ensuring Canaries are unlikely to ever be the ‘low hanging fruit’ on any network. We’ve also made architecture choices that minimize blast radius should a Canary ever be exploited (e.g. we won’t span VLANs, ever). In short, compromising a Canary would be very difficult and will never improve an attacker’s position.

With a direct attack against a Canary unlikely to prove useful, let’s look at the attacker’s remaining options.

Scenario 1: The attacker has no idea you’ve deployed Canaries and Canarytokens. Since they’re not expecting honeypots, they’re less concerned with being noisy. They’re likely to trip alerts all over the place, as they run scans and attempt to log into interesting-looking devices.

Scenario 2: The attacker knows you use Canaries, but they’re flying blind. Even though they know honeypots are in use, they don’t know which are real and which are fake. This presents them with a dilemma - being sneaky is a lot more work, but they still need some way of exploring the network without triggering alerts. It’s likely to be in the attacker’s best interest to find a different target.


A bonus we never planned for is that Canaries are super scalable. Many customers start with five or ten and grow to dozens or hundreds. Stepping back into the attacker’s shoes - are you on a network with five or five hundred? Has this organization deployed a hundred Canarytokens or a million?

Conclusion

The underlying principle is a shift in thinking. Defeatist phrases like “it’s not a matter of if, but when you get breached” have discouraged defenders. The reality is that the attacker is typically coming in blind, while the defender has control over the environment. By setting traps and tripwires, the defender can tip the outcome in their favor.

We think it’s a very positive and empowering change for the defender mindset. It’s your network - own it and rig the game.



Introducing Rapsheet

We've got hundreds of servers and thousands of Canaries deployed in the world. Keeping them healthy is a large part of what we do, and why customers sign up for Canary. Monitoring plays a big role in supporting our flocks and keeping the infrastructure humming along. Dashboards covered with graphs, charts, widgets, and gizmos, all designed to give you insight into the status of your systems, are a pretty common sight in operations. We are generally against doing things “just because everyone does it” and have avoided plastering the office with “pew-pew maps” or vanity graphs.

(although the odd bird-migration graph does slip through)

As with most ops-related checks, many of ours are rooted in previous issues we've encountered. We rely heavily on DNS for comms between our birds and consoles, and interruptions in DNS are something we want to know about early. Likewise, we want to ensure each customer console (plus our other web properties) is accessible.

There are tools available for performing DNS and HTTP checks, but our needs around DNS are somewhat unusual. We want to be able to quickly confirm whether each of our consoles is responding correctly, across multiple third party DNS providers (e.g. Google, Cloudflare, OpenDNS). For a handful of domains that's scriptable with host, but for many hundreds of domains this becomes a problem if you want to detect failures fairly quickly (i.e. within tens of seconds of the failure).
To plug this gap we built Rapsheet, to give us a “list of crimes” of our systems against intermediary network service providers. In this post I'll provide a quick run through of why we built it.
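To make the scale problem concrete, here is a minimal sketch (not Rapsheet's actual code) of fanning out one lookup per domain and provider concurrently with asyncio. The resolve callable, check_one and check_all are hypothetical names; a real resolve would send a DNS query to the given resolver address, while here it is injected so the sketch runs without network access. The addresses are the providers' well-known public resolvers.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical resolver signature: (domain, resolver_ip) -> list of answers.
Resolver = Callable[[str, str], Awaitable[List[str]]]

# Well-known public resolver addresses for the providers named above.
PROVIDERS: Dict[str, str] = {
    "google": "8.8.8.8",
    "cloudflare": "1.1.1.1",
    "opendns": "208.67.222.222",
}


async def check_one(resolve: Resolver, domain: str, provider: str, ip: str,
                    expected: str, timeout: float) -> Tuple[str, str, bool]:
    """Run one lookup; any error, timeout or wrong answer counts as a failure."""
    try:
        answers = await asyncio.wait_for(resolve(domain, ip), timeout)
        ok = expected in answers
    except Exception:
        ok = False
    return domain, provider, ok


async def check_all(resolve: Resolver, expected_by_domain: Dict[str, str],
                    timeout: float = 5.0) -> List[Tuple[str, str, bool]]:
    """Check every domain against every provider concurrently."""
    tasks = [
        check_one(resolve, domain, provider, ip, expected, timeout)
        for domain, expected in expected_by_domain.items()
        for provider, ip in PROVIDERS.items()
    ]
    return await asyncio.gather(*tasks)
```

Because all lookups are in flight at once, a full sweep takes roughly as long as the slowest single query (bounded by the timeout) rather than the sum of all of them, which is what makes detection within tens of seconds plausible for hundreds of domains.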

Goal: "zero measurements"

In this post, I am going to dive a little deeper into the thinking behind Rapsheet and why the dashboard behaves the way it does: why we aim for zero measurements, and what this actually means. (Hint: it doesn't mean "take no measurements".)
Much of the thinking was expounded in an awesome AWS re:Invent talk by Eric Brandwine.


If you have not yet watched it, you really should. Eric is whip smart and is an excellent presenter.
The key takeaway for this post is that alarms and dashboards will often lead to engineers and other technicians developing what is known as “alarm deafness”. Primarily studied in the medical world, it describes the situation where operators rely on a raft of metrics with tunable parameters. If the parameters are too tight, the operators learn to ignore the alarm. If they’re too loose, bad things happen without anyone being the wiser. Alarm deafness grows when a specific check or metric is constantly in an “alerting” state. Eric points out that the best way to alleviate alarm deafness is to constantly strive for “a zero measurement”, because as soon as you see anything other than a zero measurement, you know that action is required.

If you can find those zero measurement metrics then they provide clear tasks for the operations folks, for whom a key objective is to keep all panels in a non-alerting state (i.e. zero measurement). With our current setup, I will drop most things I am busy with whenever I see one of the panels in any colour other than green.

A counter-example to an actionable measurement is almost anything which provides a count or tally of an expected condition (even if it’s an error). For example, we rely on DNS for a range of services and are able to track DNS failures across multiple providers. A poor metric would be a graph displaying how many DNS queries have failed within the last few hours against any of the DNS servers. This graph may look interesting but nothing about the graph would lead to an exact action that engineers could take. Transient network issues lead to queries failing every so often and our engineers aren't expected or authorised to fix the backbone network between our hosting site and the third party DNS servers.



Instead we only track which of our servers are not currently responding correctly to our various DNS queries. With this metric, it becomes a lot easier for us to determine patterns, and therefore understand the root cause of the problems that are responsible for the failing DNS queries. For example, we rely on Rapsheet to tell when a particular DNS provider is experiencing issues, or if a particular server is experiencing DNS issues, or whether one of our domains has been flagged as unsafe. This then leads to the issues being mitigated and resolved more timeously.
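As an illustration of why the "currently failing" set is easier to act on than a failure tally, here is a small sketch, assuming results arrive as (server, provider, ok) tuples. The failing and classify helpers and their labels are hypothetical, not Rapsheet's real logic.

```python
from typing import Iterable, List, Tuple

Result = Tuple[str, str, bool]  # (server, provider, ok)


def failing(results: Iterable[Result]) -> List[Tuple[str, str]]:
    """The zero measurement: an empty list means nothing needs doing."""
    return sorted((server, provider) for server, provider, ok in results if not ok)


def classify(results: Iterable[Result]) -> str:
    """Guess where the problem lies from the shape of the failing set."""
    bad = failing(results)
    if not bad:
        return "zero"
    servers = {server for server, _ in bad}
    providers = {provider for _, provider in bad}
    if len(providers) == 1 and len(servers) > 1:
        # Several servers fail, but only via one provider.
        return "provider:" + next(iter(providers))
    if len(servers) == 1 and len(providers) > 1:
        # One server fails no matter which provider is asked.
        return "server:" + next(iter(servers))
    return "unclear"
```

A panel driven by failing() is green exactly when the list is empty, and the shape of a non-empty list already hints at whether to look at a provider or at one of our own servers.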

Architecture

At its core it's straightforward. A long-running process uses asyncio to periodically perform a whole bunch of network tests (DNS, HTTP), pumps the results into InfluxDB and then we build a simple UI in Grafana.
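The loop described above might look roughly like the following sketch. It is an assumption-laden illustration, not the real implementation: run_rounds, to_line, the rapsheet_check measurement name and the sink callable are all hypothetical. Results are formatted as InfluxDB line protocol (measurement, tags, an integer field, a nanosecond timestamp); a real sink would ship those lines to InfluxDB, whereas here it is any callable that accepts a string.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional

Check = Callable[[], Awaitable[bool]]  # resolves to True when the check passed


def _escape(value: str) -> str:
    # Line protocol needs commas, equals signs and spaces escaped in tags.
    return value.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line(measurement: str, tags: Dict[str, str], failed: bool, ts_ns: int) -> str:
    """Format one result as a line-protocol point (simple measurement names assumed)."""
    tag_str = ",".join(f"{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_str} failed={int(failed)}i {ts_ns}"


async def run_rounds(checks: Dict[str, Check], sink: Callable[[str], None],
                     interval: float, rounds: Optional[int] = None) -> None:
    """Run all checks concurrently every `interval` seconds (forever if rounds is None)."""
    done = 0
    while rounds is None or done < rounds:
        names = list(checks)
        outcomes = await asyncio.gather(*(checks[n]() for n in names),
                                        return_exceptions=True)
        now = time.time_ns()
        for name, outcome in zip(names, outcomes):
            # An exception or anything other than True is recorded as a failure.
            sink(to_line("rapsheet_check", {"check": name}, outcome is not True, now))
        done += 1
        if rounds is None or done < rounds:
            await asyncio.sleep(interval)
```

Writing failed as 0 or 1 keeps every series zero-based, so a dashboard panel can alert on any non-zero value.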

The tests are tied together based on a customer, so we can quickly determine if an issue is customer-specific or more general.

Rapsheet has modules to implement each of its tests. Current modules include DNS checks against four different DNS providers and various HTTP health checks, including reachability, served content and endpoint checks against site blacklists such as Google’s SafeBrowsing. All endpoints are checked asynchronously and the results collated before the metrics are posted to InfluxDB for display in Grafana.

Each time a new panel is added to the dashboard it has gone through a number of iterations on the backend in order to ensure that it keeps to this mindset.


It’s built to be extensible, so we can quickly add new zero-based measurements we want to track. 

A recent example was exception counting. Fixing unhandled exceptions is one of our design goals. The dashboard has an “exception monitoring” panel that tracks the number of unhandled exceptions across all of our deployed code (on customer consoles and our own servers). Coming back to the notion of focus, it becomes a very clear goal for me: handle the exceptions that are cropping up. Most of the time it involves a deep dive into what is causing the exception and how we should best mitigate those cases. When introduced, the panel would get tripped up with noisy but harmless exceptions (and those panels would just burn red). After a flurry of root cause fixes, and thanks to the goal of driving them to zero, we only get a couple every other day across hundreds of production systems (and work is under way to alleviate those too).

Rate-limiting

Some of the panels involve performing requests or queries against third party services. For example, we want to know if someone using the Chrome browser is able to log in to their console.
In order to "play well with others" and use third party services fairly, Rapsheet has rate limits (especially since asyncio can push out lots of traffic). Most third party services let you know what these rate limits are so they can be implemented correctly. Interestingly, Google's DNS service doesn’t. Only by practically testing the limits did we figure out what rate we should limit our queries to.
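One simple way to cap outgoing traffic from asyncio code is to hand each request an evenly spaced time slot. The sketch below is an assumption about how such a limiter could be written, not Rapsheet's actual mechanism; RateLimiter and limited are hypothetical names, and any rate passed in is a placeholder rather than a limit published by a provider.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


class RateLimiter:
    """Space callers evenly so that at most `rate` of them proceed per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self) -> None:
        # The slot is reserved before the first await, so concurrent tasks on
        # the same event loop can never claim the same slot.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def limited(limiter: RateLimiter,
                  query: Callable[..., Awaitable[Any]], *args: Any) -> Any:
    """Wait for a free slot, then run the query."""
    await limiter.wait()
    return await query(*args)
```

Giving each third party service its own RateLimiter instance lets the checks against one provider be throttled without slowing down checks against the others.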

Location, location, location

Almost all of Canary's services run inside AWS. It therefore made sense to move the monitoring out of AWS. We didn’t want to fall victim to a situation where everything looked good to tests inside AWS but, due to AWS issues, our services were silent to the outside world. So we run Rapsheet at a wholly independent provider.

So far, it's proved quite adept at picking up issues in this hosting provider's network...

Wrap-up

The dashboard is projected onto a wall in the operations office, which is the main thoroughfare from our desks to the office facilities, so anyone working in the office has to see it when leaving their desks.


(Of course, it’s online for remote workers who are able to keep it open on their desktops if needed.)

Rapsheet is a work in progress, but helps us make sure that things are ticking along as expected.