HackWeek 2018

Two weeks ago we ran the second edition of our internal HackWeek, and it was fantastic. Last year’s event was great fun and produced projects we still use; going into this year’s HackWeek we anticipated a leveling up, and weren’t disappointed. We figured we’d talk a little bit about the week, and discuss some of the “hacks”.

Our HackWeek parameters are simple: We downtools on all but the most essential work (primarily anything customer-facing) and instead scope and build something. The project absolutely does not have to be work-related, and people can work individually or in teams. The key deadline is a 10-minute demo on the Friday afternoon. The demos are in front of the rest of the team, and results count more than intentions.

Everyone participated and everyone presented at the Friday demo, including sales, dev, support, back office and yours truly. We strive to keep Thinkst a learning organisation and this HackWeek is one way that we do it. For example, it’s great to see a salesperson taking their first steps in writing Python, and our HackWeek helps make that happen. Here’s a roundup of a few of the notable submissions.

Portable Demo Kit
Bradley showed an early diversion into hardware hacking with his jury-rigged demo station. We often demo Canary over WebEx/GoToMeeting, and he decided to spend his HackWeek upgrading the current webcam setup.

He removed a camera from a non-functioning laptop, added some LED’s for lighting, attached both to a single USB cable, and then kept iterating on packaging until he had a tiny unit that hides in a pocket, but sets up for great overhead shots.
It appears to have cost his kids a few toy arrows, but was totally worth it! Wish him luck getting home-rolled electronics through airport security...

Az was up next and blew us away with his OSQuery-like hack to make our back-end infrastructure data more queryable in real-time. It’s pretty neat, SQLite lets you write plugins to incorporate underlying data sources which look nothing like relational tables. The upshot of this project is that we can run SQL queries which go out and fetch data from our customer consoles using SaltStack, and perform standard actions like filtering and joins.
I’m hoping we write a CanaryQL blog post of in good time. Projection Central Anna used the week to claim a piece of our downstairs office wall. She started by projecting a simple web page on the wall which showed off our customer tweets, and then gradually iterated the complexity upwards.
Step 2 displayed a cool animated clock, Step 3 showed bird deployments, and Step 4 integrated a websockets based chat system (allowing people in the office to send messages that would now display on the projector). This is perfect for kicking off long running jobs that notify people downstairs when done. Part of what made this awesome is the fact that Anna never touched Python before HackWeek! She summarised her win early on with a John Gall quote I love:
Kinect Resurrection
Jay swapped projects midstream, and eventually went for a hack related to the Kinect. This meant resurrecting and saving an old device before creating an office facial recognition based IDS.
A Better MouseTrap We have a janky internal system to test sample SD Cards comprising of a series of Raspberry Pi’s and a terrible-looking breadboard. Marco decided to replace the breadboard rat’s nest with a custom circuit board, built in the office. This meant turning the office into a meth-lab and a lot of fails.

Of course in true Marco fashion, he prevailed, in time and under budget:
Canary-War Nick and Max teamed up to build a Unity3d based game they called Canary-War. They designed the characters from scratch in Blender and then built the game mechanics for a multiplayer game in a week. Pretty awesome..
Grafana meets IoT Danielle decided that Grafana dashboards that merely displayed data from IoT devices were too limiting, and hacked a module using MQTT & WebSockets to get bi-directional comms going with her IoT device. Since Grafana is designed to be uni-directional, this took some finagling.
Instapaper for Video My project was purely to scratch my own itch. I wanted a way to tag video links during the day, and to then have them magically saved on my iPad for later viewing.
I ended up with a Rube Goldberg machine called savemyvid.net. Essentially this lets me send a video link to an email address, which is then parsed by an EC2 server which downloads the video and adds it to my personal podcast. My iPad then subscribes and auto-downloads episodes for that podcast so the videos are there even if I’m on a plane with no connectivity.
I’ve extended this to make the system multi-user, so I’ll blog about this one separately too.
It’s probably enough to say “a fun time was had by all” and end it there, because if we can’t have hacker fun, then what is this all for any way? But there’s always more. Post the presentations, we noted at least the following points on our internal Slack:

(Ed's warning: cut & paste from internal slack)
  • Make sure we always give credit for stuff we use from other people. It breeds a type of academic honesty that’s important and clarifying, and gets us into the habit of more generally giving credit when it’s due.
  • We often talk about “being a learning org” and the HackWeek demos warmed my heart for it. Az said, “last time I missed the mark by doing A, so now I did B”. I also heavily changed from the last HackWeek. (Last year, I planned time for HackWeek and “work happened”, and I barely shipped. This time, work also happened, but I expected it, so I had cleared up personal time heavily and it gave me enough time to ship satisfactorily). Learning (from past mistakes) is what we do.
  • Why bother? Things like a HackWeek come and go and if you don’t stretch for it, there’s actually no perceptible difference to your life. In fact you quickly figure out that life is much easier if you don’t put yourself into stretch-needing situations. The reason for consciously doing it during an artificial one week sprint? Because you’re building those muscles; during a HackWeek, you’re not just building the new tech skills you bumped into, but also meta skills. Skills like knowing when to dive deep or when to walk, how to pick a date, commit and ship. It’s super trite, but ultimately, “we are what we repeatedly do”.

Making NGINX slightly less “surprising”

Dan Geer famously declared that security is “the absence of unmitigatable surprise”. He said it while discussing how dependence is the root source of risk, where increasing system dependencies change the nature of surprises that emanate from composed systems.  Recently, two of our servers “surprised” us due to an unexpected dependence, and we thought this incident was worth talking about. (We also discuss how to mitigate such surprises going forward).

Every Canary deployment is made up of at least two pieces. Canaries (hardware, VM or Cloud) that then report in to the customer’s dedicated console hosted in EC2. We’ve gone to great lengths to make sure that the code and infrastructure we run is secure, and we ensure that any activity on these servers that isn’t expected, is raised in the form of an alert.

A few weeks ago, this real-time auditing activity tripped an alert on a development server. Servers are either built as production servers, which have been tested and effectively have a frozen footprint, or as development servers which are used by our developers for testing. The anomalous activity triggered our incident response process, and the response team swung into action.

A quick check on the server showed that it was owned by an automated tool blasting through the AWS address space looking for a bunch of simple misconfigs and attack vectors. On this dev server, it found a debug interface at the path  /console. This debug console was foundational to the framework and was automatically included whenever the framework’s debug flag was enabled. Importantly, it’s served from deep within the framework, and introspecting the application’s internal routes didn’t show /console.

What a gift! The developer working on this console enabled debug mode without realizing its full implications, and the attacker’s script found it in time. The fault here was ours, we turned on debug mode in Flask, but our surprise came from the fact that we never expected our webserver to serve up pages we didn’t know about.

The immediate fix was to disable the debug flag and reboot the server, which killed the access. (We subsequently tore down all our dev servers, not because of signs of compromise, but because our tooling makes it trivial to launch new ones.) However we wanted to examine the pattern a little more closely, to see if we could reduce our unmitigatable surprises. If /console was present, what other surprises await us now, or in the future with new developments happening?

So we started looking at creating an whitelist generator for NGINX, which is the web server we rely on. What we had in mind was a stand-alone tool that would coax NGINX to only serve documents from known paths, with minimal effort and minimal impact to an existing setup.

1) What we tried that didn’t work
NGINX has a module which allows one to embed Lua scripts inside the config file. We explored this (because we really wanted an opportunity to play with Lua) but ultimately rejected it as Lua support isn’t part of most default NGINX packages. We’d have to build the NGINX package from scratch, which would create an additional operational burden, and so fails our minimal effort and impact goals. We then explored the custom-built NGINX Javascript module referred to as njs, which is a scripting module developed and supported by NGINX themselves. It installs on top of existing NGINX setups and is super cool and interesting, but also turned out to be too limited for our needs. (It essentially prevented us from being able to call out and inspect the Flask setup to learn about valid routes).

2) What we currently do
    1. Grab nginx_flaskapp_whitelister from our Github account;
    2. Run nginx_flaskapp_whitelister to generate a new include.whitelist file
    3. Include this file into your current NGINX config.

The skinny:
A typical Flask app has a url_map object which holds all of the routes used in the application. Assume we have a Flask app, with defined routes that look something like this:
 ‘/‘, ‘/login’, ‘/chat’ and ‘/admin’
Our url_map will look like this:

Now NGINX has a concept of whitelisting routes, using what they refer to as “location directives”. A simple pseudo-configuration will look something like this:

So the basic lookup sequence of NGINX, to determine what to serve up for a requested path is as follows:

    1. NGINX looks for exact matching routes (defined with ‘=‘), 
    2. It turns to the longest matching routes defined with prefixes (modifiers such as ‘^~’) 
    3. It will turn to its default lookup method as regular expression matches to the longest matching route. 

Thus by specifying locations using the ’=‘ and ‘^~’ modifiers, you are able to override the natural behaviour of route lookups.

We make use of this by extracting all rules/routes defined for our app (by grabbing it from the apps associated url_map object) and mangle this into a neatly bundled, bite-size chunk for NGINX. We then fetch the current NGINX config describing the '/' route of the running server (most likely serving the allowed current endpoints).

We then create a separate include.whitelist file with the following NGINX configuration:
If an endpoint is exactly equal to '/' or it is equal to any of the fetched Flask endpoints, pass it to the original NGINX config that was fetched,
For any other endpoints being requested return a 404. 

Including this whitelist file ahead of the location directive definitions ensures that the config written by the tool will take priority over possible conflicts without overriding or ignoring additional existing configuration oddities in the current setup.

So with a simple git clone and install:

…and a run of the tool:

…you are sure that even if a dev accidentally enables ‘/console’ again, it won’t be served, as you do not explicitly allow it. Instead you will be served up an NGINX 404 page (or if so defined in your NGINX configuration file, a custom 404 page - like we do!):

Example: for implementing the nginx_flaskapp_whitelister for your Flask application called app, that is defined in the file /module/flask.py and run from /path/to/python/virtualenv; you would run the following command:

3) How it fits into our pipeline
We use SaltStack for easing our configuration management, with their event-driven orchestration and automation through the use of configuration files referred to as ‘recipes’. Including our NGINX Flask App Whitelisting tool into our Salt recipes was quite simple: a recipe ensures the tool is present and installed, and the tool is run after the NGINX config file is deployed, but before the NGINX process is started.

4) Where you can find it
The nginx_flaskapp_whitelister tool is available on our Github page at https://github.com/thinkst/nginx_flaskapp_whitelister.

As trite as it sounds, nothing beats solid security design and multiple layers of detection. We were able to discover the compromise within minutes because all SYSCALLS on our servers are audited and exceptions are alerted on. The deployed architecture means that a single server compromise doesn’t leak anything to an attacker to let her target other servers and she has no preferential access because of her compromise. Now, with all of our servers running nginx_flaskapp_whitelister, there’s less chance of them surprising us too.

Edit: Also check out https://blog.eutopian.io/elephant-proofing-your-web-servers/ by @nickdothutton

Good Pain vs. Bad Pain

aka: You know it’s supposed to hurt, you just don’t know which kind of hurt is the good kind

One of the common problems when people start lifting weights (or doing CrossFit) is that they inadvertently overdo it. Why don’t they stop when it hurts? Because everyone knows it’s supposed to hurt. Hypertrophy is the goal, so the pain is part of the deal... right?

Pain, Guaranteed
In an old interview on the rise of Twitter, Ev Williams said something really interesting: in pursuit of the fabled startup we’ve gotten so used to praising the entrepreneurial struggle, and so often repeat the myth of the starving entrepreneur, that people tolerate the pain of a bad/unviable idea longer than they should. He said that seeing Twitter go viral, made it clearer how Odeo hadn’t.
When Twitter took off, he just about said: “So this is what traction feels like.”
This is an interesting problem. The cult of entrepreneurship is strong and there’s no shortage of glib one-liners pumping people up to worship the grind. In putting yourself through it, you could be building the ultimate beach body, or you could be setting yourself up for a lifetime of chronic back pain. So how can we tell the difference? How do you know if this is the necessary pain that all young companies endure or if you are actually giving yourself a digital hernia?

An obvious answer is customer feedback, but this isn’t that simple. I’ll discuss why, through the lens of two different products we built: Phish5 and Canary.

We built Phish5 in 2012 with the logic that network admins and security teams would be able to sign up, pay a few hundred dollars and run high-quality phishing campaigns against their own companies. It worked well and over the years had reasonable success. (By this I mean that it found customers all over the world, and it made a few hundred thousand dollars while costing a fraction of that to run). Over time, self-phishing became a bit of a cottage industry as more and more players entered the market. We still had some multinational customers using it so we kept the lights on, but didn’t invest too heavily in it..

In 2015 we released our Thinkst Canary. High-quality honeypots that would deploy in minutes and require almost zero management overhead. It took less than a month to realize that Canary was going to be different. Our early Phish5 sales were nearly always to people we knew (from our previous lives) while Canary almost instantly found customers we never knew in verticals we would never have explored.

Phish5 customers used the service until their license expired, and then (maybe) signed up for a 2nd round. Canary customers typically  add more Canaries to their networks part way through their subscription, upselling themselves in the process.
Most of all though, while we had Phish5 customers who told us they liked Phish5, Canary customers oozed “love”.
Paul Graham famously suggests that winning with a startup begins by “making something people love”. Part of the problem with this, is that like many tweeners, you can’t tell if it’s love when you’ve never been in love before.

We had users recommend Phish5 to their friends and it was even featured in an article or two. But it was only with Canary that we went: “Ooooh.. that’s what love feels like”. Email feedback was tangibly “gushy” and we’d increasingly hear Canary mentioned lovingly in security podcasts. The unsolicited feedback on Twitter was beautiful (and was much more than we could have asked for!)

We have our roots in the security research community and we work hard to push the boundaries with our products. Nothing shows love to a researcher, like other researchers citing (and building on) your work. Phish5 appeared in a few news pieces over the course of 5 years but the Canary family almost instantly slid into other people’s slide decks.

From short introductory vids to industry legends like CarnalOwnage discussing Canary usage in his day-job; from rock stars like Collin Mulliner stretching CanaryTokens for RE Detection, to smart folks like Mike Ruth talking about deploying Canaries at Scale. People (other than us) delivered talks and papers around our birds.. Nothing close happened with Phish5 (and to be honest, we didn’t know it happened ever).

The love delta is obviously reflected in our numbers too: With Canaries deployed on all 7 continents, more than 95% of our Canary sales are still inbound & word of mouth referrals. We’re not saying “We’ve won” or that Canary’s success is a fait accompli. But we do know that it’s on a radically different trajectory to anything we built before, and we wouldn’t have gotten here if we kept “grinding away” at Phish5.

Determination and focus are great, but make sure that your doggedness doesn’t stop you from ditching your Odeo to build your Twitter¹ .

¹ That wasn’t us.. That was Ev..  Check back with us in 5 years to see how it worked out for us

They see me rolling (back)

Moving backward is a feature too!

We go through a lot of pain to make sure that Canary deployments are quick and painless. It’s worth remembering that even though the deployment happened in minutes, a bunch of stuff has happened in the background. (Your bird created a crypto key-pair, exchanged the public key with your console, and registered itself as one of your birds).

From that point on, all communication between your bird and your console is encrypted (with a per-device key) and goes out via valid DNS requests. This makes sure that deployments are quick and simple, even on complex networks.

Once your bird is successfully deployed, it’s completely configurable via your Canary Console.
So with a few clicks, a user is able to change a deployed Canary from a Cisco Router, to a Windows Server

However mistakes happen and, as anyone who has remotely configured network interfaces over SSH can attest, remote network changes aren’t kind to missteps. How does your Canary react if you configure it with broken network settings? Your console will already warn you of certain obvious network misconfigurations that fail sanity checks. 

But what if someone enters settings that pass sanity checks but are simply wrong for the network in question? (For example, providing static IP settings which are incorrect for your Canary’s network location.)
Previously, this would simply mean that the Canary would apply the new settings, and promptly lose connectivity with the console as the IP settings aren’t valid. The fact that it could no longer reach the console, would mean that it couldn’t be “fixed” from the console, and some poor admin would need to trudge on over to the device to reconfigure it.

This sucks, so we introduced “Network Rollback”. If a customer applies a config that prevents the Canary from getting back to the Console, the Canary figures this out, and rolls back to its last known working settings.

Of course, we then give you a quick notification that “something bad happened”. This incident can also be sent via your regular notification channels, such as email, text message, syslog or Slack.

When you configure a bird which has recently been rolled back, you’ll get a warning too, so you know the settings have been rolled back.

We try as hard as possible to make sure people do the right thing by default, but when you get it wrong, we will try to give you a mulligan.

Some OpenCanary Updates

As a company, we are pretty huge fans of Open Source software. We use FLOSS extensively in our production stack and we make sure to give back where we can. One of the ways we do this, is by making our Canarytokens & OpenCanary projects open source and free to download.

People needing Canarytokens can use the free hosted instance we run at Canarytokens.org, or they are free to download the docker images to run on their own networks. Literally hundreds of thousands of tokens have been generated online and the docker images have been pretty widely deployed too.

Our paid Canary customers get their own hosted tokens server, the ability to trivially customize it, as well as some tokens that have not been ported over to the free-server yet.

The relationship between OpenCanary(left) and Canary(right) is less clear.

Marco and I recently spoke at LinuxConfZa where we discussed this. In the buildup to the talk, we added some new features to OpenCanary (mostly by backporting features from our paid Canary service) which we’d like to show you below:

What’s new


We have added to the capability of OpenCanary’s portscan module. We now have detection for specific nmap portscans: nmap FIN, NULL, XMAS and OS scans. This means that OpenCanary has multiple event types for a portscan which would indicate which type of scan has been targeted at your OpenCanary sensor.

We make this happen in the background by adding some iptables rules. These new rules match specific packets based on what we can expect from different nmap scans. This then creates specific logs with chosen prefixes giving us a number of new, defined logtypes.

Using our usual portscan mechanism of monitoring this log file, we can start to alert based on the log’s prefix. If the log’s prefix is `nmapNull`, we know that the log was generated by iptables’ match rule associated with nmap’s NULL scan. And this process is repeated for all newly supported portscan dectections. Below is an example of an FIN scan by nmap,

Followed by what we would expect to see being logged by our OpenCanary.

As you can see, logtype 5005 which we already saw is tied to a nmap FIN scan event.

New Services

We have also added three new services that your OpenCanary can now emulate.

The first service is Redis. Redis “is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker”. Our Redis service fakes a full fledged Redis server - alerting on connect and on any subsequent commands sent to your OpenCanary.

The above image shows that the Redis service mimics the behaviour closely by returning errors for Redis commands that have not been given the correct number of arguments or our Redis server will return an authentication required error for correctly crafted Redis commands.

And here we can see the logs generated by our previous Redis-related indiscretions: the command we tried along with their additional arguments.

The second service is Git. Git is a version-control system used (mainly) for code repositories. Hence you can run a Git server that would allow for collaboration across organisations. Our Git service fakes a Git server - alerting on `clone` git commands which means someone is trying to get that repository of code.

Here we have tried to clone a repository by attempting to connect to the git service being run by OpenCanary. The interaction mimics that of the actual Git service and errors out in an expected way.

And in our OpenCanary logs, we will get the repository name that the attacker was looking for. This is helpful in understanding where your leak may be.

The last new service is actually a potential factory of hundreds of TCP services. It’s a service to build generic TCP banners. This allows you to create a TCP listener on an arbitrary port. You can select a banner to greet your client on connect, and can control subsequent responses sent to the client. This allows you to quickly model plaintext protocols.

In this example, we have set up our TCP banner service to mimic an SMTP server. We connect to the service and receive the first banner. When the client sends through more data to interrogate the service, she receives our data received banner.

We then alert and log on the full interaction creating a great audit trail for investigation.

General Cleanup

We have also heard the cries (we are sorry!) and have touched up some configuration documentation on opencanary.org along with our brand new OpenCanary logo.

Setting up an OpenCanary instance should be as simple as following the instructions over here.

Upgrading is dead simple too. Simply run `pip install opencanary --upgrade`. It will upgrade the version of your OpenCanary, but keep your old config. You can check out the new config file options over here.

Take it for a spin, you won’t be disappointed.

(We love Pull Requests and have been rumoured to respond with 0day Canary swag).

Ps. If you want honeypots that deploy in minutes, with 0 admin overhead, you really should check out our Thinkst Canary. It’s one of the most loved devices in infosec, and you will see why..

(Better) Canary Alerts in Slack

One of the things that surprise new Canary customers, is that we don't try particularly hard to keep customers looking at their consoles. (In fact, an early design goal for Canary was to make sure that our users didn't spend much time using our console at all).

We make sure that the console is pretty, and is functional but we aren't trying to become a customer's "one pane of glass". We want the Canaries deployed and then strive to get out of your way. You decide where your alerts should go (email, SMS, API, webhooks, Syslog, SIEM app), set up your birds, and then don't visit your console again until a Canary chirps..

We have hundreds of customers who never login to their consoles after the initial setup, and we're perfectly happy with this. Their alerts go to their destination of choice and that's what matters. Of these, dozens and dozens of customers rely heavily on getting their alerts piped into a Slack channel of their choice.

Getting your alerts into Slack is trivial:

  1. Create a channel in Slack
  2. Go to Setup, Webhooks, and select "Add Slack XXX"
  3. Select the channel you want your alerts to go to;
  4. (Thats it! Your Slack integration is done!)

Until recently, alerts that went into Slack were simple one way traffic, containing incident details.

While this suffices for most users, recently, Max and Jay sat down to make this even better. Alerts into Slack now look like this:

You'll notice that, by default, potential sensitive fields like passwords are now masked in Slack. This can be toggled on your Settings page. We're also including additional historical context to assist your responders.

Best of all though, you can now manage these alerts (Mark as seen and Delete) from right inside Slack, so you never have to login to your Console.

Once an event has been acknowledged, the incident details will be visually "struck", and a new field will indicate the name of the person who ack'd it.

Clicking "Delete" will then collapse the now superfluous details, and will track the name of the deleting user.

So.. if your security team is using Slack, consider using the integration. It will take just seconds to set up, and should make your life a little easier.

A Week with Saumil (aka "The ARM Exploit Laboratory")

Last month we downed tools for a week as we hosted a private, on-site version of the well regarded “ARM Exploit Laboratory” (by Saumil Shah). The class is billed as “a practical hands-on approach to exploit development on ARM based systems” and Saumil is world respected, delivering versions of the class at conferences like 44con, Recon and Blackhat for years.


With a quick refresher on ARM assembly and system programming on day-1, by day-2 everyone in the class was fairly comfortable writing their own shellcode on ARM. By the end of day-3 everyone was comfortable converting their payloads to ROP gadgets and by day-4 everybody had obtained reverse shells on emulated systems and actual vulnerable routers and IP-Cameras. Without any false modesty, this is due to Saumil's skill as an educator much more than anything else.

Pre-Class Preparation

While our Canary is used by security teams the world over, many people in the team have backgrounds in development (not security) so we felt we had some catching up to do. A few months before the class, we formed an #arm-pit slack channel and started going through the excellent Azeria Labs chapters and challenges. (It’s worth noting that Saumil's class managed to work for people on the team that were not taking part in our weekly #arm-pit sessions, but those of us who did the sessions were glad that we did anyway).

A special shout out to @anna who didn’t actually attend the ARM exploitation sessions but made sure that everything from food and drinks, to conference room and accomodation were all sorted. An echo that great preparation made for a great experience. Thank you @anna.

The Class

We’ve all sat in classes where the instructor raced ahead and knowledge that we thought we had proved to be poorly understood when we needed to apply it. As the course progressed, each new concept was challenged with practical exercises. Each concept needed to be understood as the following concepts (and exercises) would largely build on the prior knowledge. And in this fashion, we quickly weeded out gaps in our knowledge because practically we could not apply something we
did not understand.

The addition of shellcode-restrictions (and processor mitigations) tested a particular way of thinking which seemed to come more naturally to those of us with a history of “breaking” versus “building”. The breakers learned some new tricks, but the builders learned some completely new ways of thinking. It was illuminating.

The class was choc-full of other little gems, from methodologies for debugging under uncertainty to even just how slickly Saumil shares his live thoughts with his students via a class webserver (that converts his ascii-art memory layout diagrams to prettier SVG versions in real time)

It’s the mark of an experienced educator who has spotted the areas that students have struggled with, and has now built examples and tooling to help overcome them. We didn’t just learn ARM exploitation from the class, it was a master class in professionalism and how to educate others.

Where to now?

A bunch of people now have a gleam in their eyes and have started looking at their routers and IOT devices with relish. Everyone has a much deeper understanding of memory corruption attacks and the current state of mitigation techniques

Why we took the course?

Team Thinkst is quite a diverse bunch and exploitation isn’t part of anyone’s day job. We do however, place a huge emphasis on learning, and the opportunity to dedicate some time to bare metal, syscalls and shellcode was too good to pass up. We’ve taken group courses before, but this is the first time we’ve felt compelled to write it up. Two thumbs up! Will strongly recommend.

Using the Linux Audit System to detect badness

Security vendors have a mediocre track record in keeping their own applications and infrastructure safe. As a security product company, we need to make sure that we don’t get compromised. But we also need to plan for the horrible event that a customer console is compromised, at which point the goal is to quickly detect the breach. This post talks about how we use Linux's Audit System (LAS) along with ELK (Elasticsearch, Logstash, and Kibana) to help us achieve this goal.


Every Canary customer has multiple Canaries on their network (physical, virtual, cloud) that reports in to their console which is hosted in AWS.

Consoles are single tenant, hardened instances that live in an AWS region. This architecture choice means that a single customer console being compromised, won’t translate to a compromise of other customer consoles. (In fact, customers would not trivially even discover other customer consoles, but that's irrelevant for this post.)

Hundreds of consoles running the same stack affords us an ideal opportunity to perform fine grained compromise detection in our fleet. Going into the project, we surmised that a bunch of servers doing the same thing with similar configs should mean we can detect and alert on deviations with low noise.

A blog post and tool by Slack's Ryan Huber pointed us in the direction of the Linux Audit System. (If you haven’t yet read Ryan's post, you should.)

LAS has been a part of the Linux kernel since at least 2.6.12. The easiest way to describe it is as an interface through which all syscalls can be monitored. You provide the kernel with rules for the things you’re interested in, and it pushes back events every time something happens which matches your rules. The audit subsystem itself is baked into the kernel, but the userspace tools to work with it come in various flavours, most notably the official “auditd” tools, “go-audit” (from Slack) and Auditbeat (from Elasticsearch).

Despite our love for Ryan/Slack, we went with Auditbeat mainly because it played so nicely with our existing Elasticsearch deployment. It meant we didn't need to bridge syslog or logfile to Elastic, but could read from the audit Netlink socket and send directly to Elastic.

From Audit to ELK

Our whole set-up is quite straightforward. In the diagram below, let's assume we run consoles in two AWS regions, US-East-1 and EU-West-2.

We run:
  • Auditbeat on every console to collect audit data and ship it off to Logstash;
  • A Logstash instance in each AWS region to consolidate events from all consoles and ship them off to Elasticsearch;
  • Elasticsearch for storage and querying;
  • Kibana for viewing the data;
  • ElastAlert (Yelp) to periodically run queries against our data and generate alerts;
  • Custom Python scriptlets to produce results that can't be expressed in search queries alone.

So, what does this give us?

A really simple one is to know whenever an authentication failure occurs on any of these servers. We know that the event will be linked to PAM (the subsystem Linux uses for most user authentication operations) and we know that the result will be a failure. So, we can create a rule which looks something like this:

auditd.result:fail AND auditd.data.op:PAM*

What happens here then, is:
  1. Attacker attempts to authenticate to an instance;
  2. This failure matches an audit rule, is caught by the kernel's audit subsystem and is pushed via Netlink socket to Auditbeat;
  3. Auditbeat immediately pushes the event to our logstash aggregator;
  4. Logstash performs basic filtering and pushes this into Elasticsearch (where we can view it via Kibana);
  5. ElastAlert runs every 10 seconds and generates our alerts (Slack/Email/SMS) to let us know something bad(™) happened.

Let's see what happens when an attacker lands on one of the servers, and attempts to create a listener (because it’s 1999 and she is trying a bindshell).
In 10 seconds or less we get this:

which expands to this:
From here, either we expect the activity and dismiss it, or we can go to Kibana and check what activity took place.

Filtering at the Elasticsearch/ElastAlert levels gives us several advantages. As Ryan pointed out), keeping as few rules / filters on the actual hosts, leaves a successful attacker in the dark in terms of what we are looking for.

Unknown unknowns

ElastAlert also gives us the possibility of using more complex rules, like “new term”.

This allows us to trivially alert when a console makes a connection to a server we’ve never contacted before, or if a console executes a process which it normally wouldn’t.

Running auditbeat on these consoles also gives us the opportunity to monitor file integrity. While standard audit rules allow you to watch reads, writes and attribute changes on specific files, Auditbeat also provides a file integrity module which makes this a little easier by allowing you to specify entire directories (recursively if you wish).

This gives us timeous alerts the moment any sensitive files or directories are modified.

Going past ordinary alerts

Finally, for alerts which require computation that can't be expressed in search queries alone we use Python scripts. For example, we implemented a script which queries the Elasticsearch API to obtain a list of hosts which have sent data in the last n-minutes. By maintaining state between runs, we can tell which consoles have stopped sending audit data (either because the console experienced an interruption or because Auditbeat was stopped by an attacker.) Elasticsearch provides a really simple REST API as well as some powerful aggregation features which makes working with the data super simple.


Our setup was fairly painless to get up and running, and we centrally manage and configure all the components via SaltStack. This also means that rules and configuration live in our regular configuration repo and and that administration overhead is low.

ELK is a bit of a beast and the flow from hundreds of Auditbeat instances means that one can easily get lost in endless months of tweaking and optimizing. Indeed, if diskspace is a problem, you might have to start this tweaking sooner rather than later, but we optimized instead for “shipping”. After a brief period to tweak the filters for obvious false positives, we pushed into production and our technical team pick up the audit/Slack alerts as part of our regular monitoring.

Wrapping up

It’s a straightforward setup, and it does what it says on the tin (just like Canary!). Combined with our other defenses, the Linux Audit System helps us sleep a little more soundly at night. I'm happy to say that so far we've never had an interrupted night's sleep!