HackWeek 2018

Two weeks ago we ran the second edition of our internal HackWeek, and it was fantastic. Last year’s event was great fun and produced projects we still use; going into this year’s HackWeek we anticipated a leveling up, and weren’t disappointed. We figured we’d talk a little bit about the week, and discuss some of the “hacks”.

Our HackWeek parameters are simple: we down tools on all but the most essential work (primarily anything customer-facing) and instead scope and build something. The project absolutely does not have to be work-related, and people can work individually or in teams. The key deadline is a 10-minute demo on the Friday afternoon. The demos are in front of the rest of the team, and results count more than intentions.

Everyone participated and everyone presented at the Friday demo, including sales, dev, support, back office and yours truly. We strive to keep Thinkst a learning organisation and this HackWeek is one way that we do it. For example, it’s great to see a salesperson taking their first steps in writing Python, and our HackWeek helps make that happen. Here’s a roundup of a few of the notable submissions.

Portable Demo Kit
Bradley showed an early diversion into hardware hacking with his jury-rigged demo station. We often demo Canary over WebEx/GoToMeeting, and he decided to spend his HackWeek upgrading the current webcam setup.

He removed a camera from a non-functioning laptop, added some LEDs for lighting, attached both to a single USB cable, and then kept iterating on the packaging until he had a tiny unit that hides in a pocket but sets up for great overhead shots.
It appears to have cost his kids a few toy arrows, but it was totally worth it! Wish him luck getting home-rolled electronics through airport security...

CanaryQL
Az was up next and blew us away with his OSQuery-like hack to make our back-end infrastructure data more queryable in real time. It’s pretty neat: SQLite lets you write plugins to incorporate underlying data sources which look nothing like relational tables. The upshot of this project is that we can run SQL queries which go out and fetch data from our customer consoles using SaltStack, and perform standard operations like filtering and joins.
I’m hoping we write a CanaryQL blog post in good time.

Projection Central
Anna used the week to claim a piece of our downstairs office wall. She started by projecting a simple web page on the wall which showed off our customer tweets, and then gradually iterated the complexity upwards.
Step 2 displayed a cool animated clock, Step 3 showed bird deployments, and Step 4 integrated a websockets-based chat system (allowing people in the office to send messages that would now display on the projector). This is perfect for kicking off long-running jobs that notify people downstairs when done. Part of what made this awesome is that Anna had never touched Python before HackWeek! She summarised her win early on with a John Gall quote I love:
Kinect Resurrection
Jay swapped projects midstream, and eventually went for a hack related to the Kinect. This meant resurrecting and saving an old device before building a facial-recognition-based IDS for the office.
A Better MouseTrap
We have a janky internal system for testing sample SD cards, comprising a series of Raspberry Pis and a terrible-looking breadboard. Marco decided to replace the breadboard rat’s nest with a custom circuit board, built in the office. This meant turning the office into a meth lab, and a lot of fails along the way.

Of course in true Marco fashion, he prevailed, in time and under budget:
Canary-War
Nick and Max teamed up to build a Unity3D-based game they called Canary-War. They designed the characters from scratch in Blender and then built the game mechanics for a multiplayer game, all in a week. Pretty awesome.
Grafana meets IoT
Danielle decided that Grafana dashboards that merely displayed data from IoT devices were too limiting, and hacked a module using MQTT & WebSockets to get bi-directional comms going with her IoT device. Since Grafana is designed to be uni-directional, this took some finagling.
Instapaper for Video
My project was purely to scratch my own itch. I wanted a way to tag video links during the day, and to then have them magically saved on my iPad for later viewing.
I ended up with a Rube Goldberg machine called savemyvid.net. Essentially, it lets me send a video link to an email address; an EC2 server then parses the mail, downloads the video, and adds it to my personal podcast. My iPad subscribes to that podcast and auto-downloads its episodes, so the videos are there even if I’m on a plane with no connectivity.
I’ve extended this to make the system multi-user, so I’ll blog about this one separately too.
Summary
It’s probably enough to say “a fun time was had by all” and end it there, because if we can’t have hacker fun, then what is this all for anyway? But there’s always more. After the presentations, we noted at least the following points on our internal Slack:

(Ed's warning: cut & paste from internal slack)
  • Make sure we always give credit for stuff we use from other people. It breeds a type of academic honesty that’s important and clarifying, and gets us into the habit of more generally giving credit when it’s due.
  • We often talk about “being a learning org” and the HackWeek demos warmed my heart for it. Az said, “last time I missed the mark by doing A, so now I did B”. I also heavily changed from the last HackWeek. (Last year, I planned time for HackWeek and “work happened”, and I barely shipped. This time, work also happened, but I expected it, so I had cleared up personal time heavily and it gave me enough time to ship satisfactorily). Learning (from past mistakes) is what we do.
  • Why bother? Things like a HackWeek come and go and if you don’t stretch for it, there’s actually no perceptible difference to your life. In fact you quickly figure out that life is much easier if you don’t put yourself into stretch-needing situations. The reason for consciously doing it during an artificial one week sprint? Because you’re building those muscles; during a HackWeek, you’re not just building the new tech skills you bumped into, but also meta skills. Skills like knowing when to dive deep or when to walk, how to pick a date, commit and ship. It’s super trite, but ultimately, “we are what we repeatedly do”.

Making NGINX slightly less “surprising”

Dan Geer famously declared that security is “the absence of unmitigatable surprise”. He said it while discussing how dependence is the root source of risk, where increasing system dependencies change the nature of surprises that emanate from composed systems.  Recently, two of our servers “surprised” us due to an unexpected dependence, and we thought this incident was worth talking about. (We also discuss how to mitigate such surprises going forward).

Background:
Every Canary deployment is made up of at least two pieces: Canaries (hardware, VM or Cloud) that report in to the customer’s dedicated console hosted in EC2. We’ve gone to great lengths to make sure that the code and infrastructure we run are secure, and we ensure that any unexpected activity on these servers is raised in the form of an alert.

A few weeks ago, this real-time auditing activity tripped an alert on a development server. Servers are either built as production servers, which have been tested and effectively have a frozen footprint, or as development servers which are used by our developers for testing. The anomalous activity triggered our incident response process, and the response team swung into action.

A quick check on the server showed that it was owned by an automated tool blasting through the AWS address space looking for a bunch of simple misconfigurations and attack vectors. On this dev server, it found a debug interface at the path /console. This debug console is part of the underlying framework and is automatically included whenever the framework’s debug flag is enabled. Importantly, it’s served from deep within the framework, and introspecting the application’s internal routes didn’t show /console.


What a gift! The developer working on this console had enabled debug mode without realizing its full implications, and, given time, the attacker’s script found it. The fault here was ours: we turned on debug mode in Flask. The surprise, though, came from the fact that we never expected our web server to serve up pages we didn’t know about.
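A minimal sketch of the foot-gun, assuming a stock Flask app run with its built-in development server:

    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def index():
        return 'hello'

    if __name__ == '__main__':
        # debug=True turns on the bundled Werkzeug interactive debugger, which
        # also serves an interactive Python console at /console. That path is
        # handled by WSGI middleware rather than the app itself, so it never
        # appears in app.url_map.
        app.run(debug=True)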

The immediate fix was to disable the debug flag and reboot the server, which killed the access. (We subsequently tore down all our dev servers, not because of signs of compromise, but because our tooling makes it trivial to launch new ones.) However, we wanted to examine the pattern a little more closely, to see if we could reduce our unmitigatable surprises. If /console was present, what other surprises might await us, now or in the future as new development happens?

So we started looking at creating a whitelist generator for NGINX, which is the web server we rely on. What we had in mind was a stand-alone tool that would coax NGINX into serving documents only from known paths, with minimal effort and minimal impact on an existing setup.

1) What we tried that didn’t work
NGINX has a module which allows one to embed Lua scripts inside the config file. We explored this (because we really wanted an opportunity to play with Lua) but ultimately rejected it, as Lua support isn’t part of most default NGINX packages. We’d have to build the NGINX package from scratch, which would create an additional operational burden, and so fails our minimal-effort, minimal-impact goals.

We then explored njs, the custom-built NGINX JavaScript module developed and supported by NGINX themselves. It installs on top of existing NGINX setups and is super cool and interesting, but it also turned out to be too limited for our needs. (Essentially, it couldn’t call out to and inspect the Flask setup to learn about valid routes.)

2) What we currently do
tl;dr
    1. Grab nginx_flaskapp_whitelister from our GitHub account;
    2. Run nginx_flaskapp_whitelister to generate a new include.whitelist file;
    3. Include this file in your current NGINX config.

The skinny:
A typical Flask app has a url_map object which holds all of the routes used in the application. Assume we have a Flask app, with defined routes that look something like this:
 ‘/’, ‘/login’, ‘/chat’ and ‘/admin’
Our url_map will look like this:
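A minimal, runnable sketch, assuming stock Flask defaults (the exact repr depends on your Flask/Werkzeug version, and Flask adds a /static/<filename> rule of its own):

    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def index():
        return 'index'

    @app.route('/login')
    def login():
        return 'login'

    @app.route('/chat')
    def chat():
        return 'chat'

    @app.route('/admin')
    def admin():
        return 'admin'

    print(app.url_map)
    # Map([<Rule '/admin' (HEAD, OPTIONS, GET) -> admin>,
    #      <Rule '/chat' (HEAD, OPTIONS, GET) -> chat>,
    #      <Rule '/login' (HEAD, OPTIONS, GET) -> login>,
    #      <Rule '/' (HEAD, OPTIONS, GET) -> index>,
    #      <Rule '/static/<filename>' (HEAD, OPTIONS, GET) -> static>])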


Now NGINX has a concept of whitelisting routes, using what they refer to as “location directives”. A simple pseudo-configuration will look something like this:
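A minimal hand-written sketch of such a configuration (the backend address and paths here are assumptions, chosen just to show the '=' exact-match, '^~' prefix-match and default location forms):

    server {
        listen 80;

        # Exact match: only the literal path /login hits this block.
        location = /login {
            proxy_pass http://127.0.0.1:5000;
        }

        # Prefix match with the '^~' modifier: anything under /static/.
        location ^~ /static/ {
            root /var/www/app;
        }

        # Default catch-all prefix location.
        location / {
            proxy_pass http://127.0.0.1:5000;
        }
    }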


So the basic lookup sequence NGINX uses to decide what to serve for a requested path is as follows:

    1. NGINX first looks for an exact matching location (defined with '=').
    2. Failing that, it takes the longest matching prefix location defined with a modifier such as '^~'.
    3. Otherwise, it falls back to its default lookup method: regular expression matching, and finally the longest matching prefix location.

Thus, by specifying locations using the '=' and '^~' modifiers, you can override the natural behaviour of route lookups.

We make use of this by extracting all the rules/routes defined for our app (by grabbing them from the app’s associated url_map object) and mangling them into a neatly bundled, bite-size chunk for NGINX. We then fetch the current NGINX config describing the '/' route of the running server (most likely the one serving the currently allowed endpoints).
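An illustrative sketch of that extraction step (not the tool’s actual code; my_app is a hypothetical module exposing the Flask app):

    from my_app import app  # hypothetical import path for your Flask application

    def flask_routes(flask_app):
        """Return the static path prefixes registered on the app."""
        paths = set()
        for rule in flask_app.url_map.iter_rules():
            # Keep only the static part of parameterised routes such as /user/<id>.
            paths.add(rule.rule.split('<', 1)[0] or '/')
        return sorted(paths)

    print(flask_routes(app))  # e.g. ['/', '/admin', '/chat', '/login', '/static/']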

We then create a separate include.whitelist file with the following NGINX configuration:
If a requested endpoint is exactly '/', or it is equal to any of the fetched Flask endpoints, pass it to the original NGINX config that was fetched;
OR
for any other requested endpoint, return a 404.
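An illustrative sketch of the shape of such a whitelist (the proxy_pass target is an assumption, and the tool’s actual output differs in detail):

    # include.whitelist
    location = / {
        proxy_pass http://127.0.0.1:5000;   # the original '/' config fetched from the server
    }
    location = /login {
        proxy_pass http://127.0.0.1:5000;
    }
    location = /chat {
        proxy_pass http://127.0.0.1:5000;
    }
    location = /admin {
        proxy_pass http://127.0.0.1:5000;
    }
    # Anything not explicitly whitelisted above is refused.
    location / {
        return 404;
    }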

Including this whitelist file ahead of the existing location directive definitions ensures that the config written by the tool takes priority over possible conflicts, without overriding or ignoring any other configuration oddities in the current setup.

So with a simple git clone and install:
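Something along these lines (assuming a standard Python package layout; check the repository’s README for the exact steps):

    git clone https://github.com/thinkst/nginx_flaskapp_whitelister.git
    cd nginx_flaskapp_whitelister
    pip install .   # assumption: a standard setup.py/pip install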


…and a run of the tool:


…you are sure that even if a dev accidentally enables ‘/console’ again, it won’t be served, as you do not explicitly allow it. Instead you will be served up an NGINX 404 page (or if so defined in your NGINX configuration file, a custom 404 page - like we do!):


Example: to run nginx_flaskapp_whitelister for a Flask application called app, defined in the file /module/flask.py and run from the virtualenv at /path/to/python/virtualenv, you would use the following command:


3) How it fits into our pipeline
We use SaltStack to ease our configuration management, with event-driven orchestration and automation driven by configuration files referred to as ‘recipes’. Including our NGINX Flask app whitelisting tool in our Salt recipes was quite simple: a recipe ensures the tool is present and installed, and the tool is run after the NGINX config file is deployed, but before the NGINX process is started.

4) Where you can find it
The nginx_flaskapp_whitelister tool is available on our GitHub page at https://github.com/thinkst/nginx_flaskapp_whitelister.

Epilogue
As trite as it sounds, nothing beats solid security design and multiple layers of detection. We were able to discover the compromise within minutes because all syscalls on our servers are audited and exceptions are alerted on. The deployed architecture means that a single server compromise doesn’t leak anything to an attacker that would let her target other servers, and she gains no preferential access because of her compromise. Now, with all of our servers running nginx_flaskapp_whitelister, there’s less chance of them surprising us too.

Edit: Also check out https://blog.eutopian.io/elephant-proofing-your-web-servers/ by @nickdothutton