Good Pain vs. Bad Pain

aka: You know it’s supposed to hurt, you just don’t know which kind of hurt is the good kind

One of the common problems when people start lifting weights (or doing CrossFit) is that they inadvertently overdo it. Why don’t they stop when it hurts? Because everyone knows it’s supposed to hurt. Hypertrophy is the goal, so the pain is part of the deal... right?

Pain, Guaranteed
In an old interview on the rise of Twitter, Ev Williams said something really interesting: in pursuit of the fabled startup, we've gotten so used to praising the entrepreneurial struggle, and so often repeat the myth of the starving entrepreneur, that people tolerate the pain of a bad/unviable idea longer than they should. He said that seeing Twitter go viral made it clear how Odeo never had.
When Twitter took off, he just about said: “So this is what traction feels like.”
This is an interesting problem. The cult of entrepreneurship is strong and there’s no shortage of glib one-liners pumping people up to worship the grind. In putting yourself through it, you could be building the ultimate beach body, or you could be setting yourself up for a lifetime of chronic back pain. So how can we tell the difference? How do you know if this is the necessary pain that all young companies endure or if you are actually giving yourself a digital hernia?

An obvious answer is customer feedback, but it isn't that simple. I'll discuss why through the lens of two different products we built: Phish5 and Canary.

We built Phish5 in 2012 with the logic that network admins and security teams would be able to sign up, pay a few hundred dollars and run high-quality phishing campaigns against their own companies. It worked well and over the years had reasonable success. (By this I mean that it found customers all over the world, and it made a few hundred thousand dollars while costing a fraction of that to run). Over time, self-phishing became a bit of a cottage industry as more and more players entered the market. We still had some multinational customers using it so we kept the lights on, but didn't invest too heavily in it.

In 2015 we released our Thinkst Canary: high-quality honeypots that deploy in minutes and require almost zero management overhead. It took less than a month to realize that Canary was going to be different. Our early Phish5 sales were nearly always to people we knew (from our previous lives), while Canary almost instantly found customers we'd never met, in verticals we would never have explored.

Phish5 customers used the service until their license expired, and then (maybe) signed up for a second round. Canary customers typically add more Canaries to their networks partway through their subscription, upselling themselves in the process.
Most of all though, while we had Phish5 customers who told us they liked Phish5, Canary customers oozed "love".
Paul Graham famously suggests that winning with a startup begins by "making something people love". Part of the problem with this is that, like many tweeners, you can't tell if it's love when you've never been in love before.

We had users recommend Phish5 to their friends and it was even featured in an article or two. But it was only with Canary that we went: “Ooooh.. that’s what love feels like”. Email feedback was tangibly “gushy” and we’d increasingly hear Canary mentioned lovingly in security podcasts. The unsolicited feedback on Twitter was beautiful (and was much more than we could have asked for!)

https://canary.tools/love
We have our roots in the security research community and we work hard to push the boundaries with our products. Nothing shows love to a researcher like other researchers citing (and building on) your work. Phish5 appeared in a few news pieces over the course of 5 years but the Canary family almost instantly slid into other people's slide decks.

From short introductory vids to industry legends like CarnalOwnage discussing Canary usage in his day-job; from rock stars like Collin Mulliner stretching CanaryTokens for RE Detection, to smart folks like Mike Ruth talking about deploying Canaries at Scale. People (other than us) delivered talks and papers around our birds. Nothing close ever happened with Phish5 (and if it did, we never heard about it).

The love delta is obviously reflected in our numbers too: With Canaries deployed on all 7 continents, more than 95% of our Canary sales are still inbound & word of mouth referrals. We’re not saying “We’ve won” or that Canary’s success is a fait accompli. But we do know that it’s on a radically different trajectory to anything we built before, and we wouldn’t have gotten here if we kept “grinding away” at Phish5.

Determination and focus are great, but make sure that your doggedness doesn't stop you from ditching your Odeo to build your Twitter¹.

__
¹ That wasn't us. That was Ev. Check back with us in 5 years to see how it worked out for us.

They see me rolling (back)

Moving backward is a feature too!

We go through a lot of pain to make sure that Canary deployments are quick and painless. It’s worth remembering that even though the deployment happened in minutes, a bunch of stuff has happened in the background. (Your bird created a crypto key-pair, exchanged the public key with your console, and registered itself as one of your birds).

From that point on, all communication between your bird and your console is encrypted (with a per-device key) and goes out via valid DNS requests. This makes sure that deployments are quick and simple, even on complex networks.
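To give a feel for the DNS part (this is not our actual wire format; consider it a toy illustration of how data can ride on legitimate-looking DNS lookups), data can be encoded into the labels of queries for names under a domain the console controls:

import base64

def payload_to_query_names(payload: bytes, domain: str = "console.example.com", chunk: int = 40):
    # Base32 keeps the labels DNS-safe (letters and digits only).
    encoded = base64.b32encode(payload).decode().rstrip("=").lower()
    labels = [encoded[i:i + chunk] for i in range(0, len(encoded), chunk)]
    # Each chunk rides as a label in an ordinary DNS lookup under the console's domain.
    # A real implementation also needs sequencing, and the payload itself would already
    # be encrypted with the per-device key before it ever touches the resolver.
    return [f"{seq}.{label}.{domain}" for seq, label in enumerate(labels)]

print(payload_to_query_names(b"bird-heartbeat"))

Because the queries themselves are perfectly ordinary DNS lookups, they traverse almost any network that can resolve names, which is what keeps deployments simple.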

Once your bird is successfully deployed, it’s completely configurable via your Canary Console.
So with a few clicks, a user is able to change a deployed Canary from a Cisco router to a Windows server.



However, mistakes happen and, as anyone who has remotely configured network interfaces over SSH can attest, remote network changes aren't kind to missteps. How does your Canary react if you configure it with broken network settings? Your console will already warn you of certain obvious network misconfigurations that fail sanity checks.


But what if someone enters settings that pass sanity checks but are simply wrong for the network in question? (For example, providing static IP settings which are incorrect for your Canary’s network location.)
Previously, this would simply mean that the Canary would apply the new settings and promptly lose connectivity with the console, as the IP settings weren't valid. The fact that it could no longer reach the console meant that it couldn't be "fixed" from the console, and some poor admin would need to trudge on over to the device to reconfigure it.


This sucks, so we introduced “Network Rollback”. If a customer applies a config that prevents the Canary from getting back to the Console, the Canary figures this out, and rolls back to its last known working settings.
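In rough Python, the idea looks something like the sketch below. (This is purely illustrative: the function names, the settling period and the structure are assumptions for this post, not the actual device code.)

import time

SETTLE_SECONDS = 120  # illustrative: how long to wait for the console before giving up

def apply_network_settings(new_settings, last_known_good,
                           configure_interface, console_reachable):
    # `configure_interface` and `console_reachable` are hypothetical callables
    # standing in for the device-specific plumbing.
    configure_interface(new_settings)

    deadline = time.time() + SETTLE_SECONDS
    while time.time() < deadline:
        if console_reachable():
            return True            # the new settings work; keep them
        time.sleep(5)

    # The console never answered: roll back to the last known working settings.
    configure_interface(last_known_good)
    return False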


Of course, we then give you a quick notification that “something bad happened”. This incident can also be sent via your regular notification channels, such as email, text message, syslog or Slack.


When you configure a bird which has recently been rolled back, you’ll get a warning too, so you know the settings have been rolled back.




We try as hard as possible to make sure people do the right thing by default, but when you get it wrong, we will try to give you a mulligan.

Some OpenCanary Updates


As a company, we are pretty huge fans of Open Source software. We use FLOSS extensively in our production stack and we make sure to give back where we can. One of the ways we do this is by making our Canarytokens & OpenCanary projects open source and free to download.

People needing Canarytokens can use the free hosted instance we run at Canarytokens.org, or they are free to download the docker images to run on their own networks. Literally hundreds of thousands of tokens have been generated online and the docker images have been pretty widely deployed too.


Our paid Canary customers get their own hosted tokens server, the ability to trivially customize it, as well as some tokens that have not been ported over to the free-server yet.

The relationship between OpenCanary (left) and Canary (right) is less clear.


Marco and I recently spoke at LinuxConfZa where we discussed this. In the buildup to the talk, we added some new features to OpenCanary (mostly by backporting features from our paid Canary service) which we’d like to show you below:

What’s new

Portscans

We have added to the capabilities of OpenCanary's portscan module. We now detect specific nmap portscans: nmap FIN, NULL, XMAS and OS scans. This means that OpenCanary has multiple portscan event types, each indicating which type of scan was aimed at your OpenCanary sensor.

We make this happen in the background by adding some iptables rules. These rules match packets with the characteristics we expect from the different nmap scans, and log them with chosen prefixes, giving us a number of new, well-defined logtypes.
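For a feel of the mechanism (the exact rules and numbers OpenCanary installs may differ), a NULL scan can be matched and logged with an iptables LOG rule, and the resulting prefix can then be mapped to a dedicated logtype. The rule and most of the logtype numbers below are illustrative; 5005 for a FIN scan is the value shown further down.

import subprocess

# Illustrative iptables rule: TCP packets with no flags set (an nmap NULL scan)
# get logged with a prefix that the portscan module can recognise later.
NULL_SCAN_RULE = [
    "iptables", "-A", "INPUT", "-p", "tcp",
    "--tcp-flags", "ALL", "NONE",
    "-j", "LOG", "--log-prefix", "nmapNull: ",
]

# Illustrative prefix-to-logtype mapping (5005 for FIN scans is shown below;
# the other numbers here are placeholders).
PREFIX_TO_LOGTYPE = {
    "nmapNull": 5003,
    "nmapXmas": 5004,
    "nmapFin": 5005,
}

def install_rule():
    subprocess.check_call(NULL_SCAN_RULE)

def logtype_for(log_line):
    # Return the portscan logtype for a kernel log line, or None if it isn't one of ours.
    for prefix, logtype in PREFIX_TO_LOGTYPE.items():
        if prefix in log_line:
            return logtype
    return None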


Using our usual portscan mechanism of monitoring this log file, we can start to alert based on the log's prefix. If the log's prefix is `nmapNull`, we know that the log was generated by the iptables match rule associated with nmap's NULL scan. This process is repeated for all newly supported portscan detections. Below is an example of a FIN scan by nmap,



Followed by what we would expect to see being logged by our OpenCanary.


As you can see, logtype 5005 is tied to an nmap FIN scan event.

New Services

We have also added three new services that your OpenCanary can now emulate.

The first service is Redis. Redis "is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker". Our Redis service fakes a full-fledged Redis server, alerting on connect and on any subsequent commands sent to your OpenCanary.

The above image shows that the Redis service mimics real Redis behaviour closely: it returns errors for Redis commands that have not been given the correct number of arguments, and an authentication-required error for correctly crafted Redis commands.

And here we can see the logs generated by our previous Redis-related indiscretions: the commands we tried, along with their additional arguments.

The second service is Git. Git is a version-control system used (mainly) for code repositories, and organisations often run their own Git servers to allow collaboration. Our Git service fakes a Git server, alerting on git `clone` commands, which means someone is trying to grab that repository of code.


Here we have tried to clone a repository by connecting to the Git service run by OpenCanary. The interaction mimics that of an actual Git server and errors out in an expected way.


And in our OpenCanary logs, we will get the repository name that the attacker was looking for. This is helpful in understanding where your leak may be.
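A rough sketch of how a fake git:// listener can recover that name: the git daemon protocol opens with a single pkt-line request ("git-upload-pack /path\0host=…\0"), so a decoy server can read it, log the path, and refuse the clone. (Again, this is an illustration of the idea, not OpenCanary's implementation.)

import socketserver

def pkt_line(data: bytes) -> bytes:
    # git pkt-line framing: four hex digits of total length, then the payload.
    return f"{len(data) + 4:04x}".encode() + data

class FakeGitHandler(socketserver.StreamRequestHandler):
    def handle(self):
        length = int(self.rfile.read(4), 16)       # length prefix of the request pkt-line
        request = self.rfile.read(length - 4)      # e.g. b"git-upload-pack /secrets.git\0host=..\0"
        if request.startswith(b"git-upload-pack "):
            repo = request.split(b"\0", 1)[0][len(b"git-upload-pack "):].decode()
            print(f"git clone attempt from {self.client_address[0]} for {repo}")
        # Refuse the clone; the client sees "fatal: remote error: ..."
        self.wfile.write(pkt_line(b"ERR access denied or repository not exported\n"))

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9418), FakeGitHandler) as server:
        server.serve_forever()

Pointing a client at this with git clone git://<host>/secrets.git logs the attempted repository name and errors the clone out cleanly.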

The last new service is actually a potential factory of hundreds of TCP services: a service to build generic TCP banners. It allows you to create a TCP listener on an arbitrary port, select a banner to greet your client on connect, and control subsequent responses sent to the client. This lets you quickly model plaintext protocols.


In this example, we have set up our TCP banner service to mimic an SMTP server. We connect to the service and receive the first banner. When the client sends through more data to interrogate the service, she receives our data-received banner.


We then alert and log on the full interaction, creating a great audit trail for investigation.
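If you want a feel for what such a listener boils down to, here's a stripped-down sketch in Python. The port and banner strings are invented for this example, and the real module is driven by your OpenCanary config rather than hard-coded constants.

import socketserver

PORT = 2525                                                   # illustrative listener port
CONNECT_BANNER = b"220 mail.example.com ESMTP Postfix\r\n"     # sent on connect
DATA_RECEIVED_BANNER = b"250 OK\r\n"                           # sent after each client line

class BannerHandler(socketserver.StreamRequestHandler):
    def handle(self):
        print(f"Connection from {self.client_address[0]}:{self.client_address[1]}")
        self.wfile.write(CONNECT_BANNER)
        while True:
            data = self.rfile.readline()
            if not data:
                break
            print(f"Data received: {data!r}")                  # the full interaction gets logged
            self.wfile.write(DATA_RECEIVED_BANNER)

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", PORT), BannerHandler) as server:
        server.serve_forever()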

General Cleanup

We have also heard the cries (we are sorry!) and have touched up some of the configuration documentation on opencanary.org, along with adding a brand new OpenCanary logo.

Setting up an OpenCanary instance should be as simple as following the instructions over here.

Upgrading is dead simple too. Simply run `pip install opencanary --upgrade`. It will upgrade the version of your OpenCanary, but keep your old config. You can check out the new config file options over here.

Take it for a spin, you won’t be disappointed.

(We love Pull Requests and have been rumoured to respond with 0day Canary swag).

PS. If you want honeypots that deploy in minutes, with zero admin overhead, you really should check out our Thinkst Canary. It's one of the most loved devices in infosec, and you will see why.

(Better) Canary Alerts in Slack

One of the things that surprise new Canary customers is that we don't try particularly hard to keep customers looking at their consoles. (In fact, an early design goal for Canary was to make sure that our users didn't spend much time using our console at all).

We make sure that the console is pretty and functional, but we aren't trying to become a customer's "one pane of glass". We want the Canaries deployed and then strive to get out of your way. You decide where your alerts should go (email, SMS, API, webhooks, Syslog, SIEM app), set up your birds, and then don't visit your console again until a Canary chirps.


We have hundreds of customers who never log in to their consoles after the initial setup, and we're perfectly happy with this. Their alerts go to their destination of choice and that's what matters. Of these, dozens and dozens of customers rely heavily on getting their alerts piped into a Slack channel of their choice.

Getting your alerts into Slack is trivial:

  1. Create a channel in Slack
  2. Go to Setup, Webhooks, and select "Add Slack XXX"
  3. Select the channel you want your alerts to go to;
  4. (That's it! Your Slack integration is done!)


Until recently, alerts that went into Slack were simple one-way traffic, containing incident details.


While this suffices for most users, Max and Jay recently sat down to make it even better. Alerts into Slack now look like this:


You'll notice that, by default, potentially sensitive fields like passwords are now masked in Slack. This can be toggled on your Settings page. We're also including additional historical context to assist your responders.

Best of all though, you can now manage these alerts (Mark as seen and Delete) from right inside Slack, so you never have to log in to your Console.


Once an event has been acknowledged, the incident details will be visually "struck", and a new field will indicate the name of the person who ack'd it.


Clicking "Delete" will then collapse the now superfluous details, and will track the name of the deleting user.


So.. if your security team is using Slack, consider using the integration. It will take just seconds to set up, and should make your life a little easier.



A Week with Saumil (aka "The ARM Exploit Laboratory")

Last month we downed tools for a week as we hosted a private, on-site version of the well-regarded "ARM Exploit Laboratory" (by Saumil Shah). The class is billed as "a practical hands-on approach to exploit development on ARM based systems" and Saumil is respected worldwide, having delivered versions of the class at conferences like 44con, Recon and Blackhat for years.

It.absolutely.delivered!

With a quick refresher on ARM assembly and system programming on day-1, by day-2 everyone in the class was fairly comfortable writing their own shellcode on ARM. By the end of day-3 everyone was comfortable converting their payloads to ROP gadgets and by day-4 everybody had obtained reverse shells on emulated systems and actual vulnerable routers and IP-Cameras. Without any false modesty, this is due to Saumil's skill as an educator much more than anything else.

Pre-Class Preparation


While our Canary is used by security teams the world over, many people in the team have backgrounds in development (not security), so we felt we had some catching up to do. A few months before the class, we formed an #arm-pit Slack channel and started going through the excellent Azeria Labs chapters and challenges. (It's worth noting that Saumil's class managed to work for people on the team who were not taking part in our weekly #arm-pit sessions, but those of us who did the sessions were glad that we did anyway).

A special shout out to @anna, who didn't actually attend the ARM exploitation sessions but made sure that everything from food and drinks, to conference room and accommodation, was all sorted. Another reminder that great preparation makes for a great experience. Thank you @anna.

The Class


We've all sat in classes where the instructor raced ahead, and knowledge that we thought we had proved to be poorly understood when we needed to apply it. As the course progressed, each new concept was challenged with practical exercises. Each concept needed to be understood, as the following concepts (and exercises) would largely build on the prior knowledge. In this fashion, we quickly weeded out gaps in our knowledge, because practically we could not apply something we did not understand.

The addition of shellcode-restrictions (and processor mitigations) tested a particular way of thinking which seemed to come more naturally to those of us with a history of “breaking” versus “building”. The breakers learned some new tricks, but the builders learned some completely new ways of thinking. It was illuminating.

The class was chock-full of other little gems, from methodologies for debugging under uncertainty, to just how slickly Saumil shares his live thoughts with his students via a class webserver (which converts his ascii-art memory layout diagrams to prettier SVG versions in real time).





It's the mark of an experienced educator who has spotted the areas that students struggle with, and has built examples and tooling to help overcome them. We didn't just learn ARM exploitation from the class; it was a master class in professionalism and how to educate others.

Where to now?


A bunch of people now have a gleam in their eyes and have started looking at their routers and IoT devices with relish. Everyone has a much deeper understanding of memory corruption attacks and the current state of mitigation techniques.

Why did we take the course?


Team Thinkst is quite a diverse bunch and exploitation isn't part of anyone's day job. We do, however, place a huge emphasis on learning, and the opportunity to dedicate some time to bare metal, syscalls and shellcode was too good to pass up. We've taken group courses before, but this is the first time we've felt compelled to write it up. Two thumbs up! Will strongly recommend.

Using the Linux Audit System to detect badness

Security vendors have a mediocre track record in keeping their own applications and infrastructure safe. As a security product company, we need to make sure that we don’t get compromised. But we also need to plan for the horrible event that a customer console is compromised, at which point the goal is to quickly detect the breach. This post talks about how we use Linux's Audit System (LAS) along with ELK (Elasticsearch, Logstash, and Kibana) to help us achieve this goal.

Background

Every Canary customer has multiple Canaries on their network (physical, virtual, cloud) that report in to their console, which is hosted in AWS.


Consoles are single-tenant, hardened instances that live in an AWS region. This architecture choice means that a single customer console being compromised won't translate to a compromise of other customer consoles. (In fact, customers would not trivially even discover other customer consoles, but that's irrelevant for this post.)

Hundreds of consoles running the same stack afford us an ideal opportunity to perform fine-grained compromise detection in our fleet. Going into the project, we surmised that a bunch of servers doing the same thing with similar configs should mean we can detect and alert on deviations with low noise.

A blog post and tool by Slack's Ryan Huber pointed us in the direction of the Linux Audit System. (If you haven’t yet read Ryan's post, you should.)

LAS has been a part of the Linux kernel since at least 2.6.12. The easiest way to describe it is as an interface through which all syscalls can be monitored. You provide the kernel with rules for the things you're interested in, and it pushes back events every time something happens which matches your rules. The audit subsystem itself is baked into the kernel, but the userspace tools to work with it come in various flavours, most notably the official "auditd" tools, "go-audit" (from Slack) and Auditbeat (from Elastic).

Despite our love for Ryan/Slack, we went with Auditbeat, mainly because it played so nicely with our existing Elasticsearch deployment. It meant we didn't need to bridge syslog or logfiles to Elasticsearch, but could read from the audit netlink socket and send directly to Elasticsearch.

From Audit to ELK

Our whole set-up is quite straightforward. In the diagram below, let's assume we run consoles in two AWS regions, US-East-1 and EU-West-2.




We run:
  • Auditbeat on every console to collect audit data and ship it off to Logstash;
  • A Logstash instance in each AWS region to consolidate events from all consoles and ship them off to Elasticsearch;
  • Elasticsearch for storage and querying;
  • Kibana for viewing the data;
  • ElastAlert (Yelp) to periodically run queries against our data and generate alerts;
  • Custom Python scriptlets to produce results that can't be expressed in search queries alone.

So, what does this give us?

A really simple one is to know whenever an authentication failure occurs on any of these servers. We know that the event will be linked to PAM (the subsystem Linux uses for most user authentication operations) and we know that the result will be a failure. So, we can create a rule which looks something like this:

auditd.result:fail AND auditd.data.op:PAM*


What happens here then, is:
  1. Attacker attempts to authenticate to an instance;
  2. This failure matches an audit rule, is caught by the kernel's audit subsystem and is pushed via Netlink socket to Auditbeat;
  3. Auditbeat immediately pushes the event to our logstash aggregator;
  4. Logstash performs basic filtering and pushes this into Elasticsearch (where we can view it via Kibana);
  5. ElastAlert runs every 10 seconds and generates our alerts (Slack/Email/SMS) to let us know something bad(™) happened.






Let's see what happens when an attacker lands on one of the servers, and attempts to create a listener (because it’s 1999 and she is trying a bindshell).
In 10 seconds or less we get this:


which expands to this:
From here, either we expect the activity and dismiss it, or we can go to Kibana and check what activity took place.

Filtering at the Elasticsearch/ElastAlert level gives us several advantages. As Ryan pointed out, keeping as few rules/filters as possible on the actual hosts leaves a successful attacker in the dark about what we are looking for.

Unknown unknowns

ElastAlert also gives us the possibility of using more complex rules, like “new term”.

This allows us to trivially alert when a console makes a connection to a server we’ve never contacted before, or if a console executes a process which it normally wouldn’t.

Running auditbeat on these consoles also gives us the opportunity to monitor file integrity. While standard audit rules allow you to watch reads, writes and attribute changes on specific files, Auditbeat also provides a file integrity module which makes this a little easier by allowing you to specify entire directories (recursively if you wish).

This gives us timeous alerts the moment any sensitive files or directories are modified.



Going past ordinary alerts

Finally, for alerts which require computation that can't be expressed in search queries alone, we use Python scripts. For example, we implemented a script which queries the Elasticsearch API to obtain a list of hosts which have sent data in the last n minutes. By maintaining state between runs, we can tell which consoles have stopped sending audit data (either because the console experienced an interruption, or because Auditbeat was stopped by an attacker). Elasticsearch provides a really simple REST API, as well as some powerful aggregation features, which makes working with the data super simple.
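A minimal sketch of such a scriptlet might look like the following. The index pattern, field name, window and state file are assumptions for illustration, and it assumes the v7-style elasticsearch Python client; our production script differs.

import json
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch  # pip install elasticsearch

ES_HOST = "http://localhost:9200"
INDEX = "auditbeat-*"              # assumed index pattern
HOST_FIELD = "host.name"           # assumed field carrying the console hostname
WINDOW_MINUTES = 15
STATE_FILE = "/var/tmp/known_reporting_hosts.json"

def hosts_seen_recently(es):
    # Terms aggregation over hosts that shipped events inside the window.
    since = (datetime.now(timezone.utc) - timedelta(minutes=WINDOW_MINUTES)).isoformat()
    body = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": since}}},
        "aggs": {"hosts": {"terms": {"field": HOST_FIELD, "size": 10000}}},
    }
    result = es.search(index=INDEX, body=body)
    return {bucket["key"] for bucket in result["aggregations"]["hosts"]["buckets"]}

def main():
    es = Elasticsearch(ES_HOST)
    current = hosts_seen_recently(es)
    try:
        with open(STATE_FILE) as f:
            previous = set(json.load(f))
    except FileNotFoundError:
        previous = set()

    for host in sorted(previous - current):
        print(f"ALERT: no audit data from {host} in the last {WINDOW_MINUTES} minutes")

    # Hosts stay on the watch list until removed from the state file.
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(previous | current), f)

if __name__ == "__main__":
    main()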

Operations

Our setup was fairly painless to get up and running, and we centrally manage and configure all the components via SaltStack. This also means that rules and configuration live in our regular configuration repo and that administration overhead is low.

ELK is a bit of a beast, and the flow from hundreds of Auditbeat instances means that one can easily get lost in endless months of tweaking and optimizing. Indeed, if disk space is a problem, you might have to start this tweaking sooner rather than later, but we optimized instead for "shipping". After a brief period of tweaking the filters for obvious false positives, we pushed into production, and our technical team picks up the audit/Slack alerts as part of our regular monitoring.

Wrapping up

It’s a straightforward setup, and it does what it says on the tin (just like Canary!). Combined with our other defenses, the Linux Audit System helps us sleep a little more soundly at night. I'm happy to say that so far we've never had an interrupted night's sleep!

RSAC 2018 - A Recap...

This year we attended the RSAC expo in San Francisco as a vendor (with booth, swag & badge scanners!).

We documented the trip, its quirks, costs and benefits, along with some thoughts on the event.

Check it out, and feel free to drop us a note on the post or by tweeting at @ThinkstCanary.

Considering an RSAC Expo booth? Our Experience, in 5,000 words or less



A third party view on the security of the Canaries

(Guest post by Ollie Whitehouse)

tl;dr

Thinkst engaged NCC Group to perform a third party assessment of the security of their Canary appliance. The Canaries came out of the assessment well. When compared in a subjective manner to the vast majority of embedded devices and/or security products we have assessed and researched over the last 18 years, they were very good.

Who is NCC Group and who am I?

Firstly, it is prudent to introduce myself and the company I represent. My name is Ollie Whitehouse and I am the Global CTO for NCC Group. My career in cyber spans over 20 years in areas such as applied research, internal product security teams at companies like BlackBerry and, of course, consultancy. NCC Group is a global professional and managed security firm with its headquarters in the UK and offices in the USA, Canada, Netherlands, Denmark, Spain, Singapore and Australia to mention but a few.

What were we engaged to do?

Quite simply, we were tasked to see if we could identify any vulnerabilities in the Canary appliance that would have a meaningful impact on real-world deployments in real-world threat scenarios. The assessment was entirely white box (i.e. undertaken with full knowledge, code access, etc.).

Specifically the solution was assessed for:

  • Common software vulnerabilities
  • Configuration issues
  • Logic issues, including those involving the enrolment and update processes
  • General privacy and integrity of the solution

The solution was NOT assessed for:

  • The efficacy of Canary in an environment
  • The ability to fingerprint and detect a Canary
  • Operational security of the Thinkst SaaS

What did NCC Group find?

NCC Group staffed a team with a combined experience of over 30 years in software security assessments to undertake this review for what I consider a reasonable amount of time given the code base size and product complexity.

We found a few minor issues, including a few broken vulnerability chains, but overall we did not find anything that would facilitate a remote breach.

While we would never make any warranties, it is clear from the choice of programming languages, design and implementation that there is a defence-in-depth model in place. The primitives around cryptography usage are also robust, avoiding many of the pitfalls seen more widely in the market.

The conclusion of our evaluation is that the Canary platform is well designed and well implemented from a security perspective. Although there were some vulnerabilities, none of these were significant, none would be accessible to an unauthenticated attacker and none affected the administrative console. The Canary device is robust from a product security perspective based on current understanding.

So overall?

The device platform and its software stack (outside of the base OS) have been designed and implemented by a team at Thinkst with a history in code product assessments and penetration testing (a worthy opponent, one might argue), and this shows in the positive results from our evaluation.

Overall, Thinkst have done a good job and shown they are invested in producing not only a security product but also a secure product.

_________

<haroon> Are you a customer who wishes to grab a copy of the report? Mail us and we will make it happen.


Sandboxing: a dig into building your security pit

Introduction

Sandboxes are a good idea. Whether it's improving kids' immune systems or isolating your apps from the rest of the system, sandboxes just make sense. Despite their obvious benefits, they are still relatively uncommon. We think this is because they remain obscure to most developers, and we hope this post will fix that.

Sandboxes? What’s that?

Software sandboxes isolate a process from the rest of the system, constraining the process' access to the parts of the system that it needs and denying access to everything else. A simple example of this would be opening a PDF in (a modern version of) Adobe Reader. Since Adobe Reader now makes use of a sandbox, the document is opened in a process running in its own constrained world, so that it is isolated from the rest of the system. This limits the harm that a malicious document can cause, and is one of the reasons why malicious PDFs have dropped from being the number-one attack vector seen in the wild as more and more users updated to sandbox-enabled versions of Adobe Reader.

It's worth noting that sandboxes aren't magic; they simply limit the tools available to an attacker and limit an exploit's immediate blast radius. Bugs in the sandboxing process can still yield full access to key parts of the system, rendering the sandbox almost useless.

Sandboxes in Canary

Long time readers will know that Canary is our well-loved honeypot solution. (If you are interested in breach detection that’s quick to deploy and works, check us out at https://canary.tools/)


A Canary is a high quality, mixed interaction honeypot. It’s a small device that you plug into your network which is then able to imitate a large range of machines (a printer/ your CEO's laptop/ a file server, etc). Once configured it will run zero or more services such as SSH, Telnet, a database or Windows File Sharing. When people interact with these fake hosts and fake services, you get an alert (and a high quality signal that you should cancel your weekend plans).

Almost all of our services are implemented in a memory safe language, but in the event that customers want a Windows File Share, we rely on the venerable Samba project. (Before settling on Samba, we examined other SMB possibilities, like the excellent impacket library, but Samba won since our Canaries (and their file shares) can be enrolled into Active Directory too.) Since Samba runs as its own service and we don't have complete control over its internal workings, it becomes a prime candidate for sandboxing: we wanted to be able to restrict its access to the rest of the system in case it is ever compromised.

Sandboxing 101

As a very brief introduction to sandboxing we'll explain some key parts of what Linux has to offer (a quick Google search will yield far more comprehensive articles, but one interesting resource, although not Linux focused, is this video about Microsoft Sandbox Mitigations).

Linux offers several ways to limit processes, which we took into consideration when deciding on a solution that would suit us. When implementing a sandbox solution, you would choose a combination of these depending on your environment and what makes sense.


Control groups

Control groups (cgroups) limit and control access to, and usage of, resources such as CPU, memory, disk and network.


Chroot

This involves changing the apparent root directory on a file-system that the process can see. It ensures that the process does not have access to the whole file system, but only parts that it should be able to see. Chroot was one of the first attempts at sandboxes in the Unix world, but it was quickly determined that it wasn’t enough to constrain attackers.


Seccomp

Standing for "secure computing mode", this lets you limit the syscalls that a process can make. Limiting syscalls means that a process will only be able to perform the system operations you expect it to perform, so if an attacker compromises your application, they won't be able to run wild.


Capabilities

These are the set of privileged operations that can be performed on a Linux system. Some capabilities include setuid, chroot and chown. For a full list you can take a look at the source here. However, they're also not a panacea, and spender has shown (frequently) how a limited set of Capabilities can be leveraged into the full set.


Namespaces

Without namespaces, any process would be able to see all processes' system resource information. Namespaces virtualise resources like hostnames, user IDs or network resources, so that a process cannot see information from other processes.

Adding sandboxing to your application in the past meant using some of these primitives natively (which probably seemed hairy for most developers). Fortunately, these days, there are a number of projects that wrap them up in easy-to-use packages.



Choosing our solution

We needed to find a solution that would work well for us now, but would also allow us to easily expand once the need arises without requiring a rebuild from the ground up.

The solution we wanted would need to at least address seccomp filtering and a form of chroot/pivot_root. Filtering syscalls is easy to control if you can get the full syscall profile of a service, and once filtered, you can sleep a little safer knowing the service can't perform syscalls that it shouldn't. Limiting the view of the filesystem was another easy choice for us. Samba only needs access to specific directories and files, and lots of those files can also be set to read-only.

We evaluated a number of options, and decided that the final solution should:

  • Isolate the process (Samba)
  • Retain the real hostname
  • Still be able to interact with a non-isolated process
Another process had to be able to intercept Samba network traffic, which meant we couldn't put it in a network namespace without bringing that extra process in.

This ruled out something like Docker, as although it provided an out-of-the-box high level of isolation (which is perfect for many situations), we would have had to turn off a lot of the features to get our app to play nicely.

Systemd and nsroot (which looks abandoned) both focused more on specific isolation techniques (seccomp filtering for Systemd and namespace isolation for nsroot) but weren’t sufficient for our use case.

We then looked at NsJail and Firejail (Google vs Mozilla, although that played no part in our decision). Both were fairly similar and provided us with flexibility in terms of what we could limit, putting them a cut above the rest.

In the end, we decided on NsJail, but since they were so similar, we could have easily gone the other way, i.e. YMMV.


NsJail
NsJail, as simply stated in its overview, "is a process isolation tool for Linux" developed by the team at Google (though it's not officially recognised as a Google product). It provides isolation for namespaces, file-system constraints, resource limits, seccomp filters, cloned/isolated ethernet interfaces and control groups.

Furthermore, it uses kafel (another non-official Google product) which allows you to define syscall filtering policies in a config file, making it easy to manage/maintain/reuse/expand your configuration.

A simple example of using NsJail to isolate processes would be:

./nsjail -Mo --chroot /var/safe_directory --user 99999 --group 99999 -- /bin/sh -i
Here we are telling NsJail to:
-Mo:               launch a single process using clone/execve
 
--chroot:          set /var/safe_directory as the new root directory for the process

--user/--group:    set the uid and gid to 99999 inside the jail

-- /bin/sh -i:     our sandboxed process (in this case, launch an interactive shell)
We are setting our chroot to /var/safe_directory; it is a valid chroot that we have created beforehand. You can instead use --chroot / for your testing purposes (in which case you really aren't using the chroot at all).

If you launch this and run ps aux and id you’ll see something like the below:
$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
99999        1  0.0  0.1   1824  1080 ?        SNs  12:26   0:00 /bin/sh -i
99999       11  0.0  0.1   3392  1852 ?        RN   12:32   0:00 ps ux
$ id
uid=99999 gid=99999 groups=99999
What you can see is that you are only able to view processes initiated inside the jail.

Now let's try adding a filter to this:

./nsjail -Mo --chroot /var/safe_directory --user 99999 --group 99999 --seccomp_string 'POLICY a { ALLOW { write, execve, brk, access, mmap, open, newfstat, close, read, mprotect, arch_prctl, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, newstat, clone, wait4, rt_sigreturn, exit_group } } USE a DEFAULT KILL' -- /bin/sh -i
Here we are telling NsJail to:
-Mo:               launch a single process using clone/execve
 
--chroot:          set /var/safe_directory as the new root directory for the process

--user/--group:    set the uid and gid to 99999 inside the jail

--seccomp_string:  use the provided seccomp policy

-- /bin/sh -i:     our sandboxed process (in this case, launch an interactive shell)
If you try to run id now, you should see it fail. This is because we have not given it permission to use the required syscalls:
$ id
Bad system call
The idea for us then would be to use NsJail to execute smbd as well as nmbd (both are needed for our Samba setup) and only allow expected syscalls.

Building our solution
Starting with a blank config file, and focusing on smbd, we began adding restrictions to lock down the service.

First we built the seccomp filter list to ensure the process only had access to syscalls that were needed. This was easily obtained using perf:

perf record -e 'raw_syscalls:sys_enter' -- /usr/sbin/smbd -F
This recorded all syscalls used by smbd into perf's format. To output the syscalls in a readable list format we used:
perf script | grep -oP "(?<= NR )[0-9]+" | sort -nu
One thing to mention here is that syscalls can be named differently depending on where you look. Even just between `strace` and `nsjail`, a few syscall names have slight variations from the names in the Linux source. This means that if you use syscall names you won't be able to directly reuse the exact same list between different tools, but may need to rename a few of them. If you are worried about this, you can opt instead to use the syscall numbers. These are a robust, tool-independent way of identifying syscalls.

After we had our list in place, we set about limiting FS access as well as fiddling with some final settings in our policy to ensure it was locked down as tight as possible.

A rather interesting way to test that the config file was working as expected was to launch a shell using the config and test the protections manually:

./nsjail --config smb.cfg -- /bin/sh -i
Once the policy was tested and we were happy that smbd was running as expected, we did the same for nmbd.

With both services sandboxed we performed a couple of long running tests to ensure we hadn't missed anything. This included leaving the services running over the weekend as well as testing them out by connecting to them from different systems. After all the testing and not finding anything broken, we were happy to sign off.

What does this mean for us?

Most canned exploits against Samba expect a stock system with access to system resources. At some point in the future, when the next Samba 0-day surfaces, there's a good chance that generic exploits against our Samba will fail as they try to exercise syscalls we haven't expressly permitted. But even if an attacker were to compromise Samba and spawn a shell, that shell would be of limited utility, with a constrained view of the filesystem and the system in general.

What does this mean for you?
We stepped you through our process of implementing a sandbox for our Samba service. The aim was to get you thinking about your own environment and how sandboxing could play a role in securing your applications. We wanted to show you that it isn't an expensive or overly complicated task. You should try it, and if you do, drop us a note to let us know how it went!