Edgemesh’s “Clean Slate” Protocol:

Blowing up data centers … it’s OK, it’s by design

Iron Man’s suits aren’t people, and containers aren’t pets. Blow them up … you’ll thank us later.
In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.

The Pets vs. Cattle metaphor is used often in the Docker/Kubernetes/container world. It appears to date back to Bill Baker in 2011, and I think it’s a good metaphor that has held up well despite the continued rate of change in container land. At its core is the idea that cloud native applications should scale out and allow for graceful failover between multiple nameless instances.

At Edgemesh we run our entire platform on Docker, and we do most of the heavy lifting on the Joyent Public Cloud. We make heavy use of Joyent’s Autopilot pattern, which allows our containers to operate with a high degree of autonomy. We run our database systems inside containers as well, with on-the-metal performance via Triton. Finally, we ensure that state information (database files, log files, etc.) is stored on stable storage via Joyent Manta, with a secondary backup to Google’s Storage Platform. These are all fairly standard container-native practices these days.

But we do something which most people would find strange:

Every night we destroy every production container in every data center and redeploy the entire platform.

Yup. Every day.

In some ways this can be seen as our version of Netflix’s approach to ensuring application stability (e.g. the Chaos Monkey), and a daily wipe and reload of the entire platform certainly keeps us protected from some levels of force majeure failure … but the Clean Slate Protocol (as it’s come to be known) is really about three things:

  1. Ensuring consistent deployment methodology
  2. Decreasing the attack surface of the platform
  3. Ensuring consistent (and economical) self scaling

1. Ensuring consistent deployment methodology

When your entire platform is wiped out every day, no matter what, EVERYTHING has to be automated. For us, this ensures that nothing makes it into production without a fully automated bootstrap and recovery process in place.

With Joyent’s Autopilot pattern in place, every Docker instance has a preStart, health check, preStop and postStop process flow. We also make heavy use of Consul.io for automated service registration and discovery (and master process detection/election).

Below is an example for our API gateways, which are Node.js processes that facilitate the cross-mesh communication between WebRTC clients. The most complicated component here is the bootstrapping of the current SSL certificates for our domain. All stateful data (such as our certs) is stored on Joyent’s Manta storage platform (with backup to Google). When the API gateway Docker instance starts, the Autopilot process is the main application (PID 1). Autopilot reads our configuration file (JSON) and executes the preStart actions, which confirm that the required backend services are available and then load in the certificate data via a Manta GET (similar to an S3 GET). From there, once we have a public IP address, the Docker instance updates the DNS record and comes online.

When the instance is destroyed, Autopilot ensures we remove the DNS record for this instance and posts an alert to our Ops Slack channel.
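To make the ordering concrete, here is a minimal Python sketch of the start and stop flow described above. It is an illustration only, not the Autopilot configuration itself: the function names and the injected callables (is_healthy, fetch_certs, register_dns, remove_dns, alert) are hypothetical stand-ins for the Consul lookup, the Manta GET, the DNS update and the Slack post.

```python
from typing import Callable, Iterable, List


def pre_start(
    required_services: Iterable[str],
    is_healthy: Callable[[str], bool],
    fetch_certs: Callable[[], None],
    register_dns: Callable[[], None],
    attempts: int = 5,
) -> List[str]:
    """Run the start-up steps in order and return a log of what happened.

    Raises RuntimeError if a required backend never reports healthy, so the
    instance refuses to come online half-configured.
    """
    log: List[str] = []
    for service in required_services:
        # Poll the (hypothetical) discovery lookup a bounded number of times.
        if not any(is_healthy(service) for _ in range(attempts)):
            raise RuntimeError(f"backend not available: {service}")
        log.append(f"backend ok: {service}")
    fetch_certs()      # stand-in for the GET from stable storage
    log.append("certs loaded")
    register_dns()     # only after the certificates are in place
    log.append("dns registered")
    return log


def pre_stop(remove_dns: Callable[[], None], alert: Callable[[str], None]) -> None:
    """Tear-down mirror of pre_start: deregister first, then notify ops."""
    remove_dns()
    alert("instance stopping: DNS record removed")
```

A real hook would likely sleep or back off between health polls; that is left out here to keep the ordering visible.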

This model of ensuring that all data sits on a distributed stable storage platform (Manta) extends throughout our entire stack, even the database layer. Our database servers copy the historical database files on startup, and on shutdown ensure the day’s data is pushed to stable storage. This allows us to enforce daily integrity checks (on startup) and avoids the NFS mount anti-pattern.
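The text does not say how the startup integrity check is implemented. One common approach, shown here purely as an assumption, is to write a checksum manifest when the data is pushed on shutdown and to verify the copied files against it on startup. The function names below are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large database files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> Dict[str, str]:
    """On shutdown: record a checksum per file before pushing to storage."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(data_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """On startup: return the files that are missing or whose hash changed."""
    bad = []
    for name, expected in manifest.items():
        target = data_dir / name
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(name)
    return bad
```

An empty list from verify_manifest means the copied files match what was pushed; anything else is a reason to refuse to start.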

At Edgemesh, our end-of-day process starts at midnight UTC. A handy trick we use is that every Docker instance we run always operates on UTC.

# In base image: set timezone to UTC: we live in a UTC world.
RUN apt_in apt-utils xz-utils && echo UTC > /etc/timezone && dpkg-reconfigure --frontend noninteractive tzdata && apt_clean
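With every instance on UTC, working out when the next reset is due needs no per-datacenter timezone logic. A small sketch of that calculation, with hypothetical function names:

```python
from datetime import datetime, timedelta, timezone


def next_utc_midnight(now: datetime) -> datetime:
    """Return the first midnight UTC strictly after `now` (must be tz-aware)."""
    now_utc = now.astimezone(timezone.utc)
    start_of_day = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day + timedelta(days=1)


def seconds_until_reset(now: datetime) -> float:
    """Seconds from `now` until the next midnight UTC."""
    return (next_utc_midnight(now) - now).total_seconds()
```

For example, at 23:30 UTC seconds_until_reset returns 1800.0, regardless of where the datacenter physically sits.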

Database instances scale out (and replicate the day’s data across DCs), but on a Clean Slate each datacenter copies in all the logs from the prior day from every datacenter and performs its own integrity check. An example below shows the structure for database loads in Manta.

2. Decreasing the attack surface of the platform

Many years ago, a friend of mine from a three-letter agency told me his approach to security, and it stuck:

The bad guys are already in your system, so the question is how do you clean out the infestation? Whatever your process for a post-intrusion response, just do that every day.

We try really, really hard to make sure we write secure code, but nobody is perfect. Clean Slate offers some level of increased defense in that if an attacker penetrates a system, they have at most 24 hours to do their work before the board is reset (new instance). For our internet-facing systems we can easily lower this window (to, say, 4 hours). Our central points of concern are now our storage server (Manta) and our base Docker images themselves (which we always start with FROM scratch).
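The two windows mentioned above (24 hours by default, 4 hours for internet-facing systems) can be expressed as a simple age check. This is a sketch under assumptions: the Instance record and due_for_reset are hypothetical names, and the start times would have to come from whatever inventory the platform keeps.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    started_at: datetime      # tz-aware, UTC
    internet_facing: bool


# Windows taken from the text: 24 hours by default, 4 hours when exposed.
DEFAULT_WINDOW = timedelta(hours=24)
EXPOSED_WINDOW = timedelta(hours=4)


def due_for_reset(instances: List[Instance], now: datetime) -> List[str]:
    """Names of instances that have outlived their allowed lifetime."""
    due = []
    for inst in instances:
        window = EXPOSED_WINDOW if inst.internet_facing else DEFAULT_WINDOW
        if now - inst.started_at >= window:
            due.append(inst.name)
    return due
```

Anything returned by due_for_reset has given an attacker its full window and should be replaced with a fresh instance.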

3. Ensuring consistent (and economical) self scaling

This is an interesting side effect for us. When Clean Slate runs, the new instances start in the smallest possible sizes and configurations. Our instances can scale up (increase machine resources) and scale out, and do so automatically, but scaling down is much harder. With a daily reset per datacenter we get a nice, consistent daily reset of our instance sizes (and their corresponding costs). Clean Slate executes on one datacenter at a time in a coordinated effort, and self scaling takes place automatically. Below is an example of the scaling logs from production:

In the above logs, we can see the Clean Slate deploy events (in green), and then we see the corresponding scale requests arriving from specific services (highlighted). The result is exactly what you’d expect: on weekends and lower traffic days Edgemesh runs with minimal sized instances, and during times of high use the autoscaler ups the size. Simple. Painless right-sizing.
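The right-sizing effect can be modeled in a few lines. This is a toy sketch, not the autoscaler: the size names in SIZES are hypothetical placeholders for whatever instance packages are actually in use.

```python
from typing import Dict, List

# Hypothetical size ladder, smallest first; real package names will differ.
SIZES: List[str] = ["small", "medium", "large", "xlarge"]


def scale_up(current: str) -> str:
    """Move one step up the ladder, staying put at the largest size."""
    index = SIZES.index(current)
    return SIZES[min(index + 1, len(SIZES) - 1)]


def clean_slate(fleet: Dict[str, str]) -> Dict[str, str]:
    """A redeploy starts every service at the smallest size again."""
    return {service: SIZES[0] for service in fleet}
```

Because there is deliberately no scale_down, the daily clean_slate is the only thing that brings sizes (and costs) back to the floor.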

What does Clean Slate look like?

The actual code base is a (large) bash script that ensures we don’t take down data centers simultaneously and encodes other operational knowledge, but since Joyent Triton is one giant Docker host per datacenter, the core code is super small.
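As a hedged illustration of what such a core loop could look like (in Python rather than bash, and not the production script), the sketch below wipes and redeploys one datacenter at a time. All four callables are hypothetical hooks; in practice they would wrap whatever Docker commands target each datacenter.

```python
from typing import Callable, Iterable, List


def run_clean_slate(
    datacenters: Iterable[str],
    list_containers: Callable[[str], List[str]],
    destroy: Callable[[str, str], None],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Wipe and redeploy one datacenter at a time; return those completed.

    Stops at the first datacenter that fails its health check, so the
    remaining datacenters keep serving on yesterday's containers.
    """
    completed: List[str] = []
    for dc in datacenters:
        for container in list_containers(dc):
            destroy(dc, container)
        deploy(dc)
        if not is_healthy(dc):
            break
        completed.append(dc)
    return completed
```

Stopping on the first unhealthy datacenter is a design choice made for this sketch: it trades a complete nightly reset for the guarantee that a bad deploy never reaches every datacenter.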


Until next time!

If you’d like more details on Clean Slate, please let us know on Twitter @EdgemeshInc.