
Vagrant, Docker and Ansible. WTF?

Given that we’re building a SaaS that helps our clients manage their infrastructure, our team is pretty familiar with leveraging VMs and configuration management tools. We’ve actually been heavy users of Vagrant and Ansible for the past year, and it’s helped us tremendously in normalizing our development process.

As our platform grew in complexity, some additional needs emerged:

  • Containerization; we needed to be able to safely execute custom, and potentially harmful, code.
  • Weight; as we added more sub-systems, full-blown VMs proved hard to juggle when testing and developing.

And that’s why we ended up adding Docker to our development workflow. We were already familiar with it (as it powers some parts of the infrastructure) and knew there would be obvious wins. In practice, we are shipping Docker containers in a main Vagrant image and drive some of the customization and upgrade with Ansible.

We’ll probably write something about this approach in the coming weeks, but given the amount of confusion there is around what these technologies are, and how they’re used, we thought we’d give you a quick tour on how to use them together.

Let’s get started.


You’ve probably heard about Vagrant; a healthy number of people have been writing about it in the past 6 months. For those of you who haven’t, think of it as a VM without the GUI. At its core, Vagrant is a simple wrapper around Virtualbox/VMware.

A few interesting features:

  • Boatloads of existing images are available.
  • Snapshot and package your current machine to a Vagrant box file (and, consequently, share it back).
  • Ability to fine tune settings of the VM, including things like RAM, CPU and APIC.
  • Vagrantfiles. These allow you to set up your box on init: installing packages, modifying configuration, moving code around.
  • Integration with CM tools like Puppet, Chef and Ansible.

Let’s get it running on your machine:

  1. First, download Vagrant and VirtualBox.

  2. Second, let’s download an image, spin it up and SSH in:

     $ vagrant init precise64
     $ vagrant up
     $ vagrant ssh
  3. There’s no 3.

  4. There’s a 4 if you want to access your (soon to be) deployed app; you will need to dig around the Vagrant documentation to set up port forwarding and proper networking, and manually update your Vagrantfile.


Docker is a Linux container engine, written in Go (yay!) and based on lxc (self-described as “chroot on steroids”) and AUFS. Instead of providing a full VM, like you get with Vagrant, Docker provides you with lightweight containers that share the same kernel and allow you to safely execute independent processes.

Docker is attractive for many reasons:

  • Lightweight; images are much lighter than full VMs, and spinning off a new instance is lightning fast (in the range of seconds instead of minutes).
  • Version control of the images, which makes it much more convenient to handle builds.
  • Lots of images (again), just have a look at the docker public index of images.

Let’s set up a Docker container on your Vagrant machine:

  1. SSH into Vagrant if you’re not in already:

     $ vagrant ssh
  2. Install Docker, as explained on the official website:

     $ sudo apt-get update
     $ sudo apt-get install linux-image-generic-lts-raring linux-headers-generic-lts-raring
     $ sudo reboot
     $ sudo sh -c "curl | apt-key add -"
     $ sudo sh -c "echo deb docker main > /etc/apt/sources.list.d/docker.list"
     $ sudo apt-get update
     $ sudo apt-get install lxc-docker
  3. Verify it worked by running your first container:

     $ sudo docker run -i -t ubuntu /bin/bash
  4. Great, but we’ll need more than a vanilla Linux. To add our dependencies, for example to run a Node.js + MongoDB app, we’re gonna start by creating a Dockerfile:

     FROM ubuntu
     # Fetch Nodejs from the official repo (binary .. no hassle to build, etc.)
     ADD /opt/
     # Untar and add to the PATH
     RUN cd /opt && tar xzf node-v0.10.19-linux-x64.tar.gz
     RUN ln -s /opt/node-v0.10.19-linux-x64 /opt/node
     RUN echo "export PATH=/opt/node/bin:$PATH" >> /etc/profile
     # A little cheat for upstart ;)
     RUN dpkg-divert --local --rename --add /sbin/initctl
     RUN ln -s /bin/true /sbin/initctl
     # Update apt sources list to fetch mongodb and a few key packages
     RUN echo "deb precise universe" >> /etc/apt/sources.list
     RUN apt-get update
     RUN apt-get install -y python git
     RUN apt-get install -y mongodb
     # Finally - we wanna be able to SSH in
     RUN apt-get install -y openssh-server
     RUN mkdir /var/run/sshd
     # And we want our SSH key to be added
     RUN mkdir /root/.ssh && chmod 700 /root/.ssh
     ADD /root/.ssh/authorized_keys
     RUN chmod 400 /root/.ssh/authorized_keys && chown root. /root/.ssh/authorized_keys
     # Expose a bunch of ports .. 22 for SSH and 3000 for our node app
     EXPOSE 22 3000
     ENTRYPOINT ["/usr/sbin/sshd", "-D"]
  5. Let’s build our image now:

     $ sudo docker build .
     # Missing file ... hahaha ! You need an ssh key for your vagrant user
     $ ssh-keygen
     $ cp -a /home/vagrant/.ssh/ .
     # Try again
     $ sudo docker build .
     # Great Success! High Five!
  6. Now, let’s spin off a container with that setup and log into it ($MY_NEW_IMAGE_ID is the last id the build process returned to you):

     $ sudo docker run -p 40022:22 -p 80:3000 -d $MY_NEW_IMAGE_ID
     $ ssh root@localhost -p 40022

You now have a Docker container, inside a Vagrant box (Inception style), ready to run a Node.js app.


Ansible is an orchestration and configuration management tool written in Python. If you want to learn more about Ansible (and you should…), we wrote about it a few weeks ago.

Let’s get to work. We’re now gonna deploy an app in our container:

  1. Install Ansible, as we showed you in our previous post.

  2. Prepare your inventory file (host):

     app ansible_ssh_host= ansible_ssh_port=40022
  3. Create a simple playbook to deploy our app (deploy.yml):

     # Assumption: the module lines below (git, shell, command, file) are a best guess;
     # REPO_URL is a placeholder for the repository of your app.
     - hosts: app
       user: root
       tasks:
         # Fetch the code from github
         - name: Ensure we got the App code
           git: repo=REPO_URL dest=/opt/node-express-mongoose-demo
         # NPM may or may not succeed, if you give it time, care, etc. it eventually works
         - name: Ensure the npm dependencies are installed
           shell: chdir=/opt/node-express-mongoose-demo
             /opt/node/bin/npm install
           ignore_errors: yes
         # We will assume no changes in the default sample - or we should consider templates instead
         - name: Ensure the config files of the app
           command: cp /opt/node-express-mongoose-demo/config/$item.example.js /opt/node-express-mongoose-demo/config/$item.js
           with_items:
             - config
             - imager
         # `initctl` is now linking to `true` and we have no access to services
         # Need to fake the start
         - name: Ensure mongodb data folders
           file: path=$item state=directory
           with_items:
             - /var/lib/mongodb
             - /var/log/mongodb
         # Super cheat combo !
         - name: Ensure mongodb is running
           shell: LC_ALL='C' /sbin/start-stop-daemon --background --start --quiet --chuid mongodb --exec  /usr/bin/mongod -- --config /etc/mongodb.conf
         # Cheating some more !
         - name: Ensure the App is running
           shell: chdir=/opt/node-express-mongoose-demo
             /opt/node/bin/npm start &
  4. Run that baby:

     $ ansible-playbook -i host deploy.yml
  5. We’re done, point your browser at http://localhost:80 - assuming you have performed the redirection mentioned in the initial setup of your vagrant box.

That’s it. You’ve just deployed your app on Docker (in Vagrant).

Let’s wrap it up

So we just saw (roughly) how these tools can be used, and how they can be complementary:

  1. Vagrant will provide you with a full VM, including the OS. It’s great at providing you a Linux environment for example when you’re on MacOS.
  2. Docker is a lightweight VM of some sort. It will allow you to build contained architectures faster and cheaper than with Vagrant.
  3. Ansible is what you’ll use to orchestrate and fine-tune things. That’s what you want to structure your deployment and orchestration strategy.

It takes a bit of reading to get more familiar with these tools, and we’ll likely follow up on this post in the next few weeks. However, especially as a small team, this kind of technology allows you to automate and commoditize huge parts of your development and ops workflows. We strongly encourage you to make that investment. It has helped us tremendously increase the pace and quality of our throughput.

ZooKeeper vs. Doozer vs. Etcd

While our platform is fast approaching a public release, the team has been dealing with an increasingly complex infrastructure. We recently faced an interesting issue: how do you share configuration across a cluster of servers? More importantly, how do you do so in a resilient, secure, easily deployable and speedy fashion?

That’s what got us to evaluate some of the options available out there: ZooKeeper, Doozer and etcd. These tools all solve similar sets of problems but their approaches differ quite significantly. Since we spent some time evaluating them, we thought we’d share our findings.

ZooKeeper, the old dog

ZooKeeper is the best-known (and oldest) project we’ve looked into. It’s used by a few big players (Rackspace, Yahoo, eBay, Youtube) and is pretty mature.

It was created by Yahoo to deal with distributed systems applications. I strongly recommend you read the “making of” if you’re interested in understanding where Yahoo came from when they wrote it.

It stores variables in a structure similar to a file system, an approach that both Doozer and etcd still follow. With ZooKeeper, you maintain a cluster of servers communicating with each other that share the state of the distributed configuration data. Each cluster elects one “leader” and clients can connect to any of the servers within the cluster to retrieve the data. ZooKeeper uses its own protocol, Zab (ZooKeeper Atomic Broadcast), to keep that distributed storage consistent.
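
To make the file-system analogy concrete, here is a minimal in-memory sketch of a hierarchical key store. It is only an illustration under loud assumptions: the `ConfigTree` name and its `set`, `get` and `list` methods are hypothetical, they are not ZooKeeper’s API, and nothing in it is distributed.

```javascript
// Hypothetical in-memory sketch; not ZooKeeper's API, and nothing is distributed.
function createNode() {
    return { value: undefined, children: Object.create(null) };
}

// '/app/db/host' becomes ['app', 'db', 'host'].
function segments(path) {
    return path.split('/').filter(Boolean);
}

function ConfigTree() {
    this.root = createNode();
}

// Walk down the tree, creating intermediate nodes, and store the value.
ConfigTree.prototype.set = function(path, value) {
    var node = this.root;
    segments(path).forEach(function(name) {
        if (!node.children[name]) node.children[name] = createNode();
        node = node.children[name];
    });
    node.value = value;
};

// Return the node at `path`, or undefined when it does not exist.
ConfigTree.prototype.find = function(path) {
    var node = this.root;
    var parts = segments(path);
    for (var i = 0; node && i < parts.length; i++) {
        node = node.children[parts[i]];
    }
    return node;
};

ConfigTree.prototype.get = function(path) {
    var node = this.find(path);
    return node ? node.value : undefined;
};

// List the names directly under `path`, like `ls` on a directory.
ConfigTree.prototype.list = function(path) {
    var node = this.find(path);
    return node ? Object.keys(node.children).sort() : [];
};

var tree = new ConfigTree();
tree.set('/app/db/host', '10.0.0.5');
tree.set('/app/db/port', 27017);
console.log(tree.get('/app/db/host')); // 10.0.0.5
console.log(tree.list('/app/db'));     // [ 'host', 'port' ]
```

The real tools add what this sketch leaves out: replication across servers, leader election and change notifications.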

  • Pros:
    • Mature technology; it is used by some big players (eBay, Yahoo et al).
    • Feature-rich; lots of client bindings, tools, API
  • Cons:
    • Complex; ZooKeeper is not for the faint of heart. It is pretty heavy and will require you to maintain a fairly large stack.
    • It’s… Java; not that we especially hate Java, but it is on the heavy side and introduces a lot of dependencies. We wanted to keep our machines as lean as possible and usually shy away from dependency heavy technologies.
    • Apache; we have mixed feelings about the Apache Foundation. “Has Apache Lost Its Way?” summarizes it pretty well.

Doozer, kinda dead

Doozer was developed by Heroku a few years ago. It’s written in Go (yay!), which means it compiles into a single binary that runs without dependencies. On a side-note, if you’re writing code to manage infrastructure, you should spend some time learning Go.

Doozer got some initial excitement from the developer community but seems to have stalled more recently, with many forks being sporadically maintained and no active core development.

It is composed of a daemon and a client. Once you have at least one Doozer server up, you can add any number of servers and have clients get and set data by talking to any of the servers within that cluster.

It was one of the first practical implementations (as far as I know) of the Paxos algorithm. This means operations can be slow compared to dealing with a straight database, since cluster-wide consensus needs to be reached before committing any operation.
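
The cost of consensus comes from waiting for a majority of the cluster to acknowledge each write. The sketch below only illustrates that majority rule; it is neither Doozer’s code nor a Paxos implementation, and the helper names (`quorumSize`, `isCommitted`, `tolerableFailures`) are hypothetical.

```javascript
// Smallest number of servers that forms a majority of the cluster.
function quorumSize(clusterSize) {
    return Math.floor(clusterSize / 2) + 1;
}

// A write counts as committed once a majority has acknowledged it.
function isCommitted(acks, clusterSize) {
    return acks >= quorumSize(clusterSize);
}

// How many servers can fail while a majority can still be reached.
function tolerableFailures(clusterSize) {
    return clusterSize - quorumSize(clusterSize);
}

console.log(quorumSize(5));        // 3
console.log(isCommitted(2, 5));    // false
console.log(tolerableFailures(5)); // 2
```

Every write pays at least one network round trip to a majority of servers, which is why a consensus-backed store is slower than a single local database.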

Doozer was a step in the right direction. It is simple to use and set up. However, after using it for a while we started noticing that a lot of its parts felt unfinished. Moreover, it wasn’t answering some of our needs very well (encryption and ACL).

  • Pros:
    • Easy to deploy, set up and use (Go, yay!)
    • It works; lots of people have actually used it in production.
  • Cons:
    • Pretty much dead: the core project hasn’t been active in a while (1 commit since May) and is pretty fragmented (150 forks…).
    • Security; no encryption and a fairly simple secure-word based authentication.
    • No ACL; and we badly needed this.


After experiencing the shortcomings of Doozer, we stumbled upon a new distributed configuration storage called etcd. It was first released by the CoreOS team a month ago.

Etcd and Doozer look pretty similar, at least on the surface. The most obvious technical difference is that etcd uses the Raft algorithm instead of Paxos. Raft is designed to be simpler and easier to implement than Paxos.

Etcd’s architecture is similar to Doozer’s. It does, however, store data persistently (it writes logs and snapshots), which was of value to us for some edge cases. It also has a better take on security, with CAs, certs and private keys. While setting it up is not straightforward, it adds convenience and peace of mind.

Beyond the fact that it answered some of our more advanced needs, we were seduced (and impressed) by the development pace of the project.

  • Pros:
    • Easy to deploy, set up and use (yay Go and yay HTTP interfaces!).
    • Data persistence.
    • Secure: encryption and authentication by private keys.
    • Good documentation (if a little bit obscure at times).
    • Planned ACL implementation.
  • Cons:
    • (Very) young project; interfaces are still moving pretty quickly.
    • Still not a perfect match, especially in the way that data is spread.

The DIY approach (yeah, right..?)

It is only fair that technical teams may rely on their understanding of their infrastructure and coding skills to get something that just works™ in place. We haven’t seriously considered this approach as we felt that getting security and distributed state sharing right was going to be a bigger endeavor than we could afford (the backlog is full enough for now).


In the end, we decided to give etcd a try. So far it seems to work well for our needs and the very active development pace seems to validate our choice. It has proven resilient and will likely hold well until we have the resources to either customize its data propagation approach, or build our own solution that will answer some needs it is not likely to answer (we’ve already looked into doing so with ZeroMQ and Go).

Code Reuse With Node.js

Code recycling

Any project that grows to a decent size will need to re-use parts of its code extensively. That often means, through the development cycle, a fair amount of rewrites and refactoring exercises. Elegant code re-use is hard to pull off.

With node.js, which we use quite a bit, the most common ways to do this often rely on prototype or class inheritance. The problem is, as the inheritance chain grows, managing attributes and functions can become quite complex.

The truth is, people usually just need the objects. This led us to adopt a certain form of object-based prototyping. We believe it to be leaner and more straightforward in most cases. But before we get there, let’s have a look at how people usually approach this issue.

The “Function copy”

Usually in the form of this[key] = that[key]. A quick example:

var objectA = {
    lorem: 'lorem ipsum'
};
var objectB = {};

// Direct copy of a string, but you get the idea
objectB.lorem = objectA.lorem;
console.log(objectB); // Will output: { lorem: 'lorem ipsum' }
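
Generalized to every key, the `this[key] = that[key]` form is a short loop. A minimal sketch follows; the `copy` helper and the sample objects are hypothetical names used for illustration only.

```javascript
// Copy every own enumerable property of `source` onto `target`.
function copy(target, source) {
    Object.keys(source).forEach(function(key) {
        target[key] = source[key];
    });
    return target;
}

var defaults = { lorem: 'lorem ipsum', dolor: 'sit amet' };
var copied = copy({}, defaults);
console.log(copied); // { lorem: 'lorem ipsum', dolor: 'sit amet' }
```

Note that this copies values, so a getter on the source ends up as a plain value on the target.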

Crude, but it works. Next:


The previous method may work with simple structures, but it won’t hold when your use cases become more complex. That’s when I usually call my buddy Object.defineProperty():

var descriptor = Object.getOwnPropertyDescriptor;
var defineProp = Object.defineProperty;

var objectA = {};
var objectB = {};
var objectC = {};

objectA.__defineGetter__('lorem', function() {
    return 'lorem ipsum';
});
console.log(objectA); // Will output: { lorem: [Getter] }

// Direct copy, which copies the result of the getter.
objectB.lorem = objectA.lorem;
console.log(objectB); // Will output: { lorem: 'lorem ipsum' }

// Copying with Object.defineProperty(), and it copies the getter itself.
defineProp(objectC, 'lorem', descriptor(objectA, 'lorem'));
console.log(objectC); // Will output: { lorem: [Getter] }
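
The same idea extends to a whole object: walk the own properties and redefine each one from its descriptor. This is a minimal sketch; `extendProperties` is a hypothetical helper written for this post, not the function of the same spirit shipped by any library.

```javascript
// Copy every own property of `source` to `target`, descriptor included,
// so getters and setters are copied as accessors rather than as values.
function extendProperties(target, source) {
    Object.getOwnPropertyNames(source).forEach(function(name) {
        Object.defineProperty(target, name, Object.getOwnPropertyDescriptor(source, name));
    });
    return target;
}

var calls = 0;
var counter = {};
Object.defineProperty(counter, 'next', {
    get: function() { return ++calls; },
    enumerable: true,
    configurable: true
});

var clone = extendProperties({}, counter);
console.log(clone.next); // 1
console.log(clone.next); // 2 (the getter itself was copied, so it runs again)
```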

I often use a library for that. A couple examples (more or less the same stuff with different coding styles):

  1. es5-ext

     var extend = require('es5-ext/lib/Object/extend-properties');
     var objectA = {};
     var objectC = {};
     objectA.__defineGetter__('lorem', function() {
         return 'lorem ipsum';
     });
     extend(objectC, objectA);
     console.log(objectC); // Will output: { lorem: [Getter] }
  2. Carcass

     var carcass = require('carcass');
     var objectA = {};
     var objectC = {};
     objectA.__defineGetter__('lorem', function() {
         return 'lorem ipsum';
     });
     console.log(objectC); // Will output: { mixin: [Function: mixin], lorem: [Getter] }

Slightly better, but not optimal. Now, let’s see what we end up doing more and more often:

Prototyping through objects

The basic idea is that we prepare some functions, wrap them into an object which then becomes a “feature”. That feature can then be re-used by simply merging it with the targeted structure (object or prototype).

Let’s take the example of the loaderSync script in Carcass:

module.exports = {
    source: source,
    parser: parser,
    reload: reload,
    get: get
};

function get() {
    // ...
}

Once you copy the functions to an object, this object becomes a “loader” that can load a “source” synchronously with a “parser”. A “source” can be a file path and the “parser” can be simply Node.js’ require function.
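
To show the mechanics end to end, here is a minimal, self-contained sketch of such a “feature”. It is loosely modelled on the description above and is not the actual Carcass implementation: the method bodies, the `_source`, `_parser` and `_cache` fields and the `mixin` helper are all assumptions.

```javascript
// Hypothetical "loader" feature; NOT the actual Carcass implementation.
var loaderFeature = {
    // Getter/setter for the source (e.g. a file path).
    source: function(value) {
        if (!arguments.length) return this._source;
        this._source = value;
        return this;
    },
    // Getter/setter for the parser (e.g. require).
    parser: function(value) {
        if (!arguments.length) return this._parser;
        this._parser = value;
        return this;
    },
    // Parse the source again and cache the result.
    reload: function() {
        var parse = this._parser;
        this._cache = parse(this._source);
        return this._cache;
    },
    // Return the cached result, loading it on first access.
    get: function() {
        return this.hasOwnProperty('_cache') ? this._cache : this.reload();
    }
};

// Merging the feature into any object turns it into a loader.
function mixin(target, feature) {
    Object.keys(feature).forEach(function(key) {
        target[key] = feature[key];
    });
    return target;
}

var loader = mixin({}, loaderFeature);
loader.source('{"port": 3000}').parser(JSON.parse);
console.log(loader.get()); // { port: 3000 }
```

Here the source is a JSON string and the parser is `JSON.parse` so the example runs without touching the file system; a file path with `require` as the parser works the same way.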

Let’s now see how to use this with a couple object builders. Once again, I’ll borrow an example from Carcass; the loaderSync benchmark script. The first builder generates a function and copies the methods from what we’ve prepared. The second one copies the methods to the prototype of a builder class:


function LoaderA(_source) {
    function loader() {
        return loader.get();
    }
    loader.mixin = mixin;
    // ...
    return loader;
}

function LoaderC(_source) {
    if (!(this instanceof LoaderC)) return new LoaderC(_source);
    // ...
}
LoaderC.prototype.mixin = mixin;


Here we can see the two approaches. Let’s compare them quickly:

Feature                      | Loader A             | Loader C
Instantiating                | var a = LoaderA(...) | var c = LoaderC(...) or var c = new LoaderC(...)
Appearance                   | Generates a function | Builds a typical instance which is an object
Invoking directly            | a() or a.get()       | c.get()
Invoking as a callback       | ipsum(a)             | ipsum(c.get.bind(c))
Performance of instantiating | -                    | 100x faster
Performance of invoking      | idem                 | idem

(Check it yourself by benchmarking Carcass with make bm.)
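
The “invoking as a callback” row is the one that tends to surprise people, so here is a small runnable sketch of it. The two builders below are simplified, hypothetical stand-ins for the ones compared above (they are not the Carcass benchmark code), and `ipsum` is just a consumer that calls whatever it is given.

```javascript
// Shared method; relies on `this` pointing at the loader.
function get() {
    return 'loaded from ' + this.source;
}

// Builder A: returns a function carrying its own methods.
function LoaderA(source) {
    function loader() {
        return loader.get();
    }
    loader.source = source;
    loader.get = get;
    return loader;
}

// Builder C: a classic constructor with methods on the prototype.
function LoaderC(source) {
    if (!(this instanceof LoaderC)) return new LoaderC(source);
    this.source = source;
}
LoaderC.prototype.get = get;

// A consumer that just invokes the callback it receives.
function ipsum(callback) {
    return callback();
}

var a = LoaderA('a.json');
var c = LoaderC('c.json');

console.log(ipsum(a));             // loaded from a.json
console.log(ipsum(c.get.bind(c))); // loaded from c.json
```

The function-based loader can be handed over as-is because it closes over itself, while the prototype-based one needs `bind` so that `this` still points at the instance.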

“Protos” and beyond

That last approach is gaining traction among our team; we prepare functions for our object builders (which, by the way, we call “protos”). While we still choose to use prototypes in some occurrences, it is mainly because it is faster to get done. For the sake of convenience, we also sometimes rely on functions rather than objects to invoke our “protos”, however keep in mind that this is a performance trade-off.

I’ll wrap this up mentioning one more method we use, admittedly less often: “Object alter”. The idea is to rely on an “alter” function designed to change objects passed to it. This is sometimes also called a “mixin”. An example from visionmedia’s trove of awesomeness on Github:


module.exports = function(obj){

    obj.settings = {};

    obj.set = function(name, val){
        // ...
    };

    return obj;
};
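
For a feel of how an “alter” function is used, here is a self-contained sketch in the same shape. It is an assumption-laden illustration rather than the library’s code: the `configurable` name and the bodies of `set` and `get` are made up for the example.

```javascript
// Hypothetical "alter" function: adds simple settings support to any object.
// The bodies of set() and get() are assumptions made for illustration.
function configurable(obj) {
    obj.settings = {};

    obj.set = function(name, val) {
        this.settings[name] = val;
        return this; // allow chaining
    };

    obj.get = function(name) {
        return this.settings[name];
    };

    return obj;
}

var app = configurable({ name: 'demo' });
app.set('port', 3000).set('env', 'production');
console.log(app.get('port')); // 3000
```

The object keeps its own identity and properties; the function only bolts the extra behaviour onto it.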


Ansible Simply Kicks Ass

The team has been putting quite a few tools to the test over the years when it comes to managing infrastructures. We’ve developed some ourselves and have adopted others. While the choice to use one over another is not always as clear-cut as we’d like (I’d love to rant about monitoring but will leave that for a later post), we’ve definitely developed kind of a crush for Ansible in the past 6 months. We went through years of using Puppet, then Chef and more recently Salt Stack, before Ansible gained unanimous adoption among our team.

What makes it awesome? Well, off the top of my head:

  • It’s agent-less and works by default in push mode (that last point is subjective, I know).
  • It’s easy to pick up (honestly, try and explain Chef or Puppet to a developer and see how long that takes you compared to Ansible).
  • It’s just Python. It makes it easier for people like me to contribute (Ruby is not necessarily that mainstream among ops) and also means minimal dependency on install (Python is shipped by default with Linux).
  • It’s picking up steam at an impressive pace (I believe we’re at 10 to 15 pull requests a day).
  • And it has all of the good stuff: idempotence, roles, playbooks, tasks, handlers, lookups, callback plugins

Now, Ansible is still very much in its infancy and some technologies may not yet be supported. But there are a great many teams pushing hard on contributions, including us. In the past few weeks, for example, we’ve contributed both Digital Ocean and Linode modules. And we have a lot more coming, including some experiments with Vagrant.

Now, an interesting aspect of Ansible, and one that makes it so simple, is that it comes by default with a tool-belt. Understand that it is shipped with a range of modules that add support for well known technologies: EC2, Rackspace, MySQL, PostgreSQL, rpm, apt and more. This now includes our Linode contribution. That means that with the latest version of Ansible you can spin off a new Linode box as easily as:

ansible all -m linode -a "name='my-linode-box' plan=1 datacenter=2 distribution=99 password='p@ssword' "

Doing this with Chef would probably mean chasing down a knife plugin for adding Linode support, and would simply require a full Chef stack (say hello to RabbitMQ, Solr, CouchDB and a gazillion smaller dependencies). Getting Ansible up and running is as easy as:

pip install ansible

Et voila! You gotta appreciate the simple things in life. Especially the life of a sysadmin.

Goodbye node-forever, hello PM2


It’s no secret that the team has a crush on Javascript; node.js in the backend, AngularJS for our clients, there isn’t much of our stack that isn’t at least in part built with it. Our approach of building static clients and RESTful JSON APIs means that we run a lot of node.js and I must admit that, despite all of its awesomeness, node.js still is a bit of a headache when it comes to running in production. Tooling and best practices (think monitoring, logging, error traces…) are still lacking when compared to some of the more established languages.

So far, we had been relying on the pretty nifty node-forever. Great tool, but a few things were missing:

  • Limited monitoring and logging abilities,
  • Poor support for process management configuration,
  • No support for clusterization,
  • Aging codebase (which meant frequent failures when upgrading Node).

This is what led us to write PM2 in the past couple months. We thought we’d give you a quick look at it while we’re nearing a production ready release.

So what’s in the box?

First things first, you can install it with npm:

npm install -g pm2

Let’s open things up with the usual comparison points:

  • Keep alive,
  • Log aggregation,
  • Terminal monitoring,
  • JSON configuration.

And now let me geek out a tad more about the main features.

Native clusterization

Node v0.6 introduced the cluster feature, allowing you to share a socket across multiple networked Node applications. Problem is, it doesn’t work out of the box and requires some tweaking to handle master and children processes.

PM2 handles this natively, without any extra code: PM2 itself acts as the master process and wraps your code into a special clustered process, just as Node.js does, in order to add some global variables to your files.
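
For comparison, this is roughly the kind of boilerplate you would otherwise write by hand with Node’s cluster module. It is a sketch under assumptions, not PM2’s implementation: `resolveInstances` and `startCluster` are hypothetical helpers, and the restart policy (fork a replacement whenever a worker exits) is deliberately naive.

```javascript
var cluster = require('cluster');
var os = require('os');

// Translate an "-i"-style value into a number of workers ('max' = one per CPU).
function resolveInstances(option) {
    if (option === 'max') return os.cpus().length;
    return parseInt(option, 10);
}

// In the master process, fork the workers and replace any that die.
// In a worker process, just run the application code.
function startCluster(option, app) {
    if (cluster.isMaster) {
        var count = resolveInstances(option);
        for (var i = 0; i < count; i++) cluster.fork();
        cluster.on('exit', function() {
            cluster.fork();
        });
    } else {
        app();
    }
}

// Usage (not executed here):
// startCluster('max', function() {
//     require('http').createServer(function(req, res) {
//         res.end('hello from worker ' + process.pid);
//     }).listen(3000);
// });
```

All workers listen on the same port; the cluster module takes care of sharing the underlying socket between them.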

To start a clustered app using all the CPUs you just need to type something like this:

$ pm2 start app.js -i max


$ pm2 list

Which should display an overview of your processes (ASCII UI FTW).

As you can see, your app is now forked into multiple processes depending on the number of CPUs available.

Monitoring a la termcaps-HTOP

It’s nice enough to have an overview of the running processes and their status with the pm2 list command. But what about tracking their resources consumption? Fear not:

$ pm2 monit

You should get the CPU usage and memory consumption by process (and cluster).


Disclaimer: node-usage doesn’t support MacOS for now (feel free to PR). It works just fine on Linux though.

Now, what about checking on our clusters and GC cleaning of the memory stack? Let’s consider you already have an HTTP benchmark tool (if not, you should definitely check WRK):

$ express bufallo     # Create an express app
$ cd bufallo
$ npm install
$ pm2 start app.js -i max
$ wrk -c 100 -d 100 http://localhost:3000/

In another terminal, launch the monitoring option:

$ pm2 monit


Realtime log aggregation

Now you have to manage multiple clustered processes: one crawling data, another processing stuff, and so on and so forth. That means logs, lots of them. You can still handle them the old fashioned way:

$ tail -f /path/to/log1 /path/to/log2 ...

But we’re nice, so we wrote the logs feature:

$ pm2 logs



So things are nice and dandy, your processes are humming and you need to do a hard restart. What now? Well, first, dump things:

$ pm2 dump

From there, you should be able to resurrect things from file:

$ pm2 kill      # let's simulate a pm2 stop
$ pm2 resurrect # All my processes are now up and running

API health endpoint

Let’s say you want to monitor all the processes managed by PM2, as well as the status of the machine they run on (and maybe even build a nice Angular app to consume this API…):

$ pm2 web

Point your browser at http://localhost:9615, aaaaand… done!

And there’s more

  • Full tests,
  • Generation of update-rc.d (pm2 startup), though still very alpha,
  • Development mode with auto restart on file change (pm2 dev), still very drafty too,
  • Log flushing,
  • Management of your applications fleet via JSON file,
  • Log uncaught exceptions in error logs,
  • Log of restart count and time,
  • Automated killing of processes exiting too fast.

What’s next?

Well first, you could show your love on Github (we love stars).

We developed PM2 to offer an advanced and complete solution for Node process management. We’re looking forward to getting more people to help us get there: pull requests are more than welcome. A few things are already on the roadmap, and we’ll get right to them once we have a stable core:

  • Remote administration/status checking,
  • Built-in inter-processes communication channel (message bus),
  • V8 GC memory leak detection,
  • Web interface,
  • Persistent storage for monitoring data,
  • Email notifications.

Special thanks to Makara Wang for concepts/tools and Alex Kocharin for advice and pull requests.

Automation And Friction

I’ll admit that the team is a lazy bunch; we like to forget about things, especially the hard stuff. Dealing with a complex process invariably leads one of us to vent about how “we should automate that stuff”. That’s what our team does day and night:

  1. Dumb things down, lower barriers of entry, and then
  2. Automate all the things!

This has transpired through every layer of our company, from engineering to operations. Recently we’ve started pushing on a third point, but first let me rant a bit.

The ever increasing surface of friction

The past few years have seen a healthy push on UI and UX. Even developer tools and enterprise software, historically less user-friendly, have started adopting that trend. We now have things like Github. Great.

This trend grew in parallel with the adoption of SaaS. SaaS are the results of teams focused on specific problems, with the user experience often being a key component (not to undervalue good engineering). It’s pretty standard for these services to offer an API for integration’s sake. Our CRM plays nicely with Dropbox, GMail and a gazillion other services. Again, great.

However, the success of SaaS means the surface of interfaces we’re dealing with is constantly stretching. This is far more difficult to overcome than poor UI or UX. Many of us have witnessed teams struggling to get adoption on a great tool that happens to be one too many. There’s not much you can do about it.

A bot to rule them all


Our team has tried a lot of different approaches over the years. We kicked the tires on a lot of products and ended up doing as usual:

  1. Simplify. For example, we use Github to manage most tasks and discussions, including operations (HR, admin, …), and marketing. We used Trello alongside Github for a while and we loved it. But it silo-ed the discussions. Everything from our employee handbook to tasks for buying snacks for the office is now on Github. It also had an interesting side effect on transparency, but I’ll talk about this another time.

  2. Automate. We automate pretty much everything we can. When you apply to one of our jobs by email, for example, we push the attachments to Dropbox (likely your resume) and create a ticket with the relevant information on Github. Zapier is great for this kind of stuff by the way.

  3. Make it accessible. That’s the most important point for us at this stage. Borat, our Hubot chat bot, is hooked up with most of our infrastructure and is able to pass on requests to the services we use as well as some of our automation. If one of us is awake, chances are you can find us on the chat, making it the most ubiquitous interface for our team:

  • Need to deploy some code on production or modify some configuration on a server? Ask Borat, he’ll relay your demands to the API.
  • Your latest commit broke the build? A new mail came from support? Expect to hear about it from Borat.
  • Need to use our time tracker? Just drop a message to the bot when you’re starting your task and let him know when you’re done.
  • Need to call for a SCRUM? Just mention the Github team you want to chat with and Borat will create a separate channel and invite the right people to join.
  • Somebody is at the door? Ask the bot to open it for you (you gotta love hacking on Raspberry Pi).

Anybody with access to our bot’s repository can add a script to hook him up to a new service. Git push, kill the bot and wait for him to come back to life with new skills. The tedious stuff ends up sooner or later scripted and one sentence away.
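
To give an idea of how small such a script can be, here is a sketch of a Hubot script written in plain Javascript. The “deploy” command, its wording and the `deployScript` name are made up for illustration (this is not one of Borat’s actual scripts); `robot.respond`, `msg.match` and `msg.send` are Hubot’s own API.

```javascript
// Hypothetical Hubot script: the command and the reply text are assumptions.
function deployScript(robot) {
    // Triggers when someone addresses the bot, e.g. "borat deploy api to production".
    robot.respond(/deploy (\S+) to (\S+)/i, function(msg) {
        var project = msg.match[1];
        var target = msg.match[2];
        // A real script would call the deployment API here.
        msg.send('Deploying ' + project + ' to ' + target + '...');
    });
}

module.exports = deployScript;
```

Drop a file like this in the bot’s scripts folder, push, restart the bot, and the new command is one chat message away.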

Really, try it. It’s worth the investment.

I Can Haz Init Script

Something went awfully wrong, and a rogue process is eating up all of the resources on one of your servers. You have no other choice but to restart it. No big deal, really; this is the age of disposable infrastructure after all. Except when it comes back up, everything starts going awry. Half the stuff supposed to be running is down and it’s screwing with the rest of your setup.


You don’t get to think about them very often, but init scripts are a key piece of a sound, scalable strategy for your infrastructure. It’s a mandatory best practice. Period. And there are quite a few things in the way of getting them to work properly at scale in production environments. It’s a tough world out there.

What we’re dealing with


Often enough, you’re gonna end up installing a service using the package manager of your distro: yum, apt-get, you name it. These packages usually come with an init script that should get you started.

Sadly, as your architecture grows in complexity, you’ll probably run into some walls. Wanna have multiple memcache buckets, or several instances of redis running on the same box? You’re out of luck buddy. Time to hack your way through:

  • Redefine your start logic,
  • Load one or multiple config files from /etc/default or /etc/sysconfig,
  • Deal with the PIDs, log and lock files,
  • Implement conditional logic to start/stop/restart one or more of the services,
  • Realize you’ve messed something up,
  • Same player shoot again.

Honestly: PITA.

Built from source

First things first: you shouldn’t be building from source (unless you really, really need to).

Now if you do, you’ll have to be thorough: there may be samples of init scripts in there, but you’ll have to dig them out. /contrib, /addons, … it’s never in the same place.

And that makes things “fun” when you’re trying to unscrew things on a box:

  • You figured out that MySQL is running from /home/user/src/mysql,
  • You check if there’s an init script: no luck this time,
  • You try to understand what exactly launched mysqld_safe,
  • You spend a while digging into the bash history smiling at typos,
  • You stumble on a script (uncommented, of course) in the home directory. Funny enough, it seems to be starting everything from MySQL, NGINX and php-fpm to the coffee maker.
  • You make a mental note to try and track down the “genius” who did that mess of a job, and get busy with converting everything to a proper init script.


Why existing solutions suck

Well, based on what we’ve just seen, you really only have two options:

  1. DIY; but if you’re good at what you do, you’re probably also lazy. You may do it the first couple times, but that’s not gonna scale, especially when dealing with the various flavors of init daemons (upstart, systemd…),
  2. Use that thing called “the Internet”; you read through forum pages, issue queues, gists and if you’re lucky you’ll find a perfect one (or more likely 10 sucky ones). Kudos to all those who shared their work, but you’ll probably be back to option 1.

We can do better than this

You’ll find a gazillion websites for pictures of kittens, but as far as I know, there is no authoritative source for init scripts. That’s just not right: we have to fix it. A few things I’m aiming for:

  • Scalable; allow for multiple instances of a service to be started at once from different config files (see the memcache/redis example),
  • Secure; ensure configtest is run before a restart/reload (because, you know, a faulty config file preventing the service from restarting is kind of a bummer),
  • Smart; ensuring for example that the cache is aggressively flushed before restarting your database (so that you don’t end up waiting 50 min for the DB to cleanly shut down).
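
As a rough illustration of the “Secure” point, the Python sketch below only restarts a service if its config test exits cleanly. safe_restart is a hypothetical helper, and the commands you pass in depend entirely on the service you are wrapping.

```python
import subprocess


def safe_restart(configtest_cmd, restart_cmd):
    """Run the config test first; restart only if it exits with status 0.

    Both arguments are argument lists, as expected by subprocess.run.
    Returns (restarted, reason): reason holds the config test output on failure.
    """
    check = subprocess.run(configtest_cmd, capture_output=True, text=True)
    if check.returncode != 0:
        # Leave the running service untouched and report why.
        return False, check.stderr.strip() or check.stdout.strip()
    subprocess.run(restart_cmd, check=True)
    return True, ""
```

With NGINX, for example, that could look like safe_restart(["nginx", "-t"], ["service", "nginx", "restart"]): a config that does not parse leaves the running instance alone.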

I’ve just created a repo where I’ll be dumping various init scripts that will hopefully be helpful to others. I’d love to get suggestions or help.

And by the way, things are not much better with applications, though we’re trying our best to improve things there too with things like pm2 (fresh and shiny, more about it in a later post).

Shanghai Hacker News Meetup Reboot

We’ll be having our usual Hacker News meetup at Abbey Road (45 Yueyang road, near Hengshan Lu) tonight starting 7:00 PM: come and meet entrepreneurs, technologists and likeminded individuals while sharing a couple drinks. The first round of drinks is on Wiredcraft.

Starting next week, we’ll be changing the format of this event a bit:

  • Hacker News dinner, every Tuesday from 7:00 to 8:00 PM; we’re still confirming the venue. We’ll be kicking off each dinner with a video of a tech or entrepreneurial talk (that you’re free to ignore if you’d rather chat with other attendees).
  • Hacker News meetup, every first Tuesday of the month from 6:30 to 10:00 PM (instead of the regular dinner); a longer event with an actual speaker on stage. We’re still working out the details, but we hope to sponsor some food and drinks with the help of Wiredcraft and hopefully other local startups.

We’ll be posting more information soon on the Shanghai Hacker News meetup website. Don’t hesitate to shoot us an email if you want to help out or sponsor.

Designing A RESTful API That Doesn't Suck

As we’re getting closer to shipping our first version and are joined by a few new team members, the team took the time to review the few principles we followed when designing our RESTful JSON API. A lot of these can be found on apigee’s blog (a recommended read). Let me give you the gist of it:

  • Design your API for developers first; they are the main users. In that respect, simplicity and intuitiveness matter.
  • Use HTTP verbs instead of relying on parameters (e.g. ?action=create). HTTP verbs map nicely with CRUD:
    • POST for create,
    • GET for read,
    • DELETE for remove,
    • PUT for update (and PATCH too).
  • Use HTTP status codes, especially for errors (authentication required, error on the server side, incorrect parameters)… There are plenty to choose from, here are a few:
    • 200: OK
    • 201: Created
    • 304: Not Modified
    • 400: Bad Request
    • 401: Unauthorized
    • 403: Forbidden
    • 404: Not Found
    • 500: Internal Server Error
  • Simple URLs for resources: first a noun for the collection, then the item. For example /emails and /emails/1234; the former gives you the collection of emails, the latter a specific email identified by its internal ID.
  • Use verbs for special actions. For example, /search?q=my+keywords.
  • Keep errors simple for the client but verbose in your logs (and use HTTP codes). We only send something like { "message": "Something terribly wrong happened" } with the proper status code (e.g. 401 if the call requires authentication) and log more verbose information (origin, error code…) in the backend for debugging and monitoring.
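
Here’s a small, framework-free Python sketch of how verbs and status codes play together, using the /emails collection from the list above. The handle function and the in-memory EMAILS store are made up for the example; a real API would sit behind an HTTP server and a database. It also returns 405 (Method Not Allowed), which is not in the short list above but is a standard code.

```python
from http import HTTPStatus
from itertools import count

EMAILS = {}        # in-memory stand-in for a real datastore
_ids = count(1)    # naive ID generator, good enough for a sketch


def error(status, message):
    """Errors stay simple: a status code plus a short message."""
    return status, {"message": message}


def handle(method, path, body=None):
    """Map an HTTP verb and path onto a CRUD operation; return (status, payload)."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or parts[0] != "emails" or len(parts) > 2:
        return error(HTTPStatus.NOT_FOUND, "Unknown resource")

    if len(parts) == 1:  # the collection: /emails
        if method == "GET":
            return HTTPStatus.OK, list(EMAILS.values())
        if method == "POST":
            if not isinstance(body, dict):
                return error(HTTPStatus.BAD_REQUEST, "Expected a JSON object")
            item_id = next(_ids)
            EMAILS[item_id] = {**body, "id": item_id}
            return HTTPStatus.CREATED, EMAILS[item_id]
        return error(HTTPStatus.METHOD_NOT_ALLOWED, "Method not allowed")

    # a single item: /emails/1234
    if not parts[1].isdigit() or int(parts[1]) not in EMAILS:
        return error(HTTPStatus.NOT_FOUND, "No such email")
    item_id = int(parts[1])
    if method == "GET":
        return HTTPStatus.OK, EMAILS[item_id]
    if method == "DELETE":
        return HTTPStatus.OK, EMAILS.pop(item_id)
    if method in ("PUT", "PATCH"):
        if not isinstance(body, dict):
            return error(HTTPStatus.BAD_REQUEST, "Expected a JSON object")
        # PUT replaces the item, PATCH merges into the existing one.
        base = EMAILS[item_id] if method == "PATCH" else {}
        EMAILS[item_id] = {**base, **body, "id": item_id}
        return HTTPStatus.OK, EMAILS[item_id]
    return error(HTTPStatus.METHOD_NOT_ALLOWED, "Method not allowed")
```

The point is less the code than the shape: the verb picks the operation, the status code tells the client what happened, and the error payload stays a one-liner.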

Relying on HTTP status codes and verbs should already help you keep your API calls and responses lean enough. Less crucial, but still useful:

  • JSON first, then extend to other formats if needed and if time permits.
  • Unix time, or you’ll have a bad time.
  • Prepend your URLs with the API version, like /v1/emails/1234.
  • Lowercase everywhere in URLs.
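
The versioning, lowercase and Unix time conventions are easy to capture in a couple of helpers. A minimal Python sketch, where resource_url, to_unix and the API_VERSION constant are hypothetical names picked for the example:

```python
from datetime import datetime, timezone

API_VERSION = "v1"  # assumed current version prefix


def resource_url(collection, item_id=None, version=API_VERSION):
    """Build a versioned path such as /v1/emails/1234 with a lowercase prefix."""
    parts = [version.lower(), collection.lower()]
    if item_id is not None:
        parts.append(str(item_id))  # IDs are left untouched
    return "/" + "/".join(parts)


def to_unix(moment):
    """Convert a timezone-aware datetime to Unix time (whole seconds)."""
    if moment.tzinfo is None:
        # A naive datetime would silently be read as local time.
        raise ValueError("naive datetime: attach a timezone first")
    return int(moment.timestamp())
```

Refusing naive datetimes is a design choice for this sketch: it is the ambiguity about time zones that gives you the “bad time” in the first place.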

Shanghai Open Source Meetup

Shanghai OS meetup

The March edition of the Shanghai Open Source meetup will happen at a new location near People Square. There are several well equipped rooms and we’re pretty excited to get started with the new format: 1 presentation of 20 to 30 minutes followed by a few workshops.

We’re looking for people to speak and lead the workshops on any Open Source technology of your choice. We’ll be leading a few ourselves with the help of the Wiredcraft folks. Expect some Chef, Sensu, Ansible, node.js, marionette.js, Raspberry Pi and other geek galore.

Check out the official website for more information (there’s an English version) and don’t hesitate to get in touch with us if you want to present, lead a workshop or help organize.

See you there in a couple weeks.

First 5 Minutes Troubleshooting A Server

Back when our team was dealing with operations, optimization and scalability at our previous company, we had our fair share of troubleshooting poorly performing applications and infrastructures of various sizes, often large (think CNN or the World Bank). Tight deadlines, “exotic” technical stacks and lack of information usually made for memorable experiences.

The cause of the issues was rarely obvious: here are a few things we usually got started with.

Get some context

Don’t rush onto the servers just yet: you need to figure out how much is already known about the server and the specifics of the issues. You don’t want to waste your time (trouble) shooting in the dark.

A few “must-haves”:

  • What exactly are the symptoms of the issue? Unresponsiveness? Errors?
  • When did the problem start being noticed?
  • Is it reproducible?
  • Any pattern (e.g. happens every hour)?
  • What were the latest changes on the platform (code, servers, stack)?
  • Does it affect a specific user segment (logged in, logged out, geographically located…)?
  • Is there any documentation for the architecture (physical and logical)?
  • Is there a monitoring platform? Munin, Zabbix, Nagios, New Relic… Anything will do.
  • Any (centralized) logs? Loggly, Airbrake, Graylog…

The last two are the most convenient sources of information, but don’t expect too much: they’re also the ones usually painfully absent. Tough luck, make a note to get this corrected and move on.

Who’s there?

$ w
$ last

Not critical, but you’d rather not be troubleshooting a platform others are playing with. One cook in the kitchen is enough.

What was previously done?

$ history

Always a good thing to look at, combined with the knowledge of who was on the box earlier on. Be responsible, by all means: being admin shouldn’t allow you to invade anyone’s privacy.

A quick mental note for later: you may want to set the environment variable HISTTIMEFORMAT to keep track of when those commands were run. Nothing is more frustrating than investigating an outdated list of commands.

What is running?

$ pstree -a
$ ps aux

While ps aux tends to be pretty verbose, pstree -a gives you a nice condensed view of what is running and who called what.

Listening services

$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp

I tend to prefer running them separately, mainly because I don’t like looking at all the services at the same time. netstat -nalp will do too, though. Even then, I wouldn’t omit the numeric option (IPs are more readable IMHO).

Identify the running services and whether they’re expected to be running or not. Look for the various listening ports. You can always match the PID of the process with the output of ps aux; this can be quite useful especially when you end up with 2 or 3 Java or Erlang processes running concurrently.

We usually prefer to have more or less specialized boxes, with a low number of services running on each one of them. If you see three dozen listening ports, you should probably make a mental note to investigate this further and see what can be cleaned up or reorganized.
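
If you’d rather script this check, the kernel exposes the same information in /proc/net/tcp. The Python sketch below parses that format and lists the IPv4 sockets in the LISTEN state. listening_sockets is a hypothetical helper, the sample is made up, and the address decoding assumes a little-endian host such as x86. IPv6 and UDP sockets live in /proc/net/tcp6 and /proc/net/udp and are ignored here.

```python
import socket
import struct

LISTEN = "0A"  # state code used for LISTEN in /proc/net/tcp

SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11111 1
   1: 0100007F:0CEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000   106        0 22222 1
   2: 0100007F:0CEA 0100007F:9C40 01 00000000:00000000 00:00000000 00000000   106        0 33333 1
"""


def listening_sockets(proc_net_tcp_text):
    """Return sorted (ip, port) pairs for IPv4 sockets in the LISTEN state."""
    found = set()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # The header row is skipped too: its fourth column is "st", not a state code.
        if len(fields) < 4 or fields[3] != LISTEN:
            continue
        hex_ip, hex_port = fields[1].split(":")
        # Assumption: little-endian host, so the 32-bit address is byte-swapped.
        ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
        found.add((ip, int(hex_port, 16)))
    return sorted(found)
```

On a live Linux box you would feed it the content of /proc/net/tcp instead of SAMPLE, then compare the result with the list of services you expect to find on that server.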


$ free -m
$ uptime
$ top
$ htop

This should answer a few questions:

  • Any free RAM? Is it swapping?
  • Is there still some CPU left? How many CPU cores are available on the server? Is one of them overloaded?
  • What is causing the most load on the box? What is the load average?
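
The load average only means something relative to the number of cores. A minimal Python sketch, assuming the usual /proc/loadavg layout (the 1, 5 and 15 minute averages come first); load_per_core is a hypothetical helper:

```python
def load_per_core(loadavg_text, cores):
    """Normalize the 1/5/15 minute load averages by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    one, five, fifteen = (float(value) for value in loadavg_text.split()[:3])
    return {"1m": one / cores, "5m": five / cores, "15m": fifteen / cores}
```

On Linux you could pass it the content of /proc/loadavg and os.cpu_count(). As a rule of thumb, values staying well above 1.0 per core suggest that processes are queuing up for CPU (or stuck waiting on IO).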


$ lspci
$ dmidecode
$ ethtool

There are still a lot of bare-metal servers out there; this should help with:

  • Identifying the RAID card (with BBU?), the CPU, the available memory slots. This may give you some hints on potential issues and/or performance improvements.
  • Is your NIC properly set? Are you running in half-duplex? At 10 Mbps? Any TX/RX errors?

IO Performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio

Very useful commands to analyze the overall performance of your backend:

  • Checking the disk usage: does the box have a filesystem/disk at 100% usage?
  • Is the swap currently in use (si/so)?
  • What is using the CPU: system? User? Stolen (VM)?
  • dstat is my all-time favorite. What is using the IO? Is MySQL sucking up the resources? Is it your PHP processes?
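
To see where the “system, user or stolen” question comes from, here is a minimal Python sketch that breaks down the aggregate cpu line of /proc/stat into percentages. cpu_breakdown is a hypothetical helper and the sample line is made up; the field order follows the proc(5) man page.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

SAMPLE = """\
cpu  50 0 20 20 5 0 0 5 0 0
cpu0 50 0 20 20 5 0 0 5 0 0
"""


def cpu_breakdown(proc_stat_text):
    """Return the share (in %) of each CPU state from the aggregate 'cpu' line."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            ticks = [int(value) for value in parts[1:1 + len(FIELDS)]]
            total = sum(ticks)
            return {name: 100.0 * tick / total for name, tick in zip(FIELDS, ticks)}
    raise ValueError("no aggregate 'cpu' line found")
```

Keep in mind these counters are cumulative since boot, so a single snapshot gives you an average over the whole uptime. For a picture of what is happening right now, take two snapshots a couple of seconds apart and work on the difference, which is roughly what mpstat and vmstat do for you.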

Mount points and filesystems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /  # beware not to kill your box

  • How many filesystems are mounted?
  • Is there a dedicated filesystem for some of the services? (MySQL by any chance..?)
  • What are the filesystem mount options: noatime? default? Has any filesystem been remounted as read-only?
  • Do you have any disk space left?
  • Are there any big (deleted) files that haven’t been flushed yet?
  • Do you have room to extend a partition if disk space is an issue?
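
Spotting a filesystem that got remounted read-only is easy to script. A minimal Python sketch that reads the /proc/mounts format (device, mount point, type, options); read_only_mounts is a hypothetical helper and the sample is made up:

```python
SAMPLE = """\
/dev/sda1 / ext4 rw,noatime,errors=remount-ro 0 0
/dev/sdb1 /var/lib/mysql ext4 ro,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec 0 0
"""


def read_only_mounts(proc_mounts_text):
    """Return (mount_point, device, fstype) for every mount carrying the 'ro' option."""
    found = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, fstype, options = fields[:4]
        # Compare whole options so that "errors=remount-ro" does not match.
        if "ro" in options.split(","):
            found.append((mount_point, device, fstype))
    return found
```

Some mounts are legitimately read-only, so treat the output as a list of things to double-check rather than a list of problems.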

Kernel, interrupts and network usage

$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack  # may take some time on busy servers
$ netstat
$ ss -s

  • Are your IRQs properly balanced across the CPUs? Or is one of the cores overloaded because of network interrupts, RAID card, …?
  • What is swappiness set to? 60 is good enough for workstations, but when it comes to servers this is generally a bad idea: you do not want your server to swap… ever. Otherwise your swapping process will be locked while data is read from or written to the disk.
  • Is conntrack_max set to a high enough number to handle your traffic?
  • How long do you maintain TCP connections in the various states (TIME_WAIT, …)?
  • netstat can be a bit slow to display all the existing connections; you may want to use ss instead to get a summary.

Have a look at Linux TCP tuning for some more pointers on how to tune your network stack.
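
For the IRQ question, eyeballing /proc/interrupts works on a 4-core box but gets painful on larger ones. Here is a minimal Python sketch that sums interrupts per CPU and tells you which one takes the biggest share. The helper names and the sample are made up, and the parsing assumes the usual layout: a header row of CPU names, then one row per interrupt source.

```python
SAMPLE = """\
           CPU0       CPU1
  0:         45          0   IO-APIC   2-edge      timer
 24:     900000        100   PCI-MSI 524288-edge      eth0
NMI:          0          0   Non-maskable interrupts
ERR:          0
"""


def interrupts_per_cpu(proc_interrupts_text):
    """Sum the interrupt counters of every row, per CPU column."""
    lines = proc_interrupts_text.strip().splitlines()
    cpus = lines[0].split()
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        counts = line.split()[1:1 + len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue  # rows like ERR: carry a single global counter
        for cpu, value in zip(cpus, counts):
            totals[cpu] += int(value)
    return totals


def busiest_cpu(totals):
    """Return (cpu_name, share_of_all_interrupts) for the most loaded CPU."""
    grand_total = sum(totals.values())
    cpu = max(totals, key=totals.get)
    return cpu, (totals[cpu] / grand_total if grand_total else 0.0)
```

In the made-up sample, CPU0 handles nearly all the interrupts from eth0: the kind of imbalance worth a closer look at IRQ affinity.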

System logs and kernel messages

$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth.log

  • Look for any error or warning messages; is the kernel complaining about the number of connections in your conntrack table being too high?
  • Do you see any hardware error, or filesystem error?
  • Can you correlate the time from those events with the information provided beforehand?


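$ ls /etc/cron*                   # system-wide cron files and directories
$ cat /etc/crontab /etc/cron.d/*  # what they actually run
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done  # per-user crontabs (run as root)

  • Is there any cron job that is running too often?
  • Is there a user’s crontab that is “hidden” from plain sight?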
  • Was there a backup of some sort running at the time of the issue?

Application logs

There is a lot to analyze here, but it’s unlikely you’ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:

  • Apache & Nginx; chase down access and error logs, look for 5xx errors, look for possible limit_zone errors.
  • MySQL; look for errors in the mysql.log, traces of corrupted tables, an InnoDB repair process in progress. Look at the slow query logs and determine whether there are disk/index/query issues.
  • PHP-FPM; if you have php-slow logs on, dig in and try to find errors (php, mysql, memcache, …). If not, turn them on.
  • Varnish; in varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?
  • HAProxy; what is your backend status? Are your health-checks successful? Do you hit your max queue size on the frontend or your backends?
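
Chasing 5xx errors in an access log is a good candidate for a tiny script. The Python sketch below counts them per status code and path, assuming the common/combined log format that Apache and NGINX use by default; server_errors is a hypothetical helper, and a custom log format would need a different regular expression.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it.
LINE_RE = re.compile(r'"(?:[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def server_errors(access_log_lines):
    """Count 5xx responses, keyed by (status, path)."""
    errors = Counter()
    for line in access_log_lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            errors[(match.group("status"), match.group("path"))] += 1
    return errors
```

Since the function takes any iterable of lines, you can hand it an open file object directly; errors.most_common(10) then gives you the worst offenders first.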


After these first 5 minutes (give or take 10 minutes) you should have a better understanding of:

  • What is running.
  • Whether the issue seems to be related to IO/hardware/networking or configuration (bad code, kernel tuning, …).
  • Whether there’s a pattern you recognize: for example a bad use of the DB indexes, or too many Apache workers.

You may even have found the actual root cause. If not, you should be in a good place to start digging further, with the knowledge that you’ve covered the obvious.

Shanghai JS Meetup

Shanghai JS

We’ll be hosting the Shanghai JavaScript meetup tomorrow at Wiredcraft’s office starting 1:30 PM. More information can be found on the Shanghai Open Source website.

It will be more of an informal one to help us gather feedback from the community and plan for the March edition that will take place in a much larger venue in People Square. Don’t miss it if you want to get involved. The format of the event will be the same as next month:

  • 1:30 PM to 2:00 PM, Welcoming attendees
  • 2:00 PM to 2:30 PM, Presentation from Makara Wang, “Asynchronous JS with Promise”
  • 2:30 PM to 3:00 PM, Q&A
  • 3:00 PM to 3:20 PM, break
  • From 3:20 PM, workshops: Marionette.js, Angular.js…

See you there!

Shanghai Hacker News Meetup

Maneki-neko staring contest

As usual, we’ll be going to the monthly Hacker News meetup thrown by Wiredcraft for all our hacker friends out there in Shanghai. If you’re into technology, entrepreneurship or simply looking for an interesting discussion, join us at Abbey Road (45 Yueyang road, near Hengshan Lu) tomorrow starting 7:00 PM. Look for the table with a maneki-neko (the lucky cat Vincent holds so graciously in the picture above).

More details on the Shanghai Hacker News meetup website.

Best Practices: It's Always Or Never (And Preferably Always)

Messy cables

It’s Monday morning. The development team needs a box and you’re already contemplating the gazillion other urgent tasks that need to be done on the existing infrastructure. Just that one time™, you’re going to forget about your own rules. You’re just gonna spawn an instance, set up the few services needed and be done with it. You’ll drop some of the usual time suckers: backup strategy, access rules, init scripts, documentation… You can’t just do the whole of it AND handle the rest of your day-to-day responsibilities. After all, it’s just a development server and you’ll probably fold it in a couple weeks, or you’ll clean it up once your plate is a tad less full.

A few weeks later, the box is still there and your backlog is far from looking less crowded. The development team just rolled out their production application on the same box. And things start crashing… badly.

After a couple of not so courteous emails from the dev team mentioning repeated crashes, you log into the box and the fun starts. You can’t figure out what services have been deployed, or how exactly they were installed. You can’t restore the database because you don’t know where the bloody backups are. You waste time finding out that CouchDB wasn’t started at boot. All of this while receiving emails of “encouragement” from your colleagues.

Just because of that “one time”. Except that it’s never just that one time.

Best practices are not freaking optional

I hear you: coming up with these best practices and sticking to them systematically is hard. It’s a big investment. But based on our common experience, it’s one you can’t afford not to make. The “quick and dirty that one time” approach will ultimately fail you.

A few things you should never consider skipping:

  • Document the hell out of everything as you go. You probably won’t have time to get it done once you’ve shipped, and you probably won’t remember what you did or why you did it a few weeks from now. Your colleagues will probably appreciate it too.

  • Off-site backups for everything. Don’t even think of keeping your backups on the same physical box. Disks fail (a lot) and storage like S3/Glacier is dirt cheap. Find a way to back up your code and data and stick to it.

  • Full setup and reliable sources. Avoid random AWS AMIs or RPM repositories. And when setting things up, go through the whole shebang: init script, dedicated running user, environment variables and such are not optional. Some of us also think that you shouldn't use rc.local for your Web services ever again.

Infrastructure As Code And Automation

Obviously, given what we’re working on, we’re pretty strong adopters of infrastructure as code and automation. What tools to use is a much larger discussion. Go have a look at the comments on the announcement of the new version of Chef to get an idea of what’s out there.

Ultimately these are just opinions, but behind them are concepts worth investing in. Capturing the work you do on your infrastructure in repeatable and testable code, and automating as much as you can, helps remove yourself from the equation. Doing so reduces the human factor and frees you from the repetitive boilerplate while you focus on the challenging tasks that only a creative brain can tackle.

Not building upon best practices is simply not an option. By skipping them, you fail to invest in the foundation for a more robust infrastructure and, more importantly, you deprive yourself of the ability to scale.

Picture from comedy_nose

Why We Dropped Swagger And I/O Docs

As we started investing in our new strategy at my previous company, we looked around for solutions to document APIs. It may not be the sexiest part of the project, but documentation is the first step to designing a good API. And I mean first as in “before you even start writing tests” (yes, you should be writing tests first too).

We originally went with a simple Wiki page on Github, which served us just fine in the past. But it quickly became clear that it wasn’t going to cut it. We started thinking about what good documentation is. We’re fans of the single page approach that the Backbone.js documentation illustrates well and clearly remembered Github and Stripe as easy and well organized resources. Some Googling later, we were contemplating Wordnik’s Swagger and Mashery’s I/O Docs. We later settled for I/O Docs as it is built with node.js and was more straightforward to set up (for us at least).

Once again, we hit a wall with this approach:

  1. No proper support for JSON body: we don’t do much with parameters and mostly send JSON objects in the body of our requests, using HTTP verbs for the different types of operations we perform on our collections and models in the backend. Swagger and I/O Docs fall short of support for it, letting you simply dump your JSON in a field: not ideal.

  2. You’re querying the actual API: to be fair, this is an intentional feature. Now some of you may find it interesting that your documentation allows users to easily run calls against your API. That’s what Flickr does with their API explorer, and we used to think it was pretty neat. But once we started using it, we saw the risks of so casually exposing API calls that can impact your platform (especially one that deals with your actual infrastructure). I guess you could set up a testing API for that very purpose, but that’s quite a bit of added complexity (and we’re lazy).

And that’s how we ended up putting together Carte, a very lightweight Jekyll-based solution: drop a new post for each API call, following some loose format and specifying a few bits of meta data in the YAML header (type of the method, path…) and you’re good to go.

Screenshot of Carte

We’re real suckers for Jekyll. We’ve actually used it to build quite a few static clients for our APIs. One of the advantages of this approach is that we can bundle our documentation with our codebase by simply pushing it on the gh-pages branch, and it pops up as a Github page. That’s tremendously important for us as it makes it very easy for developers to keep the documentation and the code in sync.

Carte is intentionally crude: have a look at the README and hack at will. Drop us a shout at @devo_ps if you need help or want to suggest a feature.

Farewell to Regular Web Development Approaches

At my previous company, we built Web applications for medium to large organizations, often in the humanitarian and non-profit space, facing original problems revolving around data. Things like building the voting infrastructure for the Southern Sudan Referendum helped us diversify our technical chops. But until a year ago, we were still mostly building regular Web applications; the user requests a page that we build and serve back.

Until we started falling in love with APIs and static clients.

It’s not that we fundamentally disliked the previous approach. We just reached a point where we felt our goals weren’t best served by this model. With lots of dynamic data, complex visualizations and a set of “static” interfaces, the traditional way was hindering our development speed and our ability to experiment. And so we got busy, experimenting at first with smaller parts of our projects (blog, help section, download pages…). We realized our use of complex pieces of software like content management systems had seriously biased our approach to problem solving. The CMS had become a boilerplate, an unchallenged dependency.

We’ve gradually adopted a pattern of building front-ends as static clients (be they Web, mobile or 3rd party integrations) combined with, usually, one RESTful JSON API in the backend. And it works marvelously, thanks in part to some awesome tech much smarter people figured out for us.

Most of what we build stems from this accelerated recovery and usually follows this order:

  1. We start by defining our API interface through user stories and methods,
  2. Both backend and front-end teams get cranking on implementing and solving the challenges specific to their part.

A lot of things happen in parallel and changes on one side rarely impact the other: we can make drastic changes in the UI without any change on the backend. And there were a lot of unexpected gains too, in security, speed and overall maintainability. More importantly, we’ve freed a lot of our resources to focus on building compelling user experiences instead of fighting a large piece of software to do what we want it to do.

If you haven’t tried building things that way yet, honestly, give it a try (when relevant); it may be a larger initial investment in some cases but you’ll come out on top at almost every level.

San Francisco: Here We Come!

It’s been exactly a week since I landed in San Francisco, and quite a bit has happened in the few days I’ve spent here:

  1. I attended Startup School and the reception before it at YCombinator’s office.

    Startup School

  2. I settled in my new place in the Mission. It’s official: we’re now headquartered in San Francisco.

    HQ

  3. We just submitted our application to YCombinator. We’re trying not to get our hopes up too high. That being said, no matter what the outcome is, we found the exercise very interesting; I’d recommend any team working on a product to try and fill in the application form. Sequoia Capital has a set of guidelines for business plans that are also very useful.

    DevOps intro on youtube

We’re still a few weeks away from our first beta but have already been welcoming several clients and talking to a few more about subscribing once we hit our first release this November. If you’re interested in being included in our beta testing program, drop us a line.

101 On DevOps And How We Plan On Helping

Here’s what usually happens: on one side the development team wants to push new features as fast as possible to production, while on the other side, operations are trying to keep things stable. Both teams are evaluated on criteria that are often directly conflicting. The stronger team wins one argument… until the next crisis. And we’re not even talking about other teams; they too have conflicting agendas to throw in the mix.

There’s no silver bullet for getting everyone to play nicely. That being said, having Dev and Ops on the path of cooperation is not impossible. DevOps is exactly this: fostering a culture of best practices and collaboration between these teams.

In a very similar fashion to what happened with the agile movement, a lot of tools and approaches emerged that can help: methodologies (Scrum, Kanban…), tools for configuration management (Chef, Puppet), orchestration, automation, logging… But at the core lies the need for nurturing a specific culture.

Operations teams have been slower to adopt these methodologies compared to development teams. The average system administrator spends more time working in FIFO mode, putting out fires, than making long term investments in automation or setting up best practices. Moreover, operations teams are usually faced with a logic of budget cuts and cost “optimizations”, compared to the larger R&D budget development teams seem to enjoy.

Even when conditions (and resources) are favorable to the growth of a proper culture, recruiting the right profiles can prove very challenging. We’re talking here about people with a wide range of skills, hands-on experience and strong collaboration and organizational skills. All of this takes time. Best practices are forged through years of experience.

That is why we’re building our platform. We’re trying to lower the barriers to entry to this field and help professionals scale themselves in their role. We’re a motivated team of engineers who have worked on both sides of the fence, with small to very large code bases and infrastructures. We hope to untangle the mess that often is infrastructure and application management, letting technical teams focus on the higher value tasks of the job.