Introduction

This is going to be a quick spew of code and configuration snippets. If you put it all together it might help you.

The Setup

  1. A process listening on a port that needs to be clustered on each host.
  2. A series of hosts running Nginx in front of those processes.
  3. A TCP Elastic Load Balancer (ELB) in EC2.

A TCP ELB is used instead of an HTTP(S) balancer because, in this setup, SSL must be terminated at Nginx with client certificate validation, so that the TLS standards, Perfect Forward Secrecy, client authentication and cipher suite are configured correctly and maintained in one place. Normally, an HTTP(S) ELB does the trick.
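
For reference, the SSL termination this implies looks roughly like the server block below. This is a minimal sketch, not the production config: the certificate paths and cipher list are assumptions, and the proxy_pass target is the application upstream defined in the Nginx configuration further down.

# a minimal sketch of SSL termination with client certificate
# validation; certificate paths and the cipher list are assumptions
server {
    listen 443 ssl;
    server_name _;

    ssl_certificate        /etc/nginx/ssl/server.crt;
    ssl_certificate_key    /etc/nginx/ssl/server.key;

    # validate client certificates against our CA
    ssl_client_certificate /etc/nginx/ssl/ca.crt;
    ssl_verify_client      on;

    # pin the protocol and prefer cipher suites that
    # provide Perfect Forward Secrecy
    ssl_protocols             TLSv1.2;
    ssl_ciphers               ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers on;

    location / {
        proxy_pass http://application/;
    }
}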

How it works

In short: start-stop-daemon is used to start and stop a cluster of node.js workers. The workers are reverse proxied by Nginx, along with a health check, which allows the ELB to poll the actual cluster.
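
For the start side, something along these lines should do it. This is a sketch only; the paths, pidfile and user are assumptions:

:~$ start-stop-daemon --start --background \
      --make-pidfile --pidfile /var/run/app.pid \
      --chuid www-data \
      --exec /usr/bin/node -- /opt/app/cluster.js

:~$ start-stop-daemon --stop --pidfile /var/run/app.pid --retry 10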

Application Design

The application needs to be split into 2 main components:

  1. The master, that opens file descriptors, monitors the cluster and does internal routing.
  2. The worker, that does the heavy lifting / computational tasks.

Node.js Clustering

Node has a built-in module for clustering. The parent process can open file descriptors and share them with its children, routing incoming traffic to the workers internally, round-robin by default. For most purposes, this works okay on its own.

The parent

cluster.js

var cluster = require('cluster')
  , os      = require('os');

cluster.setupMaster({
    exec: 'worker.js' // the file each forked worker will run
});

// fork one child per available cpu core
for (var i=0; i < os.cpus().length; i++) {
    cluster.fork();
}

The worker

worker.js

var cluster = require('cluster')
  , http    = require('http');

var server = http.createServer(function(request, response){

    // do something with request
    response.writeHead(200);
    response.end();

}).listen(3000, '127.0.0.1', function(){
    console.log('worker %d is now listening on port %d', process.pid, server.address().port);
});
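
Running node cluster.js should now log a "now listening" line for each CPU core, with every worker sharing port 3000 through the master.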

Messaging between the master and its workers

cluster.js

// on the fork event, grab a reference to the newborn
cluster.on('fork', function(child){
    // bind an event listener to it; process.send delivers a single
    // message argument (the second parameter is reserved for handles)
    child.on('message', function(message){
        console.log('message from pid %d: %s', child.process.pid, JSON.stringify(message));
    });
});

worker.js

if (cluster.isWorker) {
    // send one serialisable object rather than (type, data) pairs;
    // process.send's second argument is for socket / server handles
    process.send({
        type: 'messageType',
        attribute: 'value'
    });
}

Think of the children

cluster.js

// keep a list of processes that are busy
var busy_children = [];

function addBusy(pid) {
    var idx = busy_children.indexOf(pid);
    if (idx !== -1) return;
    busy_children.push(pid);
}

function removeBusy(pid) {
    var idx = busy_children.indexOf(pid);
    if (idx === -1) return;
    busy_children.splice(idx, 1);
}

cluster.on('fork', function(child){
    child.on('message', function(message){
        switch(message) {
            case 'worker.busy':
                addBusy(child.process.pid);
                break;
            case 'worker.free':
                removeBusy(child.process.pid);
                break;
        }
    });

    // 'exit' emits (code, signal); don't shadow the outer child reference
    child.on('exit', function(code, signal){
        removeBusy(child.process.pid);

        // start a new child to replace the one that just exited.
        cluster.fork();
    });
});

worker.js

function busy() {
    if (cluster.isWorker) {
        process.send('worker.busy');
    }
}

function free() {
    if (cluster.isWorker) {
        process.send('worker.free');
    }
}

function heavyLifting(done) {
    // do something
    done();
}

busy();
heavyLifting(function(){
    free();
});

Actually removing a worker from the round-robin scheduler

cluster.js

cluster.on('fork', function(child){
    child.on('message', function(message){
        switch (message) {
            case 'worker.busy':
                // take this worker out of the round-robin rotation
                child.disconnect();
                break;
        }
    });
});

worker.js

function busy() {
    // don't attempt if IPC is disconnected
    if (process.connected && cluster.isWorker) {
        process.send('worker.busy');
    }
}

function free() {
    // there's no way back into the round-robin once disconnected,
    // so exit and let the master's exit handler fork a replacement
    process.exit();
}

function heavyLifting(done) {
    // do something
    done();
}

busy();
heavyLifting(function(){
    free();
});

Giving us a usable interface for the ELB to check

The ELB HTTP health check only looks at the response status: 200 OK is up, anything else is down. 503 Service Unavailable suits the purpose best.

cluster.js

var http = require('http')
  , cluster = require('cluster');

// relies on the busy_children list maintained by the fork handler above
function imBusy() {
    // cluster.workers is an object keyed by worker id, not an array
    return (Object.keys(cluster.workers).length <= busy_children.length);
}

var server = http.createServer(function(request, response){

    if (imBusy()) {
        response.writeHead(503);
    } else {
        response.writeHead(200);
    }

    response.end();
}).listen(1337, '127.0.0.1', function(){
    console.log('health checks now listening on port %d', server.address().port);
});

Nginx configuration

upstream application {
    server 127.0.0.1:3000;
}

upstream healthcheck {
    server 127.0.0.1:1337;
}

server {
    listen 80;
    server_name _;

    root /var/www;
    index index.html;

    access_log /var/log/nginx/application.access.log;
    error_log  /var/log/nginx/application.error.log debug;

    location / {
        proxy_pass http://application/;
    }
}

server {
    listen 81;
    server_name _;

    access_log /var/log/nginx/healthcheck.access.log;
    error_log  /var/log/nginx/healthcheck.error.log debug;

    location / {
        proxy_pass http://healthcheck/;
    }
}
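
Port 80 carries application traffic; port 81 exposes nothing but the health check. Pointing the ELB at port 81 means it polls the cluster's own view of its capacity rather than merely confirming that Nginx is up.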

ELB configuration

:~$ aws elb describe-load-balancers | grep -i '\(health\|listen\)'
HEALTHCHECK 2   5   HTTP:81/    2   2
LISTENER    80  TCP    80  TCP
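
To produce that health check in the first place, something like the following should work. The load balancer name here is an assumption, and mapping the text-output columns above onto the named parameters is my reading of them:

:~$ aws elb configure-health-check \
      --load-balancer-name my-elb \
      --health-check Target=HTTP:81/,Interval=5,Timeout=2,HealthyThreshold=2,UnhealthyThreshold=2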

Outcome

Since this pattern was very common in our environment, I've wrapped up a couple of libraries that work nicely together. Though there are other libraries out there that do similar things, I felt they didn't really fit our application.

  • is-clusta for forking children and providing the health check.
  • is-daemon for handling signals, messaging and other daemonish things.