Node JS and Binary Data

UPDATE: None of this should be necessary, as FileReadStream in the latest node uses buffers by default. However, it appears that either I'm doing something wrong or the docs are out of date, as it doesn't work that way on node HEAD.

Two areas where the otherwise excellent Node.js is sadly lacking are the handling of binary data and large strings. In this post I'd like to go over some techniques for dealing with binary data in node, most of which revolve around V8's garbage collector and the fact that strings in node are not made for binary data; they're made for UTF-8 and UTF-16 data.

There are three main gory details that make working with data in Node.js a pain:

  1. Large Strings (> ~64K) are not your friend.
  2. Binary (and ASCII) data in a node string is stored one byte per character, in the low byte of each UTF-16 code unit.
  3. Binary data can be most efficiently stored in Node.js as a Buffer.
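
A small sketch of points 2 and 3, assuming a modern Node.js where `Buffer.from` and the `'latin1'` encoding (the current name for `'binary'`) are available. The variable names are made up for illustration.

```javascript
// Four raw bytes, two of them outside the ASCII range.
const bytes = Buffer.from([0x00, 0x7f, 0x80, 0xff]);

// A "binary string": one character (one UTF-16 code unit) per byte.
const binaryString = bytes.toString('latin1');

// Converting back with the same encoding recovers the original bytes.
const roundTrip = Buffer.from(binaryString, 'latin1');

// Treating the same string as text and encoding it as UTF-8 changes the
// data: every byte >= 0x80 becomes a two-byte sequence.
const asUtf8 = Buffer.from(binaryString, 'utf8');

console.log(binaryString.length);     // 4
console.log(roundTrip.equals(bytes)); // true
console.log(asUtf8.length);           // 6
```

The last line is the reason binary data in strings is fragile: the moment something treats the string as text, the bytes no longer match.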

Let's look at the first item: big strings aren't your friend. Node.js creator ry tackled this issue himself in a performance comparison he made with nginx. If you view the pdf (or look at the extracted chart below), you'll see that node does a decent job keeping pace with nginx up until the 64K mark hits, then performance just falls apart. The reason, in ry's words:

V8 has a generational garbage collector [which] moves objects around randomly. Node can’t get a pointer to raw string data to write to socket.

You can see this in the relevant graph in Ryan's slides, which I've conveniently extracted and posted below (I hope you don't mind, Ryan).

 

[Figure: Ry-node, the chart extracted from Ryan's slides]

What wasn't immediately obvious to me after reading this was what it meant in cases where one is using node to pass around large bits of binary data that come in as strings. If you use node to, say, read from the file system, you get back a binary string, not a buffer. My question was: "If I have binary data already stuck in a lousy UTF-16 string, but then stick it in a buffer before sending it out, will that help with speed?" The answer: an increase in throughput from 100 MiB/sec to 160 MiB/sec.

 

Check out the graph below from my own performance tests, where I played with different readChunk sizes (how much data the FileReadStream reads at once) and buffer sizes (how much data we store in a buffer before flushing to a socket):

[Figure: Node_buf_performance, throughput for Buf vs. Str writes at different readChunk and buffer sizes]

As you can see, performance using buffers (Buf) beats the pants off writes using strings (Str). The difference between the two pieces of code can be seen below. I initially didn't think that doing this conversion would help at all; I figured once the data was already in a string (as data from a FileReadStream is), one may as well flush it to the socket and continue on. This makes me wonder if other apps would also be best off accumulating their output (perhaps even their UTF-8 output) in a Buffer where possible, then finally flushing the Buffer, instead of making repeated calls to res.write. Someone needs to test this. It also makes me wonder if my own test case could be improved further if the node FileReadStream object was modified to return a Buffer rather than a string.
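
A rough sketch of the accumulate-then-flush idea might look like this. It is not the code from the tests above: `createBufferedWriter`, `dest`, and `fakeSocket` are hypothetical names, and it assumes a modern Node.js with `Buffer.from` and `Buffer.concat`.

```javascript
// Hypothetical helper, not part of Node or node-paperboy: collects chunks
// in memory and hands them to `dest` as a single Buffer once at least
// `bufSize` bytes are pending.
function createBufferedWriter(dest, bufSize) {
  let pending = [];
  let pendingBytes = 0;

  function flush() {
    if (pendingBytes === 0) return;
    dest.write(Buffer.concat(pending, pendingBytes));
    pending = [];
    pendingBytes = 0;
  }

  return {
    // Strings are assumed to be binary strings (one byte per character).
    write(chunk) {
      const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk, 'latin1');
      pending.push(buf);
      pendingBytes += buf.length;
      if (pendingBytes >= bufSize) flush();
    },
    end() {
      flush();
      dest.end();
    },
  };
}

// Stand-in for a socket or HTTP response; it only records what it was given.
const writes = [];
const fakeSocket = {
  write(buf) { writes.push(buf); },
  end() {},
};

const writer = createBufferedWriter(fakeSocket, 8); // tiny bufSize for the demo
writer.write('abcde'); // 5 bytes pending, nothing sent yet
writer.write('fgh');   // 8 bytes pending, flushed as one Buffer
writer.write('ij');    // 2 bytes pending
writer.end();          // flushes the remainder

console.log(writes.map((b) => b.length)); // [ 8, 2 ]
```

In a real server `dest` would be the response or socket, and `bufSize` would be something like `64 * 1024` rather than 8.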

Additionally, you may be asking about using a larger bufSize than readChunk size. I did indeed test this, but found there was not much of a difference when using a larger buffer, so the optimal strategy really does seem to be reading a 64KiB chunk into a 64KiB buffer. You can see this data at the bottom of the post.

In the data I graphed above, I made a number of runs with `ab -c 100 -n 1000` against a 1 MiB file, varying the readChunk size and bufSize. Relevant sample code can be seen below. The full sample code is in my fork of node-paperboy.

 

The full performance data is available for download: node-performance.pdf (24 KB)

 

I (still) have a crush on Wolfram Alpha

I have to admit right now that I've got a crush on Wolfram Alpha. Sadly, the "Computational Knowledge Engine" was latched onto by the media and overhyped. It's not going to change the world, as some reports breathlessly prophesied, and I have a feeling that a lot of people ended up writing it off as a cute toy. But for day-to-day stuff I've found it generally awesome, like the Google calculator on steroids.

These are some real-world queries I use it for as a sysadmin/programmer:

1. Calculate the relative cost of hosting plans, even if one uses GiB/mo and the other uses Mbps @ the 95th percentile.
2. Compare the running time of algorithms using Big O notation.
3. Calculate when, say, 30 days ago was, or even 5 Sundays ago, or something simple like 2 weeks in seconds, or feed it a Unix time and get a bunch of useful info about it.
4. How many horsepower can you get out of 1 carrot per hour (ok, maybe I've never actually had to use this one, but pondering the fuel consumption of horses is entertaining).

Invalid Command \N and restoring PostgreSQL dumps

During a recent DB restore from Heroku onto my local machine I got a ton of "Invalid Command \N" errors after running `psql DBNAME -f DUMP.sql`.

It took a lot of googling but I finally found the answer.
You've got to drop and recreate the database like so:
dropdb DBNAME && createdb -T template0 DBNAME

Then you can continue on your way; everything should work fine.

Concurrency in Clojure

The following video is of language designer Rich Hickey giving a brief overview of the Clojure language, and highlighting its approach to non-painful concurrency. Hickey's a fascinating and concise speaker, who really gets into the meat of the concurrency problems in application programming today, showing how Clojure was designed from the ground up to simplify the process of concurrent programming.

Node and process.nextTick

Coding in Node, or any evented setting, often requires different ways of thinking. The primary difference is that you have to keep track of when the code you're writing will run. I've really just started writing code in Node, so below is a clarification of a couple of things I found initially confusing. As an example:
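
A minimal sketch of the kind of snippet in question (a stand-in, assuming only `process.nextTick`; the `say` helper and `order` array are hypothetical and exist just to record what ran when):

```javascript
const order = [];

// Records the word and prints it, so the execution order is visible.
function say(word) {
  order.push(word);
  console.log(word);
}

// Deferred: runs only after the currently executing code has finished.
process.nextTick(function () {
  say('HI');
});

// Runs immediately, so it is printed first.
say('BYE');
```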

This code will print out "BYE" before "HI". While initially counter-intuitive, you can actually leverage this to make more readable code. Below is an example from my fork of node-paperboy. As you can see, #deliver returns a delegate, which lets us set up our callbacks.

The interesting part about this is that after all our delegates are set up, there's no need to call a method to say we're done adding callbacks to the delegate and that it's free to run and deliver the file. Looking at the implementation of #deliver, we can get a little more information about how this works:

That's an incomplete portion of #deliver; a good chunk of it has been omitted for brevity. The most important part here is process.nextTick: everything within the anonymous function passed to nextTick gets deferred until the next tick of the event loop, much like `setTimeout(function() {}, 0);` though more efficiently. This lets us return our delegate once it has been set up, so the user can set the callbacks via method calls on the delegate object. In this example, the next tick occurs only after the anonymous function passed to http.createServer is done executing.

An important thing to remember is that you often don't need nextTick if the operations you're performing are already guaranteed to run on a later tick. Anything wrapped inside an async request like fs.stat or an http.Client request will end up running on a later tick. The only reason process.nextTick was explicitly required here was the synchronous check `if (fpErr) {...}`; the rest of the code runs wrapped inside of fs.stat, which is an async call.
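
To make the pattern concrete, here is a stripped-down sketch. It is not node-paperboy's actual code: this `deliver`, its `error` and `after` methods, and the empty-path check standing in for `if (fpErr) {...}` are all hypothetical.

```javascript
// Hypothetical, simplified delegate pattern: deliver() returns right away,
// and the real work is deferred so the caller can attach callbacks first.
function deliver(filePath) {
  const handlers = {};

  const delegate = {
    error(fn) { handlers.error = fn; return delegate; },
    after(fn) { handlers.after = fn; return delegate; },
  };

  process.nextTick(function () {
    // Synchronous check (stand-in for `if (fpErr) {...}`). Without
    // nextTick it would run before any callbacks were registered.
    if (!filePath) {
      if (handlers.error) handlers.error(new Error('bad path'));
      return;
    }
    // The real code would go on to async calls such as fs.stat here.
    if (handlers.after) handlers.after(filePath);
  });

  return delegate;
}

const events = [];

deliver('').error(function (err) {
  events.push('error: ' + err.message);
});

deliver('/index.html').after(function (path) {
  events.push('after: ' + path);
});

// Still in the same tick: neither callback has fired yet.
events.push('callbacks registered');
```

Once the current code finishes, the two deferred functions run in the order they were queued, so `events` ends up recording the registration first, then the error, then the successful delivery.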

Node events are in some ways similar to delegates in how they're defined. If you're interested, I recommend taking a look at the implementation of streamFile as an example of how these are used.

Coding with node can be twisted (pun intended), but if you need the benefits an evented framework provides and you work with it, not against it, it isn't half bad.

Learning Clojure

So, I'm on the road towards learning Clojure, via Stuart Halloway's Programming Clojure. It's an OK book, hardly the gem (no pun intended) that the pickaxe was. However, it is thorough, and is less disjointed than simply googling around for Clojure info and docs.

With all that in mind, Casting SPELs in Clojure is a welcome companion to Halloway's book (perhaps best as an introduction). While not nearly as comprehensive, it's more readable, and has a more fluid style. After I get through the basics of Clojure, I'm looking forward to finally learning about Monads in Clojure.