Processing big files in Node.js

Finding the right tool for the job

The problem of big files

Recently I had what seemed to be quite a simple task: regular file processing. The job required transforming each line of a file into a request to an external system. Each line was encoded as a JSON object, so no complicated parsing was involved.

Processing each line of a file seems like a simple problem to solve in Node.js. I thought that the following bit of code would take care of it:

const fs = require('fs');

fs.readFile('file.txt', 'utf8', (err, data) => {
  if (err) {
    throw err;
  }
  data.split('\n').forEach(line => {
    if (line.length === 0) {
      return; // skip the empty entry produced by a trailing newline
    }
    const obj = JSON.parse(line);
    /* ... */
  });
});

It works great unless you try to process a file which is too big (I wasn't able to pin down exactly where the limit lies or what causes it). In my case, it failed on a 930 MB JSONL file with the following exception:

buffer.js:378
    throw new Error('toString failed');
    ^

Error: toString failed
    at Buffer.toString (buffer.js:378:11)
    at Object.fs.readFileSync (fs.js:496:33)
    at Object.<anonymous> (test.js:2:17)
    at Module._compile (module.js:425:26)
    at Object.Module._extensions..js (module.js:432:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:313:12)
    at Function.Module.runMain (module.js:457:10)
    at startup (node.js:138:18)
    at node.js:974:3

wasn't too helpful. After some digging around I figured out that the problem happened because of a failed conversion from Node's Buffer object into a string. Node.js attempted that conversion because the readFile call specified an encoding, which meant it had to turn the whole content of the file into a single string, and for some reason that was failing. Memory couldn't be the issue, since my machine has plenty of it, although the Node.js process might be restricted from consuming all of it; a more likely culprit is V8's cap on the maximum length of a single string, which can be far smaller than a file of this size.
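
To make that concrete, here is a small illustrative sketch (the file name is just an example): reading without an encoding hands back a Buffer without any trouble, and it is the conversion of that whole Buffer into a single string, the same step readFile performs implicitly when an encoding is passed, that matches the Buffer.toString frame in the trace above.

const fs = require('fs');

fs.readFile('events.jsonl', (err, buffer) => {
  if (err) {
    throw err;
  }
  // The read itself succeeds; at this point we only hold a Buffer.
  console.log('read ' + buffer.length + ' bytes');
  // Converting the whole Buffer into one string is the step that blew up
  // when readFile did it implicitly on the 930 MB file.
  const data = buffer.toString('utf8');
  console.log('decoded ' + data.length + ' characters');
});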

That exception meant that I had to look into other ways of solving my problem.

I knew that I could read parts of the buffer returned by readFile, decode them as UTF-8, manually look for the newline markers and glue the partial lines back together. It's not a complicated piece of code, but I was sure I wasn't the first person who had to write it.
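
For reference, a minimal sketch of that manual approach could look roughly like this (I reuse the events.jsonl name from the later examples): it splits the raw Buffer on the newline byte and decodes each slice on its own, so no single string ever has to hold the whole file.

const fs = require('fs');

fs.readFile('events.jsonl', (err, buffer) => {
  if (err) {
    throw err;
  }
  let start = 0;
  let newline;
  // Find each newline byte and decode only the slice in front of it.
  while ((newline = buffer.indexOf('\n', start)) !== -1) {
    const line = buffer.toString('utf8', start, newline);
    if (line.length > 0) {
      const obj = JSON.parse(line);
      /* line processing */
    }
    start = newline + 1;
  }
  // Take care of a last line that isn't terminated by a newline.
  if (start < buffer.length) {
    const obj = JSON.parse(buffer.toString('utf8', start));
    /* line processing */
  }
});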

As it turns out, in Node there's more than one solution to that problem.

node-lines-adapter

The first one was a package called node-lines-adapter. It converts any stream of data into one which emits whole lines.

const fs = require('fs');
const lines = require('lines-adapter');

const stream = fs.createReadStream('events.jsonl');
lines(stream, 'utf8')
  .on('data', line => {
    const obj = JSON.parse(line);
    /* line processing */
  })
  .on('end', () => {
    /* cleanup */
  });

The code is concise and readable. I could finish there, but the GitHub page for node-lines-adapter mentions another package worth trying out, called node-lazy.

node-lazy

The following code looks very similar to the one using the previous package.

const fs = require('fs');
const lazy = require('lazy');

const stream = fs.createReadStream('events.jsonl');
lazy(stream)
  .lines
  .map(JSON.parse)
  .forEach(obj => {
    /* line processing */
  });

The advantage node-lazy has over node-lines-adapter is its rich set of methods for interacting with the stream of lines. While solving my problem, helpers like .skip() and .take() turned out to be very useful.
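
To give a rough idea, here is a hedged sketch of how those helpers can be combined with the pipeline above; the file name and the line counts are made up for illustration.

const fs = require('fs');
const lazy = require('lazy');

// Skip lines that were already handled and process the next batch only.
lazy(fs.createReadStream('events.jsonl'))
  .lines
  .skip(1000)
  .take(500)
  .map(JSON.parse)
  .forEach(obj => {
    /* line processing */
  });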

Summary

Processing a big file in Node.js is certainly doable, and there are ready-made solutions which can help with that task. Both tested libraries do what they advertise. My decision to choose node-lazy over node-lines-adapter was dictated by the richer set of tools it provides out of the box; that way I didn't have to build them myself.