Processing big files in Node.js
Finding the right tool for the job
The problem of big files
Recently I had what seemed to be quite a simple task: regular file processing. The job required transforming each line of a file into a request to an external system. Each line was encoded as a JSON object, so no complicated parsing was involved.
Processing each line of a file seems like a simple problem to solve in Node.js. I thought that the following bit of code would take care of it:
const fs = require('fs');
fs.readFile('file.txt', 'utf8', (err, data) => {
  if (err) {
    throw err;
  }
  data.split('\n').forEach(line => {
    const obj = JSON.parse(line);
    /* ... */
  });
});
It works great unless you are trying to process a file which is too big (exactly where the limit lies and what causes it, I wasn’t able to investigate). In my case, it failed on a 930MB JSONL file. The following exception:
buffer.js:378
throw new Error('toString failed');
^
Error: toString failed
at Buffer.toString (buffer.js:378:11)
at Object.fs.readFileSync (fs.js:496:33)
at Object.<anonymous> (test.js:2:17)
at Module._compile (module.js:425:26)
at Object.Module._extensions..js (module.js:432:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:313:12)
at Function.Module.runMain (module.js:457:10)
at startup (node.js:138:18)
at node.js:974:3
wasn’t too helpful. Digging around, I figured out that the problem happened because of a failed conversion from Node’s Buffer object into a string. Node.js attempted that conversion because the call to readFile specified an encoding, which meant it had to turn the whole content of the file into a single string. For some reason it was failing, most likely because the result would exceed the maximum string length V8 allows. Memory couldn’t have been the issue because my machine has plenty of it, although the Node.js process may be restricted from consuming all of it.
That exception meant that I had to look into other ways of solving my problem.
I knew that I could read parts of the buffer returned from readFile, decode them as UTF-8, then manually look for the newline markers and glue the pieces into lines. It’s not a complicated piece of code, but I was sure I wasn’t the first person who had to write it.
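A rough sketch of that manual approach could look like the following. This is only an illustration of the idea: the file name is a placeholder, and it assumes the whole file still fits into a single Buffer (it does here, since only the conversion to a string failed).

const fs = require('fs');
// Read the file without an encoding so we get a Buffer back, then find the
// newline bytes ourselves and decode each line individually.
fs.readFile('file.txt', (err, buffer) => {
  if (err) {
    throw err;
  }
  let start = 0;
  let end = buffer.indexOf(0x0a, start); // 0x0a is '\n'
  while (end !== -1) {
    const line = buffer.toString('utf8', start, end);
    if (line.length > 0) {
      const obj = JSON.parse(line);
      /* ... */
    }
    start = end + 1;
    end = buffer.indexOf(0x0a, start);
  }
  // Handle a trailing line that has no final newline.
  const rest = buffer.toString('utf8', start);
  if (rest.length > 0) {
    const obj = JSON.parse(rest);
    /* ... */
  }
});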
As it turns out, in Node there’s more than one solution to that problem.
node-lines-adapter
The first was a package called node-lines-adapter. It converts any stream of data into one which emits whole lines.
const fs = require('fs');
const lines = require('lines-adapter');
const stream = fs.createReadStream('events.jsonl');
lines(stream, 'utf8')
  .on('data', line => {
    const obj = JSON.parse(line);
    /* line processing */
  })
  .on('end', () => {
    /* cleanup */
  });
The code is concise and readable. I could have finished there, but the GitHub page for node-lines-adapter mentions another package to try out, called node-lazy.
node-lazy
The following code looks very similar to the one using the previous package.
const fs = require('fs');
const lazy = require('lazy');
const stream = fs.createReadStream('events.jsonl');
lazy(stream)
  .lines
  .map(JSON.parse)
  .forEach(obj => {
    /* line processing */
  });
The advantage node-lazy has over node-lines-adapter is its rich set of methods for interacting with the stream of lines. In solving my problem, things like .skip() and .take() turned out to be very useful.
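For example, something along these lines (a sketch based on how I understand the node-lazy API; the line counts are made up purely for illustration) makes it easy to process just a slice of the file:

const fs = require('fs');
const lazy = require('lazy');
const stream = fs.createReadStream('events.jsonl');
lazy(stream)
  .lines
  .skip(1000)   // ignore lines that have already been handled
  .take(500)    // then stop after the next 500 lines
  .map(JSON.parse)
  .forEach(obj => {
    /* line processing */
  });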
Summary
Processing a big file in Node.js is certainly doable, and there are ready-made solutions which can help with the task. Both libraries I tested do what they advertise. My decision to choose node-lazy over node-lines-adapter was dictated by the richer set of tools it provides out of the box; that way I didn’t have to build them myself.