Until this summer, I'd never written a piece of code where file IO was the performance-limiting factor. It's tempting to dismiss file read/write times as insignificant, but that notion certainly doesn't survive contact with terabyte-sized datasets. It got to the point where I had to plan ahead just to avoid unnecessary file copies. So I started digging, asking myself: how can I make this faster?

The single most important thing to do is buffer. Buffering is essentially reading and writing in bulk. It's far more efficient to read a large file 64 kilobytes at a time than a character at a time. Most high-level languages handle this for you more or less transparently, but at a low level it's hard to beat the performance of properly buffered reads and writes through byte buffers in a language like C/C++ or Java. The performance gain of buffering comes mainly from OS optimizations that kick in when you read or write large chunks of data sequentially from disk.
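To see what "transparent" buffering looks like in Java, here's a minimal sketch: wrapping a raw FileInputStream in a BufferedInputStream lets you keep a byte-at-a-time reading style while the stream reads ahead from disk in larger chunks. The temp file below is just a stand-in for a real dataset.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedReadDemo {
    public static void main(String[] args) throws IOException {
        // create a small demo file (stand-in for a real dataset)
        Path path = Files.createTempFile("demo", ".bin");
        Files.write(path, new byte[1024 * 1024]); // 1 MB of zeros

        long count = 0;
        // BufferedInputStream reads ahead in large chunks, so most read()
        // calls hit the in-memory buffer instead of going to the disk
        try (InputStream in = new BufferedInputStream(new FileInputStream(path.toFile()))) {
            while (in.read() != -1) {
                count++;
            }
        }
        System.out.println(count); // prints 1048576
        Files.delete(path);
    }
}
```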

Bulk Read Performance Test

You can look up the details of why it works, but before proceeding, let’s prove the performance difference is real. We’ll do this by comparing the performance difference between reading a file 1 byte vs. 64 kilobytes at a time, using the relatively low-level Java class FileInputStream. For our tests, we’ll use the file dictionary.txt (download here), which is about 3 megabytes in total.

public static void main(String[] args) throws IOException {
    File dictionaryFile = new File("dictionary.txt");
    InputStream dictionaryStream = new FileInputStream(dictionaryFile);
    // read the file byte by byte
    while ((dictionaryStream.read()) != -1);
    dictionaryStream.close();
}

And the results? A whopping 5.9 seconds to read a simple 3M file…

$ time java SingleByteTest
java SingleByteTest 2.72s user 3.13s system 98% cpu 5.916 total

So let’s be a bit smarter about this, and read more than a byte at a time.

public static void main(String[] args) throws IOException {
    int BUFFER_SIZE = 1024 * 64;
    byte[] buffer = new byte[BUFFER_SIZE];
    File dictionaryFile = new File("dictionary.txt");
    InputStream dictionaryStream = new FileInputStream(dictionaryFile);
    // read the file BUFFER_SIZE bytes at a time
    while ((dictionaryStream.read(buffer)) != -1);
    dictionaryStream.close();
}

This one change brings the runtime down to under 0.2 seconds, which I think qualifies as an improvement.

$ time java BulkByteTest
java BulkByteTest 0.11s user 0.06s system 95% cpu 0.174 total

Idea Behind Parallelizing

Buffering is clearly the most important thing we can do; nothing else we do hereafter will bring that kind of speedup. However, a lot of programs that handle big files follow a similar design: read in the relevant portion of the file, process that data, and write the processed data to another file for future use. We'll assume the process is IO bound, so parallelizing the IO makes sense. If the processing is where all the time goes and it isn't sequentially stateful (you don't need to have processed all the previous contents of the file to process the next chunk), you can instead chunk the file or memory map it and process each chunk in parallel, but that's for another discussion.
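For the curious, that chunk-in-parallel idea can be sketched roughly like this. Everything here is illustrative: the temp file stands in for a real dataset, and the "processing" is just summing bytes, chosen because it's stateless per chunk. Each chunk is memory-mapped with FileChannel.map and processed on a parallel stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.stream.IntStream;

public class ChunkedSum {
    public static void main(String[] args) throws IOException {
        // demo input: 1 MB of 1-bytes (stand-in for a real dataset)
        Path path = Files.createTempFile("chunks", ".bin");
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 1);
        Files.write(path, data);

        int chunkSize = 64 * 1024;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            int chunks = (int) ((size + chunkSize - 1) / chunkSize);
            // each chunk is mapped and processed independently, in parallel;
            // this only works because the processing is not sequentially stateful
            long total = IntStream.range(0, chunks).parallel().mapToLong(i -> {
                try {
                    long offset = (long) i * chunkSize;
                    long len = Math.min(chunkSize, size - offset);
                    MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, offset, len);
                    long sum = 0;
                    while (map.hasRemaining()) {
                        sum += map.get();
                    }
                    return sum;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).sum();
            System.out.println(total); // prints 1048576
        }
        Files.delete(path);
    }
}
```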

In this read, process, and write design, notice that the read portion happens independently of the processing portion, which happens independently of the writing portion. So essentially, we want to create independent thread pools for each portion. Assuming the processing is sequentially stateful, we can have only one thread reading, processing, or writing at a time, but there's nothing to say, for instance, that we can't be reading and writing at the same time! Let's see how we can implement this.

Implementation

Why don’t we dive straight into some code!

byte[] buffer, otherBuffer;
Future<?> process, otherProcess;
ExecutorService executor; // single-threaded, so writes stay in order
InputStream istream;
OutputStream ostream;

public static int read(InputStream istream, byte[] buffer) throws IOException {
    return istream.read(buffer);
}

public static void handle(byte[] buffer, int length) {
    // process the first `length` bytes of buffer in place
}

public static void write(OutputStream ostream, byte[] buffer, int length) throws IOException {
    ostream.write(buffer, 0, length);
}

// Actually handling the reading, processing, and writing
public void act() throws IOException, InterruptedException, ExecutionException {
    int bytesRead;
    while ((bytesRead = read(istream, buffer)) != -1) {
        handle(buffer, bytesRead);
        final byte[] processBuffer = buffer;
        final int processLength = bytesRead;
        process = executor.submit(() -> {
            write(ostream, processBuffer, processLength);
            return null;
        });
        // wait for the previous write to finish before reusing its buffer
        if (otherProcess != null) {
            otherProcess.get();
        }
        otherProcess = process;
        process = null;
        // swap buffers so the next read doesn't clobber the one being written
        byte[] tmp = buffer;
        buffer = otherBuffer;
        otherBuffer = tmp;
    }
    // flush the final pending write
    if (otherProcess != null) {
        otherProcess.get();
        otherProcess = null;
    }
}

So what's going on here? Futures are used to allow reads and writes to go on at the same time. Essentially, the main thread handles reading the InputStream into buffer and processing the data. Afterwards, a Future is submitted to write the buffer, which executes on another thread while the main thread continues. We keep only two byte buffers to avoid unnecessary allocation; that's all we need, since we never have multiple threads reading at the same time or multiple threads writing at the same time. However, note that one won't suffice: we do need two buffers, so that the read thread can fill one buffer while the write thread is draining the other.
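To show how the pieces fit together, here's a self-contained version of the same double-buffering scheme wired up end to end. The file names, sizes, and the uppercase "processing" step are all placeholders; the single-threaded executor matters, since it's what keeps the writes in order.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        // demo input: 1 MB of 'a' bytes (stand-in for a real dataset)
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 'a');
        Files.write(in, data);

        final int BUFFER_SIZE = 64 * 1024;
        byte[] buffer = new byte[BUFFER_SIZE];
        byte[] otherBuffer = new byte[BUFFER_SIZE];
        // one writer thread keeps the writes sequential and in order
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> pendingWrite = null;

        try (InputStream istream = new FileInputStream(in.toFile());
             OutputStream ostream = new FileOutputStream(out.toFile())) {
            int n;
            while ((n = istream.read(buffer)) != -1) {
                // "process" in place: uppercase every byte (placeholder work)
                for (int i = 0; i < n; i++) {
                    buffer[i] = (byte) Character.toUpperCase(buffer[i]);
                }
                final byte[] writeBuffer = buffer;
                final int writeLen = n;
                // wait until the previous write has released the other buffer
                if (pendingWrite != null) pendingWrite.get();
                pendingWrite = executor.submit(() -> {
                    ostream.write(writeBuffer, 0, writeLen);
                    return null;
                });
                // swap so the next read fills the buffer that isn't being written
                byte[] tmp = buffer;
                buffer = otherBuffer;
                otherBuffer = tmp;
            }
            if (pendingWrite != null) pendingWrite.get();
        }
        executor.shutdown();
        System.out.println(Files.size(out)); // prints 1048576
        Files.delete(in);
        Files.delete(out);
    }
}
```

Note that waiting on the previous write before submitting the next one guarantees the two buffers are never in use by both threads at once, regardless of how fast the reader runs.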

