Until this summer, I’d never written a piece of code where file IO was the performance-limiting factor. It’s often tempting to call file read-write times insignificant, but that notion certainly did not carry over to terabyte-sized datasets. It got to the point that I had to plan ahead just to avoid unnecessary file copies. So I quickly started looking for answers, asking myself: how can I make this faster?
The single most important thing to do is buffer. Buffering is essentially reading and writing in bulk. It’s far more efficient to read a large file 64 kilobytes at a time than a character at a time. Most high-level languages somewhat transparently handle that for you, but of course at a low level it’s hard to beat the performance of properly buffered reads and writes into and out of byte buffers in a language like C/C++ or Java. The performance gain of buffering is mainly due to OS optimizations when you read/write large chunks of data sequentially from disk.
Bulk Read Performance Test
You can look up the details of why it works, but before proceeding, let’s prove the performance difference is real. We’ll do this by comparing the performance of reading a file 1 byte at a time vs. 64 kilobytes at a time, using the relatively low-level Java class FileInputStream. For our tests, we’ll use the file dictionary.txt (download here), which is about 3 megabytes in total.
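The byte-at-a-time version looks something like the following sketch (the class and method names here are my own, and the timing harness is just System.nanoTime around the read loop; dictionary.txt is assumed to sit in the working directory):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class SlowRead {
    // Read the file one byte at a time and return the total byte count.
    static long readOneByteAtATime(String path) throws IOException {
        long count = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            // Each call to read() fetches a single byte.
            while (in.read() != -1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long bytes = readOneByteAtATime("dictionary.txt");
        System.out.printf("Read %d bytes in %.2f s%n",
                bytes, (System.nanoTime() - start) / 1e9);
    }
}
```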
And the results? A whopping 5.9 seconds to read a simple 3M file…
So let’s be a bit smarter about this, and read more than a byte at a time.
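A sketch of the bulk version, reading into a 64 KB byte array instead of one byte per call (again, the names here are my own):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class BulkRead {
    // Read the file 64 KB at a time and return the total byte count.
    static long readInChunks(String path) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long count = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            // read(buf) fills as much of the buffer as it can
            // and returns the number of bytes read.
            while ((n = in.read(buf)) != -1) {
                count += n;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long bytes = readInChunks("dictionary.txt");
        System.out.printf("Read %d bytes in %.2f s%n",
                bytes, (System.nanoTime() - start) / 1e9);
    }
}
```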
This one change brings the runtime down to 0.2 seconds — I guess you can call that an improvement.
Idea Behind Parallelizing
Obviously buffering is the most important thing we can do—nothing else we do hereafter will bring about that type of speed improvement. However, a lot of programs that have to handle big files tend to follow a similar design: read in the relevant portion of the file, process this data, and write the processed data to another file for future use. We’ll assume that the process is IO bound, so parallelizing makes sense. If the processing is where all the time is being eaten up and it’s not sequentially stateful (you don’t need to process all the previous contents of the file to process the next chunk), you can chunk the file or memory map it and process each chunk in parallel, but that’s for another discussion.
In this read, process, and write design, notice that the read portion happens independently of the processing portion, which happens independently of the writing portion. So essentially, we want to create independent thread pools for each portion. Assuming the processing is sequentially stateful, we can have only one thread reading, one processing, and one writing at a time, but there’s nothing to say, for instance, that we can’t be reading and writing at the same time! Let’s see how we can implement this.
Why don’t we dive straight into some code!
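Here’s one way the idea can be sketched, using a single-threaded ExecutorService for writes and two alternating buffers. All the names are my own, and the process step (uppercasing ASCII letters) is just a stand-in for whatever real work your program does:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelinedCopy {
    static final int BUF_SIZE = 64 * 1024;

    // Stand-in processing step: uppercase ASCII letters in place.
    static void process(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (buf[i] >= 'a' && buf[i] <= 'z') buf[i] -= 32;
        }
    }

    static void pipe(String inPath, String outPath) throws Exception {
        // Single-threaded executor: writes happen off the main thread,
        // one at a time, in submission order.
        ExecutorService writer = Executors.newSingleThreadExecutor();
        // Two buffers: read into one while the other is being written.
        byte[][] bufs = { new byte[BUF_SIZE], new byte[BUF_SIZE] };
        Future<?>[] pending = new Future<?>[2];
        int cur = 0;
        try (FileInputStream in = new FileInputStream(inPath);
             FileOutputStream out = new FileOutputStream(outPath)) {
            while (true) {
                // Don't reuse a buffer until its last write has finished.
                if (pending[cur] != null) pending[cur].get();
                int n = in.read(bufs[cur]);
                if (n == -1) break;
                process(bufs[cur], n); // main thread reads and processes
                final byte[] buf = bufs[cur];
                final int len = n;
                // Hand the buffer off to the writer thread via a Future.
                pending[cur] = writer.submit(() -> {
                    try {
                        out.write(buf, 0, len);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                cur = 1 - cur; // swap buffers
            }
            // Wait for any in-flight writes before closing the streams.
            for (Future<?> f : pending) {
                if (f != null) f.get();
            }
        } finally {
            writer.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        pipe("input.txt", "output.txt"); // file names are placeholders
    }
}
```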
So what’s going on here? Futures are used to allow reads and writes to go on at the same time. Essentially, the main thread handles reading the buffer and processing the data. Afterwards, a Future is created to write the buffer, which executes on another thread while the main thread continues. We keep only two byte buffers to avoid unnecessary allocation; that’s all we need, since we never have multiple threads reading at once or multiple threads writing at once. Note, however, that one buffer won’t suffice: we need two so that the reader can fill one buffer while the writer is still draining the other.
Aug 28, 2018