Fast random access of elements in a large file


Discussion Overview

The discussion centers on strategies for efficiently accessing random elements from a large data file (12GB in size with dimensions 1024x768x16384). Participants explore various methods to reduce access time, considering both hardware and software approaches.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant suggests using a computer with sufficient RAM to load the entire file, or utilizing cloud services for temporary access to more RAM.
  • Another participant recommends using an SSD to reduce seek time, arguing that this would significantly speed up access compared to traditional hard drives.
  • A different approach proposed involves duplicating the file in a format that allows for more contiguous access, potentially reducing the number of seeks required during data retrieval.
  • Some participants discuss the possibility of optimizing the file format based on access patterns to improve performance.
  • One participant introduces the idea of using threading and double buffering to manage I/O and processing more efficiently, detailing a multi-threaded approach with specific roles for reading, realigning, and processing data.

Areas of Agreement / Disagreement

Participants express various strategies and ideas, but there is no consensus on a single best approach. Multiple competing views on hardware and software optimizations remain unresolved.

Contextual Notes

Participants acknowledge limitations related to file fragmentation and the need for potentially significant changes to file formats to achieve optimal access speeds. The discussion also highlights the dependency on specific hardware configurations and access patterns.

DrFurious
Hello all,
I have a question concerning the best way to access random elements in a large file. I have a single data file of numbers, 12G in size, with dimensions 1024x768x16384. My goal is to load, say, 16384 of those numbers into memory as quickly as possible. Assume these numbers are not aligned on the disk.

What is the fastest way to do this? I've tried HDF5, lseek-ing with my own program in C, etc., but the best I can do is about 3 minutes for that amount. Does anyone have ideas on the fastest approach?
 
I'm not an expert on this, but here are a few ideas:

1.) Use a computer with enough RAM to load the whole file. If this is a one-off problem that doesn't justify adding a lot more RAM to your machine, or if your computer can't accommodate ~16 GB or more, Amazon's cloud services rent such machines pretty cheaply.

2.) Use an SSD. Most of the time spent is almost certainly in seek time, which will be cut dramatically with an SSD, and an SSD will make your system faster for other purposes, too.

3.) Duplicate the file with the format changed to make your accesses contiguous on the disk - say, a 16384x1024x768 format. With limited RAM, assuming this is a stack of 2^14 images, you can cut down on the time needed by reading a contiguous strip from each image in the stack (likely n lines of 1024 pixels), with the strips as large as possible while still fitting 2^14 of them in RAM; reorder them, write them to the new file, and then start on another batch of strips. This costs ~1 seek per strip rather than 1 per pixel, and should be much faster than a pixel-stack-by-pixel-stack approach to reordering, assuming the original file isn't hopelessly fragmented.

If I'm misunderstanding you, and you want truly random accesses, I don't think there is any way of efficiently doing that for that size problem on a standard spinning hard drive. Caching the data in a faster storage medium would be needed to get a speedup. If the accesses are not random, then the file format needs to be adapted to the pattern of the accesses to make the accesses contiguous on the disk.
 
The way I understand it, you have 2 options for optimizing:
1. Either process and store the file into a format to account for any patterns in your data and access this new optimized file.
OR
2. Read from storage with better random access than you currently have (RAID0/SSD/RAM). :)

Edit: Heh, basically what EWH said. :)
 
I suggest using threads and double buffering (or even multi-buffering) if you are doing extensive computations on those data once they are in memory and if there is any way to predict what the next read will be. From your brief description of the problem, I see several threads:
  • A reader thread.
    This highly I/O bound thread reads data from the file into a raw data buffer. While the thread is reading it has a write lock on the raw data buffer into which it is reading.
  • A realignment thread.
    This highly CPU-bound (but fast) thread reads from one of the raw data buffers (you'll have two) and writes the data to a structure in which they are properly aligned. The realignment thread holds a read lock on the raw data buffer and a write lock on the aligned buffer. Note that you can get rid of this thread if the data come out properly aligned simply because the reader reads into a buffer aligned on a double (or long double) boundary. You will need the realignment thread only if there is no escaping this reorganization; alternatively, you could combine it with the I/O thread.
  • A processing thread.
    This thread reads from the aligned buffer and does the heavy CPU lifting on the data in that buffer. While reading from the aligned buffer this thread has a read lock on that buffer. This thread is CPU-bound, and it potentially can be very involved. The payoff is greatest when the processing thread is doing a whole lot of processing.
  • A master thread.
    This thread controls the other three via condition variables that cause the other threads to sleep or act.
    • Case 1: The processing thread needs to operate on a record that isn't either one of your already-read buffers.
      This will happen on the initial read and whenever the prediction mechanism goes awry. The master thread needs to put the processing thread to sleep, for example, via a condition variable, until the requested buffer is ready to go. It needs to do the same to the realignment thread. It then tells the reader thread to read the requested raw data buffer. When the reader thread has finished reading, the realignment thread can go to work on that just-read raw data buffer. Meanwhile, the reader thread starts reading the next chunk of the input file into the alternate raw data buffer. The processing thread can wake up once the realignment thread finishes its work, and the realignment thread can work on the next raw data buffer once the reader finishes reading the next record.

      Note: If the reader I/O thread is not idle when you reach this state, it might be best to signal that reader to stop in its tracks. It is reading a chunk of data that will need to be immediately tossed.
    • Case 2: The processing thread needs to operate on the current realigned buffer.
      This is a no-op as far as the other threads are concerned. They just continue doing what they were doing (or just continue sleeping if they are doing nothing).
    • Case 3: The processing thread needs to operate on the alternate realigned buffer.
      That buffer may be ready to go. If not, the processing thread will have to sleep for a bit. (The actions should already be in place). Once the processing thread is ready to go, the reader thread can be set in motion to read the next predicted record, and then the realignment thread can be set in motion to operate on that.

The potential payoff here is immense, particularly so if the processing thread is doing a whole lot of processing. This partitioning into I/O-bound and CPU-bound threads is very common, and can at times make the I/O appear invisible. If you are doing an immense amount of processing, you can have multiple processing threads running in parallel rather than just one. This enables your application to make use of all the cores in your CPU instead of just one. (But this will probably call for a multi-buffering solution rather than just double buffering.)
 
