Fast random access of elements in a large file


Discussion Overview

The discussion centers on strategies for efficiently accessing random elements from a large data file (12GB in size with dimensions 1024x768x16384). Participants explore various methods to reduce access time, considering both hardware and software approaches.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant suggests using a computer with sufficient RAM to load the entire file, or utilizing cloud services for temporary access to more RAM.
  • Another participant recommends using an SSD to reduce seek time, arguing that this would significantly speed up access compared to traditional hard drives.
  • A different approach proposed involves duplicating the file in a format that allows for more contiguous access, potentially reducing the number of seeks required during data retrieval.
  • Some participants discuss the possibility of optimizing the file format based on access patterns to improve performance.
  • One participant introduces the idea of using threading and double buffering to manage I/O and processing more efficiently, detailing a multi-threaded approach with specific roles for reading, realigning, and processing data.

Areas of Agreement / Disagreement

Participants express various strategies and ideas, but there is no consensus on a single best approach. Multiple competing views on hardware and software optimizations remain unresolved.

Contextual Notes

Participants acknowledge limitations related to file fragmentation and the need for potentially significant changes to file formats to achieve optimal access speeds. The discussion also highlights the dependency on specific hardware configurations and access patterns.

DrFurious
Hello all,
I have a question concerning the best way to access random elements in a large file. I have a single data file of numbers, 12G in size, with dimensions 1024x768x16384. My goal is to load, say, 16384 of those numbers into memory as quickly as possible. Assume these numbers are not aligned on the disk.

What is the fastest way to do this? I've tried HDF5, lseek-ing with my own program in C, etc., but the best I can do is about 3 minutes for that amount. Does anyone have ideas on the fastest approach?
 
I'm not an expert on this, but here are a few ideas:

1.) Use a computer with enough RAM to load the whole file. If this is a one-off problem that doesn't justify adding a lot more RAM to your machine, or if your computer can't accommodate ~16 GB or more, Amazon's cloud services rent such machines pretty cheaply.

2.) Use an SSD. Most of the time spent is almost certainly in seek time, which will be cut dramatically with an SSD, and an SSD will make your system faster for other purposes, too.

3.) Duplicate the file with the format changed to make your accesses contiguous on the disk - say, a 16384x1024x768 format. With limited RAM, assuming this is a stack of 2^14 images, you can cut down on the time needed by reading a contiguous strip from each image in the stack (likely n lines of 1024 pixels), with the strips as large as possible while still fitting 2^14 of them in RAM; reorder them, write them to the new file, and then start on another batch of strips. This costs ~1 seek per strip rather than 1 per pixel, and should be much faster than a pixel-stack-by-pixel-stack approach to reordering, assuming the original file isn't hopelessly fragmented.

If I'm misunderstanding you, and you want truly random accesses, I don't think there is any way of efficiently doing that for that size problem on a standard spinning hard drive. Caching the data in a faster storage medium would be needed to get a speedup. If the accesses are not random, then the file format needs to be adapted to the pattern of the accesses to make the accesses contiguous on the disk.
 
The way I understand it, you have 2 options for optimizing:
1. Either process and store the file into a format to account for any patterns in your data and access this new optimized file.
OR
2. Read from storage with better random access than you currently have (RAID0/SSD/RAM). :)

Edit: Heh, basically what EWH said. :)
 
I suggest using threads and double buffering (or even multi-buffering) if you are doing extensive computations on those data once they are in memory and if there is any way to predict what the next read will be. From your brief description of the problem, I see several threads:
  • A reader thread.
    This highly I/O bound thread reads data from the file into a raw data buffer. While the thread is reading it has a write lock on the raw data buffer into which it is reading.
  • A realignment thread.
    This highly CPU-bound (but fast) thread reads from one of the raw data buffers (you'll have two) and writes the data to a structure in which they are properly aligned. The realignment thread holds a read lock on the raw data buffer and a write lock on the aligned buffer. Note that you can get rid of this thread if the data come out properly aligned simply because the reader reads into a buffer aligned on a double (or long double) boundary. You will need the realignment thread only if there is no escaping this reorganization; alternatively, you could combine it with the I/O thread.
  • A processing thread.
    This thread reads from the aligned buffer and does the heavy CPU lifting on the data in that buffer. While reading from the aligned buffer this thread has a read lock on that buffer. This thread is CPU-bound, and it potentially can be very involved. The payoff is greatest when the processing thread is doing a whole lot of processing.
  • A master thread.
    This thread controls the other three via condition variables that cause the other threads to sleep or act.
    • Case 1: The processing thread needs to operate on a record that isn't either one of your already-read buffers.
      This will happen on the initial read and whenever the prediction mechanism goes awry. The master thread needs to put the processing thread to sleep, for example, via a condition variable, until the requested buffer is ready to go. It needs to do the same to the realignment thread. It then tells the reader thread to read the requested raw data buffer. When the reader thread has finished reading, the realignment thread can go to work on that just-read raw data buffer. Meanwhile, the reader thread starts reading the next chunk of the input file into the alternate raw data buffer. The processing thread can wake up once the realignment thread finishes its work, and the realignment thread can work on the next raw data buffer once the reader finishes reading the next record.

      Note: If the reader I/O thread is not idle when you reach this state, it might be best to signal that reader to stop in its tracks. It is reading a chunk of data that will need to be immediately tossed.
    • Case 2: The processing thread needs to operate on the current realigned buffer.
      This is a no-op as far as the other threads are concerned. They just continue doing what they were doing (or just continue sleeping if they are doing nothing).
    • Case 3: The processing thread needs to operate on the alternate realigned buffer.
      That buffer may be ready to go. If not, the processing thread will have to sleep for a bit. (The actions should already be in place). Once the processing thread is ready to go, the reader thread can be set in motion to read the next predicted record, and then the realignment thread can be set in motion to operate on that.

The potential payoff here is immense, particularly so if the processing thread is doing a whole lot of processing. This partitioning into I/O-bound and CPU-bound threads is very common, and can at times make the I/O appear invisible. If you are doing an immense amount of processing, you can have multiple processing threads running in parallel rather than just one. This enables your application to make use of all the cores in your CPU instead of just one. (But this will probably call for a multi-buffering solution rather than just double buffering.)
 
