Managing a Huge Array on a Unix/Linux System

  • Thread starter Thread starter Hurkyl
  • Start date Start date
  • Tags Tags
    Array System
Click For Summary

Discussion Overview

The discussion revolves around managing a very large array in a C++ program on a Unix/Linux system, specifically addressing the challenges and strategies for handling data sizes exceeding 2^32 bytes. Participants explore various approaches, including memory management techniques and optimizations.

Discussion Character

  • Technical explanation
  • Exploratory
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses the need to manage an array larger than 2^32 bytes and inquires about options beyond writing custom swap file management code.
  • Another participant suggests considering a sparse array if the data contains many zeroes, although the original poster indicates the entire array will be filled.
  • A participant proposes creating a large partition for swap space and mentions the potential need to recompile the Linux kernel for 64-bit pointers to handle more than 4 GB of virtual memory.
  • Memory mapping (using mmap) is recommended as a solution to access large arrays without needing extensive special code, with a suggestion to map smaller segments of the array at a time to avoid complications with pointer sizes.
  • One participant requests resources for learning about memory mapping functions, indicating a lack of familiarity with them.
  • A code snippet is provided to illustrate how to implement a C++ class that manages memory mapping for segments of the array, including operator overloading for array access.
  • Participants discuss optimizations for frequent array accesses and the choice between MAP_SHARED and MAP_PRIVATE based on whether changes need to be saved to disk.
  • Questions arise about whether mmap will resize the file automatically and what cleanup is necessary after using mmap.
  • Concerns are raised about the time required to process such a large array, with suggestions to identify and avoid irrelevant parts of the array to improve efficiency.
  • One participant shares a performance observation regarding a loop running time, highlighting the potential inefficiency of processing large arrays.

Areas of Agreement / Disagreement

Participants express multiple competing views on the best approach to manage large arrays, with no consensus reached on a single solution. There is agreement on the importance of optimizing array processing, but specific strategies vary.

Contextual Notes

Participants mention potential limitations related to kernel recompilation for 64-bit support and the need for careful management of memory mapping and file descriptors. The discussion also highlights the complexity of handling large data sets efficiently.

Who May Find This Useful

This discussion may be useful for software developers working with large data sets in C++ on Unix/Linux systems, particularly those interested in memory management techniques and performance optimization strategies.

Hurkyl
Staff Emeritus
Science Advisor
Gold Member
Messages
14,922
Reaction score
28
My array is big. No, bigger than that. It's big!

I'm writing a C++ program on a unix box (or is it linux? I never notice the difference), and I want to operate on a very large amount of data: more than 2^32 bytes.

Do I have any options aside from writing my own code to manage a swap file on disk?
 
Computer science news on Phys.org
Do you actually have 2^32 non-zero bytes? Or is most of your array zeroes? If so, you might want to implement (or use an existing) sparse array.

- Warren
 
The whole thing does get filled. I -might- be able to work with only part of the array at a time, and I -might- be able to identify large portions of the array as irrelevant, but I want to have an approach ready if I'm unable to manage either of these.

(And, of course, there's raw, unadultered curiosity that needs satisfied! :smile:)
 
Well, the easiest solution from an application-code perspective is to make a 4+ GB partition on your disk and use it for swap. Then you should be able to call malloc() and have it actually allocate 4GB for you in one block. There's a good chance you'll have to recompile your linux kernel to handle >4 GB of virtual memory address space, since it'll require 64-bit pointers. I think it's possible, but it might get hairy.

You might also want to consider memory-mapping your 4 GB data file (mmap() in sys/mman.h). This will still require 4+ GB of virtual address space, and still might get hairy.

Either of those two solutions would allow you to access elements of your huge array with nary a line of special code.

But... don't do any of that.

The easiest solution is to memory map *part* of the file at a time -- say, 1 GB or 2 GB of it at once, and switch from one segment to the other when necessary. That way you won't have to worry about using 64-bit pointers or any of that hairy stuff I mentioned in the first two paragraphs of this post. Memory-mapping doesn't actually read the file's contents into memory or anything, so there's not much of a performance hit in switching from one part of the file to another. You will, of course, have to abstract your array to automatically map the correct part of the file each time it is accessed.

- Warren
 
All right. I've never used any of the memory mapping functions... I've peeked at the man pages, but they weren't too enlightening last time I looked. Do you know of a link (or can you give yourself) a brief intro to using them?
 
http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

That's mmap() POSIX manpage.

If I were you, I would make a C++ class that overloads the [] operator to automatically check which segment is currently mapped, and map a new segment if it's not the right one. The operative code might be something like

Code:
class HugeArray {
 public:
   // segmentSize is in number of slots, NOT bytes
   HugeArray(int fileDescriptor, int segmentSize) {
     this->segmentSize = segmentSize;
     this->fileDescriptor = fileDescriptor;
     currentSegment = -1;
     segment = 0;
   }

   ~HugeArray() {
      delete segment;
   }

   int& operator[] (unsigned int i);
 
 private:
   int segmentSize;
   void* segment;
   int currentSegment;
   int fileDescriptor;
 };

int& HugeArray::operator[](unsigned int i) {
  int neededSegment = (int) (i / segmentSize);
  if (neededSegment != currentSegment) {
    delete segment;
    segment = mmap(0, segmentSize * sizeof(int), PROT_READ | PROT_WRITE, MAP_PRIVATE, fileDescriptor, neededSegment * segmentSize * sizeof(int));
    currentSegment = neededSegment;
  }
  return ((int*) segment)[i - (currentSegment * segmentSize)];
}

I'm not on a 'nix box where I can check this for syntax and functionality right now, but I will be glad to help you later if you need more help.

- Warren
 
Last edited:
BTW, if you're doing a whoooole lot of array accesses, there are a few optimizations you can make here and there to this code... though a good compiler may already do them for you.

- Warren
 
Oh, and you should probably use MAP_SHARED instead of MAP_PRIVATE if you're going to be storing changes to the array back to disk.

- Warren
 
I'm generating the array in code rather than reading it from disk... will mmap resize the file as necessary, or will I need to make it myself?


My program will need the array until it's finished, but for future reference, is there any additional cleanup code that would be necessary, or is deleting the mapped memory and closing the FD sufficient?
 
  • #10
I'm pretty sure that mmap will resize the file as needed. I'm also pretty sure you don't need to do any special cleanup beyond closing the file descriptor properly. Don't quote me on that, though -- you might want to run a couple of tests first.

- Warren
 
  • #11
Hurkyl, you should seriously consider identifying irrelevant parts of that array. Even the most trivial processing (say, find the sum modulo 2^32) of an array with the size 2^32 will take a LOT of time.
 
  • #12
rgrig is right. Secondly, with disk accesses it will be even slower.
 
  • #13
I might have exagerated a bit. The code:
Code:
unsigned i;
for (i = 0; i != unsigned(-1); ++i);
runs in about 4.5 seconds on a Pentium 4 at 2.66GHz, when compiled with gcc version 3.3 with the flag -O2 (i.e. optimizations turned on).
 
  • #14
Even so, the first thing anyone should ever do when facing an array of this size is to find ways to avoid processing the entire array. :smile:

- Warren
 

Similar threads

  • · Replies 10 ·
Replies
10
Views
6K
Replies
7
Views
3K
Replies
10
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 10 ·
Replies
10
Views
2K
  • · Replies 9 ·
Replies
9
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
Replies
38
Views
5K
  • · Replies 37 ·
2
Replies
37
Views
7K
Replies
60
Views
11K