Managing a Huge Array on a Unix/Linux System

Hurkyl · Sep 8, 2004

My array is big. No, bigger than that. It's big!

I'm writing a C++ program on a unix box (or is it linux? I never notice the difference), and I want to operate on a very large amount of data: more than 2^32 bytes.

Do I have any options aside from writing my own code to manage a swap file on disk?

chroot · Sep 8, 2004

Do you actually have 2^32 non-zero bytes? Or is most of your array zeroes? If so, you might want to implement (or use an existing) sparse array.

- Warren

Hurkyl · Sep 8, 2004

The whole thing does get filled. I -might- be able to work with only part of the array at a time, and I -might- be able to identify large portions of the array as irrelevant, but I want to have an approach ready if I'm unable to manage either of these.

(And, of course, there's raw, unadultered curiosity that needs satisfied!

)

chroot · Sep 8, 2004

Well, the easiest solution from an application-code perspective is to make a 4+ GB partition on your disk and use it for swap. Then you should be able to call malloc() and have it actually allocate 4GB for you in one block. There's a good chance you'll have to recompile your linux kernel to handle >4 GB of virtual memory address space, since it'll require 64-bit pointers. I think it's possible, but it might get hairy.

You might also want to consider memory-mapping your 4 GB data file (mmap() in sys/mman.h). This will still require 4+ GB of virtual address space, and still might get hairy.

Either of those two solutions would allow you to access elements of your huge array with nary a line of special code.

But... don't do any of that.

The easiest solution is to memory map *part* of the file at a time -- say, 1 GB or 2 GB of it at once, and switch from one segment to the other when necessary. That way you won't have to worry about using 64-bit pointers or any of that hairy stuff I mentioned in the first two paragraphs of this post. Memory-mapping doesn't actually read the file's contents into memory or anything, so there's not much of a performance hit in switching from one part of the file to another. You will, of course, have to abstract your array to automatically map the correct part of the file each time it is accessed.

- Warren

Hurkyl · Sep 8, 2004

All right. I've never used any of the memory mapping functions... I've peeked at the man pages, but they weren't too enlightening last time I looked. Do you know of a link (or can you give yourself) a brief intro to using them?

chroot · Sep 8, 2004

http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

That's mmap() POSIX manpage.

If I were you, I would make a C++ class that overloads the [] operator to automatically check which segment is currently mapped, and map a new segment if it's not the right one. The operative code might be something like

Code:

class HugeArray {
 public:
   // segmentSize is in number of slots, NOT bytes
   HugeArray(int fileDescriptor, int segmentSize) {
     this->segmentSize = segmentSize;
     this->fileDescriptor = fileDescriptor;
     currentSegment = -1;
     segment = 0;
   }

   ~HugeArray() {
      delete segment;
   }

   int& operator[] (unsigned int i);
 
 private:
   int segmentSize;
   void* segment;
   int currentSegment;
   int fileDescriptor;
 };

int& HugeArray::operator[](unsigned int i) {
  int neededSegment = (int) (i / segmentSize);
  if (neededSegment != currentSegment) {
    delete segment;
    segment = mmap(0, segmentSize * sizeof(int), PROT_READ | PROT_WRITE, MAP_PRIVATE, fileDescriptor, neededSegment * segmentSize * sizeof(int));
    currentSegment = neededSegment;
  }
  return ((int*) segment)[i - (currentSegment * segmentSize)];
}

I'm not on a 'nix box where I can check this for syntax and functionality right now, but I will be glad to help you later if you need more help.

- Warren

chroot · Sep 8, 2004

BTW, if you're doing a whoooole lot of array accesses, there are a few optimizations you can make here and there to this code... though a good compiler may already do them for you.

- Warren

chroot · Sep 8, 2004

Oh, and you should probably use MAP_SHARED instead of MAP_PRIVATE if you're going to be storing changes to the array back to disk.

- Warren

Hurkyl · Sep 8, 2004

I'm generating the array in code rather than reading it from disk... will mmap resize the file as necessary, or will I need to make it myself?

My program will need the array until it's finished, but for future reference, is there any additional cleanup code that would be necessary, or is deleting the mapped memory and closing the FD sufficient?

chroot · Sep 8, 2004

I'm pretty sure that mmap will resize the file as needed. I'm also pretty sure you don't need to do any special cleanup beyond closing the file descriptor properly. Don't quote me on that, though -- you might want to run a couple of tests first.

- Warren

rgrig · Sep 12, 2004

Hurkyl, you should seriously consider identifying irrelevant parts of that array. Even the most trivial processing (say, find the sum modulo 2^32) of an array with the size 2^32 will take a LOT of time.

Goalie_Ca · Sep 13, 2004

rgrig is right. Secondly, with disk accesses it will be even slower.

rgrig · Sep 14, 2004

I might have exagerated a bit. The code:

Code:

unsigned i;
for (i = 0; i != unsigned(-1); ++i);

runs in about 4.5 seconds on a Pentium 4 at 2.66GHz, when compiled with gcc version 3.3 with the flag -O2 (i.e. optimizations turned on).

chroot · Sep 14, 2004

Even so, the first thing anyone should ever do when facing an array of this size is to find ways to avoid processing the entire array.

- Warren

Managing a Huge Array on a Unix/Linux System

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Who May Find This Useful

Similar threads

What Free Privacy-Focused AI Chatbots Don’t Use My Data for Training?

How far will we let AI control us?

If you think having a backup is too expensive, try not having one

Impersonation News

Cooling a processor chip

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers