
Big arrays

  1. Sep 8, 2004 #1

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    My array is big. No, bigger than that. It's big!

    I'm writing a C++ program on a unix box (or is it linux? I never notice the difference), and I want to operate on a very large amount of data: more than 2^32 bytes.

    Do I have any options aside from writing my own code to manage a swap file on disk?
     
  3. Sep 8, 2004 #2

    chroot

    Do you actually have 2^32 non-zero bytes? Or is most of your array zeroes? If so, you might want to implement (or use an existing) sparse array.

    - Warren
     
  4. Sep 8, 2004 #3

    Hurkyl

    The whole thing does get filled. I -might- be able to work with only part of the array at a time, and I -might- be able to identify large portions of the array as irrelevant, but I want to have an approach ready if I'm unable to manage either of these.

    (And, of course, there's raw, unadulterated curiosity that needs satisfying! :smile:)
     
  5. Sep 8, 2004 #4

    chroot

    Well, the easiest solution from an application-code perspective is to add a 4+ GB swap partition on your disk, then just call malloc() and have it allocate the whole thing in one block. The catch is that a 32-bit process can address at most 4 GB of virtual memory no matter how the kernel is configured -- to malloc() more than that in one block you'd need a 64-bit machine, with 64-bit pointers. It's possible, but it might get hairy.

    You might also want to consider memory-mapping your 4 GB data file (mmap() in sys/mman.h). This will still require 4+ GB of virtual address space, and still might get hairy.

    Either of those two solutions would allow you to access elements of your huge array with nary a line of special code.

    But... don't do any of that.

    The easiest solution is to memory map *part* of the file at a time -- say, 1 GB or 2 GB of it at once, and switch from one segment to the other when necessary. That way you won't have to worry about using 64-bit pointers or any of that hairy stuff I mentioned in the first two paragraphs of this post. Memory-mapping doesn't actually read the file's contents into memory or anything, so there's not much of a performance hit in switching from one part of the file to another. You will, of course, have to abstract your array to automatically map the correct part of the file each time it is accessed.

    - Warren
     
  6. Sep 8, 2004 #5

    Hurkyl

    All right. I've never used any of the memory mapping functions... I've peeked at the man pages, but they weren't too enlightening last time I looked. Do you know of a link to (or can you give one yourself) a brief intro to using them?
     
  7. Sep 8, 2004 #6

    chroot

    http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

    That's the POSIX manpage for mmap().

    If I were you, I would make a C++ class that overloads the [] operator to automatically check which segment is currently mapped, and map a new segment if it's not the right one. The operative code might be something like

    Code (Text):

    #include <sys/mman.h>  // mmap, munmap

    class HugeArray {
     public:
       // segmentSize is in number of slots, NOT bytes; choose it so that
       // segmentSize * sizeof(int) is a multiple of the page size, since
       // mmap() offsets must be page-aligned
       HugeArray(int fileDescriptor, int segmentSize) {
         this->segmentSize = segmentSize;
         this->fileDescriptor = fileDescriptor;
         currentSegment = -1;
         segment = 0;
       }

       ~HugeArray() {
          // mmap'd memory is released with munmap(), not delete
          if (segment)
            munmap(segment, segmentSize * sizeof(int));
       }

       int& operator[] (unsigned int i);
     
     private:
       int segmentSize;
       void* segment;
       int currentSegment;
       int fileDescriptor;
     };

    int& HugeArray::operator[](unsigned int i) {
      int neededSegment = (int) (i / segmentSize);
      if (neededSegment != currentSegment) {
        if (segment)
          munmap(segment, segmentSize * sizeof(int));
        // cast the offset to off_t so it doesn't overflow int for a >2 GB file
        segment = mmap(0, segmentSize * sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fileDescriptor,
                       (off_t) neededSegment * segmentSize * sizeof(int));
        // real code should check for MAP_FAILED here
        currentSegment = neededSegment;
      }
      return ((int*) segment)[i - (unsigned int) currentSegment * segmentSize];
    }
     
    I'm not on a 'nix box where I can check this for syntax and functionality right now, but I will be glad to help you later if you need more help.

    - Warren
     
    Last edited: Sep 8, 2004
  8. Sep 8, 2004 #7

    chroot

    BTW, if you're doing a whoooole lot of array accesses, there are a few optimizations you can make here and there to this code... though a good compiler may already do them for you.

    - Warren
     
  9. Sep 8, 2004 #8

    chroot

    Oh, and you should probably use MAP_SHARED instead of MAP_PRIVATE if you're going to be storing changes to the array back to disk.

    - Warren
     
  10. Sep 8, 2004 #9

    Hurkyl

    I'm generating the array in code rather than reading it from disk... will mmap resize the file as necessary, or will I need to make it myself?


    My program will need the array until it's finished, but for future reference, is there any additional cleanup code that would be necessary, or is deleting the mapped memory and closing the FD sufficient?
     
  11. Sep 8, 2004 #10

    chroot

    Actually, mmap() won't grow the file for you -- touching pages past the end of the mapped file gets you a SIGBUS. Extend the file to its full size up front with ftruncate() (or lseek() past the end and write a byte). For cleanup, munmap() whatever is still mapped and close the file descriptor; if you used MAP_SHARED, an msync() beforehand makes sure everything has hit the disk. You might want to run a couple of tests first, though.

    - Warren
     
  12. Sep 12, 2004 #11
    Hurkyl, you should seriously consider identifying irrelevant parts of that array. Even the most trivial processing (say, finding the sum modulo 2^32) of an array of size 2^32 will take a LOT of time.
     
  13. Sep 13, 2004 #12
    rgrig is right. And with disk accesses it will be even slower.
     
  14. Sep 14, 2004 #13
    I might have exaggerated a bit. The code:
    Code (Text):

    unsigned i;
    for (i = 0; i != unsigned(-1); ++i);
     
    runs in about 4.5 seconds on a Pentium 4 at 2.66GHz, when compiled with gcc version 3.3 with the flag -O2 (i.e. optimizations turned on).
     
  15. Sep 14, 2004 #14

    chroot

    Even so, the first thing anyone should ever do when facing an array of this size is to find ways to avoid processing the entire array. :smile:

    - Warren
     