Allocating a huge boolean array fails

Allocating a boolean array of size 3^27 on a Linux system results in a segmentation fault due to the limitations of stack memory and the requirement for contiguous memory allocation. The proposed solution is to use dynamic memory allocation with `new` in C++ or `malloc()` in C, which avoids stack overflow issues. Additionally, the size of the boolean data type is often one byte rather than one bit, leading to an unexpectedly large memory requirement of approximately 7 terabytes. For handling such large datasets, alternative strategies like using a bit array or a database are recommended, as well as considering parallel computing or cloud solutions for processing. Efficient memory management and algorithm design are crucial for tackling problems that require extensive resources.
jk22
Hello,

I'm trying to allocate a bool[3^27] array of bits on a Linux system with a 1.2 TB swap partition, but the system says Segmentation Fault. How could I do this? Thanks.
 
An array needs to be allocated as a contiguous area of memory/disk.
You may have enough memory/disk available in total, but not a contiguous region of that size.
 
However we do it, it should never result in a segmentation violation. If it can't be allocated, we'll get an error condition instead.
How are you allocating your array exactly?
And when do you get the segmentation fault exactly?
 
I think you're asking too much of your computer to allocate such a big array. I would instead look to see how it could be partitioned and perhaps stored in a database or a random access file.
 
I have written

Long long int size=pow3 (27);

Printf ("allocating");
Bool f [size];
Printf ("..ok");

Pow3(n); is a subroutine returning a long long int equals to 3^n.

It never arrives to print ok.

Would bool *f=new bool [size]; be more flexible even if the space is not contiguous ?
 
What is the problem you are trying to solve that you need such a large array?
 
jk22 said:
I have written

Long long int size=pow3 (27);

Printf ("allocating");
Bool f [size];
Printf ("..ok");

Pow3(n); is a subroutine returning a long long int equals to 3^n.

It never arrives to print ok.

Would bool *f=new bool [size]; be more flexible even if the space is not contiguous ?
That declaration puts the array in stack space, which is limited, causing a stack overflow. Instead I suggest using 'new', which puts it in dynamic memory. Then it will either succeed or throw a bad_alloc exception.
 
You left some important context out of your question, particularly what programming language you are trying to use.

However if you are trying to allocate large arrays in C or C++ then the following issues are relevant:
  • The size of the bool datatype depends on the compiler but it is most likely one byte, not one bit, so when you try to create an array of ##3^{27}## booleans you are probably asking to create an array nearly 7 terabytes in size rather than the ~900 gigabytes you might have been expecting. If you think storing one boolean per byte is wasteful and want to store them as bits then this requires some additional work (e.g. look up how to use "bit fields").
  • Like others have pointed out, arrays are stored as a contiguous block in memory. If you have e.g. 1000 megabytes of free memory then this is no guarantee that you will be able to allocate a 1000 megabyte array: it could be that there is a block of 400 megabytes free somewhere and another block of 600 megabytes free somewhere else. So if you need close to the total amount of memory you have available you may need to use some other data structure that doesn't store all data in one contiguous block.
  • When you use the bool array[SIZE] syntax inside a function it allocates the array on the stack. This is an area of memory reserved for local function variables. The amount of stack space is usually very limited (e.g. a few megabytes, though the compiler or operating system may offer a way to change this). There are some ways to allocate an array outside the stack which don't have this limitation:
    • Make the array a global variable (i.e., move the declaration outside the body of any function).
    • Declare the array static.
    • Use malloc() or similar or (in C++) the new keyword to allocate memory.

    These won't help with the limitations above, though.


A couple of additional points:
  • When you create an array using TYPE array[SIZE], the size should be a compile-time constant (a literal number or constant expression like 1024*1024 or a #defined constant or const variable). If you want to create an array whose size will be computed at runtime (like by your pow3() function) then you should use malloc() or new in C++.

    C (but not C++) supports variable-length arrays (arrays whose sizes don't need to be known until runtime) since the C99 standard and some C++ compilers (like the GNU compiler) support them as a nonstandard extension. But they're still stack-allocated (with the limitations that follow) and they're risky to use (and nonportable in C++) because there is no way to test if the allocation was successful. (malloc(), by contrast, returns a null pointer if it fails to allocate memory.)
  • jk22 said:
    Printf ("allocating");
    Bool f [size];
    Printf ("..ok");

    [...]It never arrives to print ok.
    In your case probably the array allocation failed, but in general not seeing it print "..ok" doesn't necessarily mean anything. The standard output stream is buffered in C and C++. This means calling printf() won't necessarily print anything immediately but instead save what you want to print into a buffer and wait until there is a certain amount of text in the buffer before actually printing it. You should call fflush(stdout) if you want to force printing to be done immediately, or print to standard error, which is unbuffered, instead of standard output. (Also, what are "Printf" and "Bool"? C and C++ like most languages are case sensitive. I am assuming you meant "printf" and "bool".)
 
jk22 said:
Hello,
I'm trying to allocate a bool[3^27] array of bits in a Linux system with 1,2TB swap partition but the system says Segmentation Fault. How could I do this ? Thanx.
Part of the skill involved in programming is developing algorithms that fit within the resources available to you. Even if you managed to allocate such an array, the first thing you would probably want to do is clear it to all false values. That would take tens of minutes if not hours.
I suggest you find a new line of attack.
 
  • #10
Also: swap (virtual memory) is really slow. It's hundreds or thousands of times slower than main memory. So even if you can get your code working be prepared for your computer to grind to a slow crawl, including any other applications you are running as the memory they're using is swapped out.
 
  • #11
If I read this thread correctly, and I assume that a boolean data object is 8 bits for your compiler:
##3^{27}## = 7 625 597 484 987, on the order of ##10^{13}## bytes, somewhat less than ten trillion bytes. Allowing for the 1024-per-kB conversion factor you are still in the 7 TB range. You do not have 7 TB of virtual memory.

[aside]This also looks like you are trying something with Graham's number.[/aside]

You have choices, assuming this is a rational thing to try in the first place:
Find a bignum library for your compiler and use it; there are several free libraries. Your compiler will thank you.
Find an alternate data representation: maybe use a bit array, which gets you down to circa 1 TB of virtual memory. Still a lot of disk.
 
  • #12
Thanks for your replies. I assumed a bool was 1 bit, hence about 7/8 of a terabyte.

In fact I fell upon this problem by wanting to check if any function f(x,y,z) can be decomposed as k(g(x,y),h(x,z)),l(y,z))

This decomposition seems okay; it's like all the projections onto 2D planes of a higher-dimensional space.*

I took x,y,z in {0,1} and it's OK, but {0,1,2} is already too huge. In fact the number of functions f to check has complexity v^(m^d), where f(x1,...,xd) is in {1,...,v} and each xi is in {1,...,m}. Hence this grows much faster than exponential or even factorial.

*Then I also tried higher numbers of variables in {0,1}, but it seems that dimension 7 is the highest we could represent in this way (which is out of the scope of any computer we could imagine). But I think there are infinitely many possibilities to decompose into 2-variable functions. (If somebody has references to this problem, thanks.)
 
  • #13
jk22 said:
In fact I fell upon this problem by wanting to check if any function f(x,y,z) can be decomposed in k(g(x,y),h(x,z)),l(y,z))

Is there some constraint on these functions? E.g., why don't you consider $$\begin{eqnarray}
k(x, z, y) &=& f(x, y, z) \,, \nonumber \\
g(x, y) &=& x \,, \nonumber \\
h(x, z) &=& z \,, \nonumber \\
l(y, z) &=& y
\end{eqnarray}$$ a solution to this problem?
 
  • #14
wle said:
If you want to create an array whose size will be computed at runtime (like by your pow3() function) then you should use malloc() or new in C++.

Or use the standard C++ 'vector' data type, which allocates memory as necessary, e.g. 'std::vector<bool> f(size)'.
 
  • #15
wle said:
Is there some constraint on these functions? E.g., why don't you consider $$\begin{eqnarray}
k(x, z, y) &=& f(x, y, z) \,, \nonumber \\
g(x, y) &=& x \,, \nonumber \\
h(x, z) &=& z \,, \nonumber \\
l(y, z) &=& y
\end{eqnarray}$$ a solution to this problem?

k shall be a function of 2 variables.
I wrote it wrong: it's m(k(g(x,y),h(x,y)),l(x,y)). Your solution is basically writing f(x,y,z)=k(g(x,y),z), which is not always possible, since in this way you can generate only 88 of the 256 functions on {0,1}.
Yes, thanks jtbell, I tried new, but I have to decompose [3^27]=[3^9]^3, which means a slight re-encoding in base 3: easy, but it takes time for such a huge number of calls.
 
  • #16
jtbell said:
Or use the standard C++ 'vector' data type, which allocates memory as necessary, e.g. 'std::vector<bool> f(size)'.
Fun fact, an std::vector<bool> is actually a specialization that is implemented as an array of bits.
As such it's not a 'true' STL container.
See for instance http://en.cppreference.com/w/cpp/container/vector_bool
 
  • #17
Whatever you are trying to do, setting up ##3^{27}## bools is almost certainly not what you want. That's saving the results of over 7 trillion calculations. If each calculation takes a microsecond, that's still about 3 months.
 
  • #18
I changed to something smaller, 2^27, but even that failed. In fact the bug came from the usage of bool. I did it with char instead, recomputed the bits, and it worked fine.
 
  • #19
Messages like #18 make me sad.

Message #8 was a great message in explaining what the problem is and how to fix it. But you are sticking to your guns and trying the same thing that didn't work last time.
 
  • #20
wle said:
arrays are stored as a contiguous block in memory.
Only the virtual address space for an array is contiguous; the physical allocation can be made up of scattered 4096-byte physical memory pages.

wle said:
C (but not C++) supports variable-length arrays since the C99 standard and some C++ compilers support them as a nonstandard extension.
Microsoft's Visual Studio compilers do not support variable-length arrays. They've somewhat kept up with C++ standards, but not C standards.
 
  • #21
You need to figure out what parts of the memory you need at any given time and only load that. You're trying to load the whole thing into virtual memory and let the OS figure out what you need in actual RAM. Any software that has to handle more than a gig or two of data manages a lot of its own swapping (when an app does it, it's usually called caching). What on Earth are you trying to calculate with all of that data? My guess is that if you require that much memory, you need to figure out how to use parallelization and send the work off to a server farm.

Other than things like physics simulations, the only thing I can think of that actually requires terabytes of data is ray tracing for things like movies. Neither can be done on a regular person's computer. Jobs like those are done on supercomputers, and things that take even more horsepower are farmed out to companies like Amazon.
 
  • #22
In fact writing a master code that will overflow your memory is not only discouraged practice but also wrong (it means that you are not working efficiently).
The only way to go about it is to either revisit the way you approach your problem and try to be smart, or figure out a way to use cloud or cluster computing to split your (impossible) master job into smaller mini-jobs that run on different machines in parallel.
 
  • #24
erratum : I'm trying to write $$f(x,y,z)$$ in the form $$m(g(x,z),h(y,z))$$ more precisely to find the ratio of such generated function over the whole in the case $$m:\{0,1,2\}^2 \rightarrow \{0,1\}$$ and $$h,g : \{0,1,2\}^2\rightarrow\{0,1,2\}$$
 
  • #25
jk22 said:
erratum : I'm trying to write $$f(x,y,z)$$ in the form $$m(g(x,z),h(y,z))$$ more precisely to find the ratio of such generated function over the whole in the case $$m:\{0,1,2\}^2 \rightarrow \{0,1\}$$ and $$h,g : \{0,1,2\}^2\rightarrow\{0,1,2\}$$
This is very different from what you first asked, about an array that contained ##3^{27}## bits.

Your notation suggests to me that h and g each have one of 9 pairs of input values -- (0, 0), (0, 1), (0, 2), (1, 0), ..., (2, 2), and produce 0, 1, or 2 as the output. m has the same 9 pairs as input values, and produces 0 or 1 as its output. Maybe you could use small lookup tables or small arrays?
 
  • #26
If we count, there are ##2^9## different m and ##3^9## different g and h.

Hence the loop is over ##2^9 \cdot 3^{18}## different function compositions.

Somehow I have to check if the obtained function f is different from the others, hence an array of ##2^{27}## bits or 16 MB, which is manageable.

In fact it's the loops that use much time, like a week.

I wonder if it's possible to count the functions without an exhaustive search on a computer.
 
  • #27
jk22 said:
If we count there are ##2^9## different m and ##3^9## different g and h.

Hence the loop is over ##2^9 3^{18}## different function composition.
Wouldn't that be ##2^9 3^9 = 6^9## different possibilities? That's just over 10,000,000 different possibilities.
 
  • #28
I think it's ##3^9## for g and ##3^9## for h, and then we have to multiply them all.
 
  • #29
jk22 said:
I think it's ##3^9## for g and ##3^9## for h then we have to multiply all.
You're right. I used m and either g or h instead of g and h. Anyway, that's just ##3^9 \cdot 3^9 = 9^9## or 387,420,489 different input pairs for m. Still a lot less than your original ##3^{27}##.

Since m produces either 0 or 1, I'm not sure you need to double the result above.

LaTeX tip: For plain old numbers, like ##3^9##, use a pair of ## characters rather than a pair of $$. The former are for inline LaTeX and the latter for standalone LaTeX. There's no good reason that I can see for ##3^9##, for example, to be formatted on its own line with extra lines before and after.
 
  • #30
jk22 said:
erratum : I'm trying to write $$f(x,y,z)$$ in the form $$m(g(x,z),h(y,z))$$ more precisely to find the ratio of such generated function over the whole in the case $$m:\{0,1,2\}^2 \rightarrow \{0,1\}$$ and $$h,g : \{0,1,2\}^2\rightarrow\{0,1,2\}$$

Have you considered starting with a smaller problem?
Say with ##m:\{0,1\}^2 \rightarrow \{0,1\}## and ##h,g : \{0,1\}^2\rightarrow\{0,1\}##.
It means we have ##2^{2^3}=256## options for f and ##2^{2^2}=16## options for both g and h.
From there we can construct m and validate it, so that with brute force we need to verify ##256\cdot 16 \cdot 16 \cdot 8 = 2^{19} = 524288## values.
Still large but much more feasible.

Btw, I have the sneaking suspicion that it's not possible for all functions f, but I can't prove it yet.
It seems to me that since we reduce the domain of f from ##\{0,1,2\}^3## to the domain of m, which is ##\{0,1,2\}^2##, we cannot distinguish all possibilities.
The same should apply for the smaller domains.
 
