C++ Program Suddenly Ends after Reading Huge Data

  • #1
ecastro
TL;DR Summary
I have written a C++ code in Visual Studio 2019 and it works when reading a small amount of data but suddenly ends after reading a huge set of data.
I have written a C++ program in Visual Studio 2019 that takes an input tab-delimited text file and outputs a text file that is also tab-delimited. The data from the text file are stored in a vector, and then the program performs calculations whose results are also written to a text file. The code works for a small set of data (like 100 data points); however, when I try to read a huge set of data (around 30 million points), the code just suddenly ends as if the calculations were done, but the output files are empty. It doesn't return any errors; it just ends.

I first thought that it might be the format of the data, but I took a small subset from the huge set and it works. I have also set the stack reserve size to 500,000,000, still with no luck. What could be the problem?

Thank you in advance.
 
  • #2
ecastro said:
What could be the problem?

Without seeing your code, how is anyone supposed to figure that out?
 
  • Like
Likes sysprog and Vanadium 50
  • #3
ecastro said:
What could be the problem?
Given the specificity with which you have given the details of the code, I'd say the answer is --- something.
 
  • Like
Likes anorlunda, jedishrfu, sysprog and 1 other person
  • #4
Sorry about this. Here is a snippet of the code.

C++:
    ifstream file;
    ofstream save_file1, save_file2;
    string file_name, line, file_cc, file_df;
    double J, min_J;
    int* c = nullptr;
    int* min_c = NULL;
    int data_point;
    int num_data = 0;
    double prog = 0;
    vector<int> data_set;

    cout << "Please enter the File Name: ";
    getline(cin, file_name);

    cout << "Please enter the File Name where to Save Data 1: ";
    getline(cin, file_cc);

    cout << "Please enter the File Name where to Save Data 2: ";
    getline(cin, file_df);

    const int num_cluster = 10;
    double mu[num_cluster];
    double min_mu[num_cluster];

    file.open(file_name);

    save_file1.open(file_cc);
    save_file2.open(file_df);

    if (!file)
    {
        cout << "Either the File doesn't exist or can't be opened!";
        exit(1);
    };

    if (file.is_open())
    {

        while (file.good())
        {
            getline(file, line, '\t');
            data_point = stoi(line);
            data_set.push_back(data_point);

            // Checking
            num_data = num_data + 1;
            cout << "Data Points Read: " << num_data << endl;
        }

        c = new int[data_set.size()];
        min_c = new int[data_set.size()];
        for (int K = 2; K <= num_cluster; K++)
        {
            for (int i = 0; i < 100; i++)
            {
                // Do something 1

                // Check progress
                prog = prog + 1;
                cout << prog << endl;
            }

            if (save_file1.is_open())
            {
                
                for (int i = 0; i < K; i++)
                {
                    if (i == K - 1)
                    {
                        save_file1 << min_mu[i] << endl;
                    }
                    else
                    {
                        save_file1 << min_mu[i] << '\t';
                    }
                }
            }

            if (save_file2.is_open())
            {
                save_file2 << K << '\t' << min_J << endl;
            }
        }

        delete[] c;
        delete[] min_c;
    }

    file.close();

    save_file1.close(); save_file2.close();
}
 
  • #5
I think that your program is overflowing the stack; namely, your data_set vector object is trying to use 30 million × 4 bytes/int, or 120 million bytes, of stack storage.

Here's an article that talks about the maximum size of stack storage for a thread - https://support.microsoft.com/en-us...ack-size-of-a-thread-that-is-created-in-a-nat. Although it's not current (2012) and is talking about IIS (Internet Information Services), the basic idea is applicable to why your program is behaving as it is. A solution they mention is to allocate memory from the heap rather than from the stack.
A quote from the article:
In Windows Server 2008 and higher, the maximum stack size of a thread running on 32-bit version of IIS is 256 KB, and on an x64 server is 512 KB.
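A minimal sketch of the difference (illustrative, assuming 30 million four-byte ints): the same 120 MB that overflows the stack as a local array fits easily when a std::vector owns it on the heap.
C++:
#include <vector>

int main()
{
    // int raw[30000000];              // ~120 MB as a local array: overflows the stack
    std::vector<int> data(30000000);   // same 120 MB, but the vector allocates it on the heap
    data[0] = 1;
}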
 
  • Like
Likes jim mcnamara, Klystron, phinds and 1 other person
  • #6
ecastro said:
Summary:: I have written a C++ code in Visual Studio 2019 and it works when reading a small amount of data but suddenly ends after reading a huge set of data.


Try with smaller amounts of data first, say 1 million lines of input.

30 million data points --> 120 MB with four-byte floats, or 240 MB with eight-byte doubles, which should be enough unless you are storing a lot more intermediate data.

I think you will have to add some logging to your program to see where it fails:

As an example, you could set up a line_counter and then output a dot after every 1,000 or 10,000 data points:

C:
line_counter++;
if((line_counter%10000)==0) printf(".");
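A fuller sketch of that logging idea (illustrative; the file name is an assumption), which also loops on the read itself rather than on file.good():
C++:
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream file("data.txt");          // assumed input file name
    std::string token;
    long line_counter = 0;
    while (std::getline(file, token, '\t'))  // loop on the read, not on file.good()
    {
        ++line_counter;
        if (line_counter % 10000 == 0) std::cout << '.' << std::flush;
    }
    std::cout << "\nRead " << line_counter << " data points\n";
}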
 
  • Like
Likes QuantumQuest and sysprog
  • #7
jedishrfu said:
30 million data points --> 120 MB with four-byte floats, or 240 MB with eight-byte doubles, which should be enough unless you are storing a lot more intermediate data.
The code in question has an array of type int, at four bytes each, for a total of 120 million bytes. That's way too much for the stack for a program in Visual Studio. The per-thread limit on stack size that I cited earlier is a bit low, but not too far off.

Here's an example of a program that runs:
C++:
#include <iostream>
const int ARR_SIZE = 250000;
int main()
{
    int Arr[ARR_SIZE];
    for (int i = 0; i < ARR_SIZE; i++)
    {
        Arr[i] = 2 * i + 1;
        if (i % (ARR_SIZE/10) == 0) std::cout << "Index: " << i << '\n';
    }
    std::cout << "Done";
}
In this example, the array of 1 million bytes is allocated on the stack. If you double ARR_SIZE to 500,000, you get a run-time stack overflow exception.

If the array is dynamically allocated (using heap memory), the program does work with an array of 30,000,000 int elements.
 
Last edited:
  • Like
Likes sysprog, QuantumQuest and FactChecker
  • #8
Good catch @Mark44 !

I primarily work in Java and was thinking heap size, not stack, for this large array. We typically use 128 MB minimum but have used 512 MB when the app calls for it.
 
  • #9
Mark44 said:
In this example, the array of 1 million bytes is allocated on the stack. If you double ARR_SIZE to 500,000, you get a run-time stack overflow exception.
But the program uses new for c and min_c. If you use new, only a pointer to the array will be stored on the stack, and the array itself will be stored on the heap. data_set is a vector object, and it also will use new for allocations and store its data on the heap.
If you declare an array in a function, it will get allocated on the stack.
 
  • Like
Likes jedishrfu
  • #10
Thank you for all the replies!

I have increased the heap reserve size to 50,000,000 and it still doesn't work. I counted the exact number of data points present using another program, and it matches the number of data points stored in the vector, which I think means that allocating memory for the vector is not the problem.

Now I tried debugging the code using the huge data set, and it throws a "std::bad_alloc at memory location" error, which I thought was because the computer I was using doesn't have enough memory. However, when I transferred the code to another PC, I still got the error.

Here is the part of the code where the error occurs, which is part of a function:
C++:
double** x = NULL;
x = new double*[data_set.size()];

for (int i = 0; i < data_set.size(); i++)
{
    x[i] = new double[K];
}

"K" here is an integer and "data_set" is the data set, which has a size of around 30 million. This piece of code throws an error at the value of "K" equal to 2 and "i" equal to 22 million. I think it is better to note that the value of "K" reaches until 10.
 
Last edited:
  • #11
ecastro said:
I have increased the heap reserve size to 50,000,000 and it still doesn't work. I counted the exact number of data points present using another program, and it matches the number of data points stored in the vector, which I think means that allocating memory for the vector is not the problem.
It certainly looks like you ran out of memory.

Are you compiling in x64 mode? A 32-bit program can only use 2 GB, and I think you need more than that. data_set, c and min_c are 120 MB each already, and now x will need another 120 MB for the pointers, and eventually 2.4 GB for the 30 million arrays of 10 doubles each.
There is also overhead if you do many small allocations, so it will be even more than that.

Are you sure you are also freeing up that memory? If you don't, you will need even more. It's also very inefficient to use new and delete so often for very small blocks. Once you know you'll need an array of size data_set.size() x 10, you can declare it in one block and reuse that.
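For example, a sketch of that one-block approach (illustrative; the function and variable names are my assumptions): allocate all data_set.size() x 10 doubles once, index the block as x[i * num_cluster + k], and reuse it for every K.
C++:
#include <cstddef>
#include <vector>

void cluster(const std::vector<int>& data_set, int num_cluster)
{
    // one allocation, reused for every K; freed once when x goes out of scope
    std::vector<double> x(data_set.size() * static_cast<std::size_t>(num_cluster));
    for (int K = 2; K <= num_cluster; ++K)
    {
        for (std::size_t i = 0; i < data_set.size(); ++i)
            for (int k = 0; k < K; ++k)
                x[i * num_cluster + k] = 0.0;  // stand-in for the real per-cluster work
    }
}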

The heap reserve size only controls the initial size of the heap; it shouldn't limit how large the heap can grow.
 
  • Like
Likes QuantumQuest
  • #12
willem2 said:
But the program uses new for c and min_c. If you use new, only a pointer to the array will be stored on the stack, and the array itself will be stored on the heap. data_set is a vector object, and it also will use new for allocations and store its data on the heap.
Yes, data_set is a vector object, but it is declared at what appears to be the top of main() (in post #4) like so: vector<int> data_set;
I believe this declaration causes data_set to be allocated on the stack.
Numbers read from the input file get added to the vector by this line in the while loop:
data_set.push_back(data_point);
willem2 said:
If you declare an array in a function, it will get allocated on the stack.
And the vector object is being declared in main(), I believe.

In the more recent code, post #10, things have become more complicated.
C++:
double** x = NULL;
x = new double*[data_set.size()];
Here x is an array of pointers to double, essentially a two-dimensional array, with each row's size based on the vector on the stack.

It would be better to have a one-dimensional vector allocated on the heap, rather than to use a stack-allocated object (data_set) and the two-dimensional heap object x.

The much simpler code example I showed in post #7 works with 30,000,000 int values, which would be equivalent to 15,000,000 double values. I didn't check for larger numbers of values.
 
  • #13
Here's a more C++-like version of the code I had in post #7:
C++:
#include <iostream>
const int ARR_SIZE = 50000000;  // 50,000,000
int main()
{
    double* vecPtr = new double[ARR_SIZE];
    for (int i = 0; i < ARR_SIZE; i++)
    {
        vecPtr[i] = 2 * i + 1;
        if (i % (ARR_SIZE/10) == 0) std::cout << "Index: " << i << '\n';
    }
    std::cout << "Done";
    delete[] vecPtr;  // release the heap block
}

Output:
Code:
Index: 0
Index: 5000000
Index: 10000000
Index: 15000000
Index: 20000000
Index: 25000000
Index: 30000000
Index: 35000000
Index: 40000000
Index: 45000000
Done
 
  • Like
Likes sysprog
  • #14
Mark44 said:
Yes, data_set is a vector object, but it is declared at what appears to be the top of main() (in post #4) like so: vector<int> data_set;
I believe this declaration causes data_set to be allocated on the stack.
No. What gets allocated on the stack is just a struct holding the size of the vector, the allocated capacity, and a pointer to the data.
If you declare a vector in a function (main() is also a function), this struct will end up on the stack, but the data of the vector will be allocated on the heap by new in the constructor of the vector.

If you declare a vector outside of a function, the compiler will reserve space for the struct before the program runs (not on the heap or the stack) and also add code to call the constructor, which uses new to allocate the data on the heap.

Only arrays whose size is known at compile time are allocated on the stack when declared inside a function.
C++:
int main () {
    int c[100];  // only in this case will the data be on the stack

    vector<double> vect[100];  // here 100 small vector structs will be put on the stack,
                               // but the data of the vectors is on the heap
}
Mark44 said:
C++:
double** x = NULL;
x = new double*[data_set.size()];

Here x is an array of pointers to double, essentially a two-dimensional array, with each row's size based on the vector on the stack.

It would be better to have a one-dimensional vector allocated on the heap, rather than to use a stack-allocated object (data_set) and the two-dimensional heap object x.
The data of data_set isn't stack allocated. In C++ you can't allocate stack memory with a size not known at compile time (unless you write your own storage allocator using alloca()).
This is mainly inefficient because of the 30 million storage allocations (and there could be more, because I can't see if they get deleted or reused for different values of K).
 
  • #15
@willem2, I stand corrected on what I said about data_set being allocated on the stack. I didn't grasp what was going on under the hood in the vector< some_type> constructor. Thanks for clearing that up.
 
  • #16
50,000,000 × 10 × 4 needs 31 bits of address.
Add the program, etc., and I REALLY hope you are running a 64-bit executable. :nb)
 
  • #17
Thank you for these insights!

willem2 said:
Are you compiling in x64 mode? A 32-bit program can only use 2 GB, and I think you need more than that. data_set, c and min_c are 120 MB each already, and now x will need another 120 MB for the pointers, and eventually 2.4 GB for the 30 million arrays of 10 doubles each.

Now that you mention it, the code does end when the memory reaches 2 GB, so this might indeed be the problem and I need to compile it in x64 (which I still need to learn how to do; I'm still a novice in these kinds of things :-p)

willem2 said:
This is mainly inefficient because of the 30 million storage allocations (and there could be more, because I can't see if they get deleted or reused for different values of K).

The "x" array is indeed deleted after I use it, i.e. it is deleted before "K" changes. However, as I was looking through the code, I think I can work around the code so that "x" only holds a single value, but I shall compile it first with x64.
 
  • #18
Geez,... so many errors and/or poor style, I hardly know where to start.

@ecastro: which code snippet is your latest attempt? I'm getting confused.

But here are some principles:

1. If invoking operator new, use the "nothrow" option instead of relying on it throwing a bad_alloc exception. Cf.
http://www.cplusplus.com/reference/new/operator new/
Then check the return value from new. If it's null, the allocation failed and you can print out an error message with precise details (a minimal sketch follows after these points).

(For this reason, I avoid the vector::push_back() function, since its return value is void. You've got to set up an exception handler around it to catch bad_alloc, and that becomes tedious if you want to print out the precise details of where/why one's program failed.)

2. Check the return codes on all function calls, and print error messages accordingly.

3. What is the actual algorithm you're applying to this data? Do you really need to load it all into memory at once (presumably to perform some kind of multi-pass algorithm?), or could you just use one sequential read of the file?

4. Another possibility is to write a separate program that sequentially reads each item in the input file, and writes it out as binary to another file. Then your main program could simply mmap() the binary file (assuming it doesn't exceed your address space as others have suggested).
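A minimal sketch of the nothrow check from point 1 (illustrative; the allocation size is an assumption):
C++:
#include <cstddef>
#include <iostream>
#include <new>

int main()
{
    const std::size_t n = 30000000;            // assumed element count
    double* x = new (std::nothrow) double[n];  // returns nullptr on failure instead of throwing
    if (x == nullptr)
    {
        std::cerr << "Allocation of " << n << " doubles failed\n";
        return 1;
    }
    delete[] x;
}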
 
  • Like
Likes jedishrfu
  • #19
strangerep said:
which code snippet is your latest attempt? I'm getting confused.

Hello, I haven't changed anything yet; the two snippets I have shown are part of the same algorithm. It's just that "x" is handled in a function rather than in "main". However, I was successful in compiling the code in x64 and it's now working! :wink:

But I would like to ask a few questions (if it's okay) regarding your suggestions; they might help me in understanding the algorithm or in the near future. By the way, I am a beginner in C++, so I might get lost and misunderstand some concepts.

strangerep said:
(For this reason, I avoid the vector::push_back() function, since its return value is void. You've got to set up an exception handler around it to catch bad_alloc, and that becomes tedious if you want to print out the precise details of where/why one's program failed.)

The algorithm I have written is dependent on the size of the input (dependent in terms of the array sizes), so I need the vector::push_back() function, unless there is another way.

strangerep said:
Check the return codes on all function calls, and print error messages accordingly.

I usually check the return codes on the functions through a sample set which I know what the return code should be. Is this sufficient?

strangerep said:
What is the actual algorithm you're applying to this data? Do you really need to load it all into memory at once (presumably to perform some kind of multi-pass algorithm?), or could you just use one sequential read of the file?

The algorithm I wanted to do is K-means, so I need to load all data into memory, I think.

strangerep said:
Another possibility is to write a separate program that sequentially reads each item in the input file, and writes it out as binary to another file. Then your main program could simply mmap() the binary file (assuming it doesn't exceed your address space as others have suggested).

May I know what this does and how different it is from reading from the input file directly?
 
  • #20
ecastro said:
Thank you for these insights!

Now that you mention it, the code does end when the memory reaches 2 GB, so this might indeed be the problem and I need to compile it in x64 (which I still need to learn how to do; I'm still a novice in these kinds of things :-p)

The "x" array is indeed deleted after I use it, i.e. it is deleted before "K" changes. However, as I was looking through the code, I think I can work around the code so that "x" only holds a single value, but I shall compile it first with x64.
That looks like good thinking to me.

When I was a kid, we had 24-bit addressing and thought ourselves lucky to have 12 megs of hand-sewn core memory -- in general what we did when running up against the hardware constraints was break up the problem into sectors -- e.g. do 1 to 100,000, and then do 100,000 to 200,000 but not call it that -- we'd pretend it was 1 to 100,000 again, but this time with a 'b' stored somewhere instead of an 'a' -- we didn't have the luxury of having enough real memory to hold all the values, and sometimes if we'd go low on disk, we'd start storing interim values on tape . . .

I think that use of 64-bit addressing could help you to avoid such labor . . .
 
Last edited:
  • #21
willem2 said:
In C++ you can't allocate stack memory with a size not known at compile time.

This is no longer true in C99, which added variable-length arrays. They are especially useful when you have the heap shut off and you don't want to fiddle with the stack pointer manually.

BoB

Edit: you can use alloca() for the fiddling, but you still have to manage the space manually.
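A tiny sketch of the alloca() route (illustrative; MSVC spells it _alloca in <malloc.h>, and the header below is the POSIX/glibc one). The space lives on the stack and disappears when the function returns, so it must never be freed or kept:
C++:
#include <alloca.h>
#include <cstdio>

void scratch(int n)  // n is not known at compile time
{
    int* tmp = static_cast<int*>(alloca(n * sizeof(int)));  // stack allocation, no delete
    long sum = 0;
    for (int i = 0; i < n; ++i) { tmp[i] = i; sum += tmp[i]; }
    std::printf("sum 0..%d = %ld\n", n - 1, sum);
}

int main() { scratch(1000); }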
 
  • #22
Well, the theoretical max for a 64-bit processor in most architectures is 16 exabytes, and if you can't fit all of your working-storage data into that much space, you might be doing something trendy, instead of right.
 
  • Haha
Likes Tom.G
  • #23
ecastro said:
[...] However, I was successful in compiling the code in x64 and it's now working! :wink:
OK, well, maybe that's good enough for your purposes.

The algorithm I have written is dependent on the size of the input (dependent in terms of the array sizes), so I need the vector::push_back() function, unless there is another way.
I use very little of the standard C++ libraries, having written my own. But this is of no use to you, of course, and the simplest fix would be to use an exception handler to catch a bad_alloc exception thrown from inside vector::push_back(), as sketched below.
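A minimal sketch of that handler (illustrative; the element count is an assumption):
C++:
#include <iostream>
#include <new>
#include <vector>

int main()
{
    std::vector<int> data_set;
    try
    {
        for (long i = 0; i < 30000000; ++i)
            data_set.push_back(42);
    }
    catch (const std::bad_alloc&)
    {
        std::cerr << "push_back failed after " << data_set.size() << " elements\n";
        return 1;
    }
}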

I usually check the return codes on the functions through a sample set which I know what the return code should be. Is this sufficient?
I don't understand what you mean by "through a sample set which I know what the return code should be". Every function you call should have documentation on what return codes are possible, and maybe any associated "errno" values (or its MS equivalent), if applicable.

The algorithm I wanted to do is K-means, so I need to load all data into memory, I think.
I haven't programmed K-means but that's clearly a non-trivial exercise (NP-hard) in general, unless you're using one of the simplified variants.

May I know what this does and how different it is from reading from the input file directly?
mmap() just maps the input file into your program's address space, so you can read it like an array or string. In some cases, this can be more efficient since it saves doing a copy into heap space. But if you must perform multiple passes over the binary data, then mmap()'ing the input file probably won't help much.

BTW, mmap() is called something else under Windows, which some googling might reveal, just for curiosity -- but don't spend too much time on this since achieving an efficient K-means implementation acting on the in-memory binary data is likely to be more important.
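For the curious, a rough POSIX sketch of the idea (illustrative; the binary file name is an assumption, and on Windows the analogous calls are CreateFileMapping/MapViewOfFile):
C++:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <iostream>

int main()
{
    int fd = open("data.bin", O_RDONLY);  // assumed binary file of raw ints
    if (fd < 0) { std::cerr << "open failed\n"; return 1; }
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::cerr << "mmap failed\n"; close(fd); return 1; }
    const int* data = static_cast<const int*>(p);  // read the file like an array
    std::size_t n = st.st_size / sizeof(int);
    std::cout << "Mapped " << n << " ints; first = " << (n ? data[0] : 0) << '\n';
    munmap(p, st.st_size);
    close(fd);
}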
 

Frequently asked questions

1. Why does my C++ program suddenly end after reading huge data?

There could be several reasons for this issue. One possibility is that your program is running out of memory while trying to process the large amount of data. Another possibility is that there may be a bug or error in your code that is causing the program to crash. It is also possible that there is an issue with the data itself, such as incorrect formatting or missing values.

2. How can I prevent my C++ program from ending abruptly when reading large amounts of data?

To prevent your program from ending abruptly, you can try optimizing your code to use less memory. This could involve using more efficient data structures, avoiding unnecessary copies of data, or implementing memory management techniques. You should also thoroughly test your code and handle any potential errors or exceptions that may occur when processing the data.

3. Is there a maximum amount of data that a C++ program can handle?

There is no specific maximum amount of data that a C++ program can handle, as it depends on various factors such as the available memory and the efficiency of your code. However, it is important to keep in mind that as the amount of data increases, the performance of your program may decrease. It is always a good idea to optimize your code and test it with different sizes of data to ensure it can handle the desired amount without any issues.

4. Can the compiler affect the sudden end of my C++ program?

Yes, the compiler you are using can have an impact on the behavior of your program. Different compilers may have different memory limits or default settings, which can affect how your program handles large amounts of data. It is recommended to check the documentation of your compiler and make any necessary adjustments to optimize your program.

5. How can I troubleshoot and fix the sudden end of my C++ program when reading huge data?

To troubleshoot and fix the issue, you can try debugging your code to identify any logical errors or bugs that may be causing the program to crash. You can also use tools such as memory profilers to analyze the memory usage of your program and identify any potential memory leaks. Additionally, it may be helpful to break your program into smaller chunks and test each part separately to pinpoint the source of the issue.
