How to store a large data-set in a "grid" in Python/NumPy

ergospherical · Jan 29, 2024

Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.

Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?

Baluncore · Jan 29, 2024

An array one thousand, by one thousand, is a mega-element array. Stack overflow is probable.

ergospherical said:

What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?

That will depend on how sparse the data is in the grid. Maybe you do not need to store all that zero data.

Can you arrange the records needed together, to be read together from storage?
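To make the sparsity point concrete: if most grid points hold zero, only the nonzero entries need storing. A minimal sketch using a plain dictionary keyed by index tuples; the grid shape, the stored values and the `lookup` helper are all made up for illustration.

```python
# Hypothetical 3-parameter grid, 1000 values per axis, mostly zero.
shape = (1000, 1000, 1000)
sparse = {}  # maps an index tuple (i, j, k) to a nonzero f value

sparse[(3, 500, 999)] = 2.5
sparse[(10, 0, 42)] = -1.0


def lookup(idx):
    """Return f at idx, treating anything not stored as zero."""
    return sparse.get(idx, 0.0)


print(lookup((3, 500, 999)), lookup((0, 0, 0)))  # 2.5 0.0
print(len(sparse), "stored out of", shape[0] * shape[1] * shape[2])
```

This only pays off when the stored fraction is small; for a dense grid the dictionary overhead makes it worse than a plain array.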

pbuk · Jan 29, 2024

ergospherical said:

Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.

Unless the values of each ## X_i ## are restricted to a finite set of integers I don't see the analogy to a grid or an nD matrix. Instead I see a 2D matrix with k rows (k is the number of observations) and n + 1 columns [X0, X1, ..., Xn-1, f(X)] (note I have renumbered Xi to start at zero as is the convention for almost all computing apart from Fortran).

ergospherical said:

Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy?

To generate the grid, and then be able to read sections of the grid efficiently?

Assuming it will all fit in memory, almost certainly a numpy.array. This can be stored on disk as a CSV file which is probably better for portability than anything else.

If 8k(n + 1) > (available memory) then you could try an SQL database (with indexes on all Xi columns) but if this is too slow it's going to be difficult.
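As a rough check on that threshold, 8k(n + 1) is just the byte count of k rows of n + 1 float64 values. A minimal sketch; the helper name `table_bytes` and the example sizes are invented for illustration.

```python
def table_bytes(k: int, n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for k rows of n parameters plus one f(X) column."""
    return bytes_per_value * k * (n + 1)


# Hypothetical sizes: 100 million observations of 5 parameters.
k, n = 100_000_000, 5
needed = table_bytes(k, n)
print(f"{needed / 1e9:.1f} GB")  # 4.8 GB
```

Bear in mind that NumPy operations often create temporary copies, so it is worth leaving headroom well beyond this figure.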

f95toli · Jan 30, 2024

Thousands of values in each direction is, by modern standards, not large at all. A NumPy array can easily handle that.
If you want a more efficient file format than CSV you can have a look at HDF5; it is not quite as portable, but it is widely supported and much more efficient.

ergospherical · Feb 1, 2024

Is there a good way of reducing the RAM usage if I'm writing to a large numpy array? I've followed @pbuk's suggestion so far, writing to a 2d numpy array of size ##k \times (n+1)##. But I'm also using Google Colab and there's only 12.7GB of system RAM which is quickly used up, causing it to crash. Is there a way to write each row to a CSV file sequentially, instead of waiting for the entire numpy array to populate first?

pbuk · Feb 1, 2024

Where is the data coming from? Can you create it a row at a time and append to a file?

ergospherical · Feb 1, 2024

Yeah, that works actually.
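For anyone following along, creating a row at a time and appending might look something like the sketch below. It uses only the standard library `csv` module; `compute_row` is a hypothetical stand-in for wherever the data really comes from, and the file goes in a temporary directory just to keep the example self-contained.

```python
import csv
import os
import tempfile


def compute_row(i):
    """Hypothetical stand-in for the real calculation: returns [X0, X1, f(X)]."""
    x0, x1 = float(i), float(i) * 0.5
    return [x0, x1, x0 + x1]


path = os.path.join(tempfile.mkdtemp(), "grid.csv")

# Append mode, so a restart after a crash adds to the rows already on disk.
with open(path, "a", newline="") as fh:
    writer = csv.writer(fh)
    for i in range(1000):
        writer.writerow(compute_row(i))  # only one row is held in memory
```

Memory use stays flat no matter how many rows are written, since nothing accumulates in a NumPy array.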

pbuk · Feb 1, 2024

Of course that does indicate that you are going to have problems reading it all back in again, at least in Google Colab. Options include processing in batches and transferring to a local workstation with a bit more oomph.
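Processing in batches could be sketched roughly as follows: read a fixed number of rows, reduce them, and discard them before reading the next batch. The generator `read_batches`, the batch size and the tiny example file are assumptions for illustration only.

```python
import csv
import itertools
import os
import tempfile

import numpy as np

# Write a small example file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "grid.csv")
with open(path, "w", newline="") as fh:
    csv.writer(fh).writerows([i, 2 * i, 3 * i] for i in range(10))


def read_batches(csv_path, batch_size):
    """Yield the file as 2D float arrays of at most batch_size rows each."""
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        while True:
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break
            yield np.array(rows, dtype=float)


# Example reduction: sum the last column, f(X), without loading the whole file.
total = 0.0
for batch in read_batches(path, batch_size=4):
    total += batch[:, -1].sum()
print(total)  # 135.0
```

This works for anything that can be computed as a running reduction; calculations that need random access to the whole table call for a different storage layout.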

ergospherical · Feb 1, 2024

Yeah, I think I’m going to give up on Colab. I was only using it because it’s convenient to share progress with the rest of the team but that’s less important…

Baluncore · Feb 1, 2024

If your data can use sparse arrays, the size problem will disappear.

Alternatively, maintain contact only with the rows and columns of the disk file that are needed at that time. Do that with two smaller arrays in memory, keep them smaller than the CPU data cache.
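One way to keep only the needed part of a disk file in memory is a memory-mapped array. The sketch below uses NumPy's `.npy` memory mapping as an illustration; it is not necessarily what was meant above, and the grid shape and fill values are invented.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "grid.npy")
shape = (50, 40, 30)  # hypothetical grid: 50 x 40 x 30 values of f

# Create the .npy file on disk and fill it one slab at a time.
grid = np.lib.format.open_memmap(path, mode="w+", dtype=np.float64, shape=shape)
for i in range(shape[0]):
    grid[i] = i  # stand-in for the real f(X) values on this slab
grid.flush()
del grid

# Later: map the file read-only and pull out just the section of interest.
grid = np.load(path, mmap_mode="r")
section = np.array(grid[10:12, :5, 0])  # copies only this 2 x 5 block into RAM
print(section.shape)  # (2, 5)
```

Slices along the first axis are contiguous on disk in the default C order, so arranging the most frequently sliced parameter first tends to make reads cheaper.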

f95toli · Feb 2, 2024

Is there a reason why you need to use CSV? It is very, very inefficient.
Many different file formats were developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by Matlab and Python) but there are others.
CSV is very convenient for small datasets, but in my view it should not be used for anything larger than a few hundred rows/columns.

ergospherical · Feb 2, 2024

Yes I'm writing to HDF5 now

pbuk · Feb 2, 2024

f95toli said:

Is there a reason why you need to use CSV? It is very, very inefficient.
Many different file formats were developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by Matlab and Python) but there are others.
CSV is very convenient for small datasets, but in my view it should not be used for anything larger than a few hundred rows/columns.

CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!

PeterDonis · Feb 2, 2024

pbuk said:

If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!

I have certainly used CSV and reading the entire file into memory for datasets with much more than a few hundred rows (I don't know that my column count has ever gotten that high). However, the computations I was doing on the data were fairly simple and didn't require a lot of additional objects that would take up a lot of additional memory. The latter might not be the case for the kinds of computations the OP is doing.

f95toli · Feb 6, 2024

pbuk said:

CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!

I have also used CSV for larger datasets. However, these days we try to avoid using it as a format.
There are a couple of reasons. Firstly, if you are working with a dataset consisting of hundreds of rows and columns you are probably doing something that involves processing a relatively large amount of data, and even if CSV works fine for the dataset you are using initially it might not work if you suddenly find that you need to scale up to thousands of rows and columns. It is often better to just use a more flexible format to start with. Once you are used to it, HDF5 (and similar formats) are not harder to use than CSV.
Secondly, formats such as HDF5 are much more flexible when it comes to structuring the data; this is especially important if you e.g. find that you suddenly need to add another "dimension" and run calculations (or in our case measurements) vs. another variable. This can be done using CSV as well, simply by saving multiple files, but processing thousands of CSV files, even if each one is relatively small (say a few MB in size), gets very slow. Moreover, HDF5 is way better for metadata, so used right it is much easier to understand the content of an old file even if the documentation has been lost.
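As a rough illustration of those points, here is a minimal sketch using the `h5py` package (assumed to be installed). The dataset name `table`, the column labels and the random stand-in data are all invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "grid.h5")

with h5py.File(path, "w") as f:
    # Resizable along axis 0 so rows can be appended as they are computed.
    dset = f.create_dataset(
        "table", shape=(0, 3), maxshape=(None, 3), dtype="f8", chunks=(1024, 3)
    )
    dset.attrs["columns"] = ["X0", "X1", "f"]  # metadata travels with the data
    for _ in range(5):
        block = np.random.rand(1000, 3)  # stand-in for real results
        n0 = dset.shape[0]
        dset.resize(n0 + len(block), axis=0)
        dset[n0:] = block

with h5py.File(path, "r") as f:
    part = f["table"][2000:2010]  # reads only these ten rows from disk
    columns = list(f["table"].attrs["columns"])

print(part.shape, len(columns))  # (10, 3) 3
```

Adding another "dimension" later would then be a matter of creating further datasets or groups in the same file rather than managing thousands of separate files.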
