Python How to store a large data-set in a "grid" in Python/NumPy

  • Thread starter: ergospherical
  • Tags: Grid, Numpy, Python
AI Thread Summary
The discussion centers on efficiently generating and storing large n-dimensional grids in Python using NumPy. Participants emphasize the importance of considering data sparsity, suggesting that not all zero data needs to be stored. For large datasets, a numpy.array is recommended, which can be saved as a CSV for portability, though CSV is deemed inefficient for larger datasets. Alternatives like HDF5 are suggested for better performance and flexibility, especially when dealing with extensive data and metadata. There are concerns about memory limitations, particularly in environments like Google Colab, prompting suggestions to write data sequentially to files or use sparse arrays to manage size. The conversation highlights the trade-offs between convenience and efficiency in data storage formats, advocating for more robust solutions as dataset sizes increase.
ergospherical
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.

Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
 
An array of one thousand by one thousand is a mega-element array. Stack overflow is probable.
ergospherical said:
What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
That will depend on how sparse the data is in the grid. Maybe you do not need to store all that zero data.

Can you arrange the records needed together, to be read together from storage?
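To make the sparsity point concrete, here is a minimal sketch that stores only the non-zero grid points in a plain dictionary keyed by index tuple. The class name `SparseGrid` and its parameters `shape` and `default` are invented for this illustration and are not from any library; it assumes that most grid points really do hold the default value.

```python
import math


class SparseGrid:
    """Hypothetical helper: n-dimensional grid that stores only non-default values."""

    def __init__(self, shape, default=0.0):
        self.shape = tuple(shape)
        self.default = default
        self._data = {}  # maps index tuple -> value

    def _check(self, index):
        """Validate an index tuple against the logical grid shape."""
        index = tuple(int(i) for i in index)
        if len(index) != len(self.shape) or any(
            not 0 <= i < n for i, n in zip(index, self.shape)
        ):
            raise IndexError(f"index {index} outside grid of shape {self.shape}")
        return index

    def __setitem__(self, index, value):
        index = self._check(index)
        if value == self.default:
            self._data.pop(index, None)  # never store the default explicitly
        else:
            self._data[index] = value

    def __getitem__(self, index):
        return self._data.get(self._check(index), self.default)

    def stored(self):
        """Number of grid points that actually occupy memory."""
        return len(self._data)


grid = SparseGrid(shape=(1000, 1000, 1000))  # a billion logical grid points
grid[3, 14, 159] = 2.5
grid[999, 0, 0] = -1.0

# For comparison: bytes a dense float64 array of the same shape would need
dense_bytes = math.prod(grid.shape) * 8
```

Only the two assigned points are held in memory, whereas the dense equivalent would need 8 bytes for every one of the billion points. If the data is mostly non-zero, this approach costs more than a dense array, so it only pays off when the grid is genuinely sparse.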
 
ergospherical said:
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.
Unless the values of each ## X_i ## are restricted to a finite set of integers I don't see the analogy to a grid or an nD matrix. Instead I see a 2D matrix with k rows (k is the number of observations) and n + 1 columns [X0, X1, ..., Xn-1, f(X)] (note I have renumbered Xi to start at zero, as is the convention for almost all computing apart from Fortran).

ergospherical said:
Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy?

To generate the grid, and then be able to read sections of the grid efficiently?
Assuming it will all fit in memory, almost certainly a numpy.array. This can be stored on disk as a CSV file which is probably better for portability than anything else.
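As a rough sketch of that suggestion, the snippet below builds a small grid with `np.meshgrid`, flattens it into the k × (n + 1) table described above, and round-trips it through a CSV file. The function `f`, the three axes and the file name `grid.csv` are placeholders chosen for the example, not anything from the thread.

```python
import os
import tempfile

import numpy as np


def f(x0, x1, x2):
    """Placeholder for the real function of the parameters."""
    return x0**2 + np.sin(x1) * x2


# Assumed parameter axes; the real ones would have hundreds or thousands of values
axes = [np.linspace(0.0, 1.0, 5), np.linspace(0.0, np.pi, 4), np.arange(3.0)]

# n-dimensional view: values[i, j, k] == f(axes[0][i], axes[1][j], axes[2][k])
mesh = np.meshgrid(*axes, indexing="ij")
values = f(*mesh)

# Flat view: k rows and n + 1 columns [X0, X1, X2, f(X)]
table = np.column_stack([m.ravel() for m in mesh] + [values.ravel()])

path = os.path.join(tempfile.mkdtemp(), "grid.csv")
np.savetxt(path, table, delimiter=",", header="x0,x1,x2,f", comments="")

# Read the file back and reshape the last column into the n-dimensional grid
loaded = np.loadtxt(path, delimiter=",", skiprows=1)
restored = loaded[:, -1].reshape(values.shape)
```

With `indexing="ij"` the array axes follow the order of the parameters, so `values[i, j, k]` lines up with the i-th, j-th and k-th axis values, and the flattened rows can be reshaped back without reordering.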

If 8k(n + 1) bytes (8 bytes per float64 value) exceeds the available memory, then you could try an SQL database (with indexes on all Xi columns), but if this is too slow it's going to be difficult.
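A quick worked example of the estimate 8k(n + 1), assuming every entry is an 8-byte float64. The helper name `table_bytes` and the choice of 3 parameters with 1000 values each are purely illustrative.

```python
def table_bytes(k, n, itemsize=8):
    """Bytes needed for a k x (n + 1) table of `itemsize`-byte numbers."""
    return itemsize * k * (n + 1)


# Assumed example: n = 3 parameters with 1000 values each, so k = 1000**3 rows
n = 3
k = 1000**n

needed = table_bytes(k, n)      # coordinates and values stored explicitly
needed_gib = needed / 2**30     # the same figure in GiB

# If the points lie on a regular grid, the coordinates are implied by the
# array index and only the k values of f need to be stored.
values_only = 8 * k
```

Under these assumptions the flat table needs 32 GB (about 29.8 GiB), while an n-dimensional array holding only the values of f needs a quarter of that, because the n coordinate columns are no longer stored per row.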
 
Thousands of values in each direction is, by modern standards, not large at all. A Numpy array can easily handle that.
If you want a more efficient file format than CSV, you can have a look at HDF5: not quite as portable, but widely supported and much more efficient.
 
Is there a good way of reducing the RAM usage if I'm writing to a large numpy array? I've followed @pbuk's suggestion so far, writing to a 2d numpy array of size ##k \times (n+1)##. But I'm also using Google Colab and there's only 12.7GB of system RAM which is quickly used up, causing it to crash. Is there a way to write each row to a CSV file sequentially, instead of waiting for the entire numpy array to populate first?
 
Where is the data coming from? Can you create it a row at a time and append to a file?
 
Yeah, that works actually.
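For anyone reading along, a minimal sketch of the row-at-a-time approach using only the standard library. The function `f`, the tiny axes and the file name `grid_rows.csv` are stand-ins made up for the example.

```python
import csv
import itertools
import os
import tempfile


def f(x):
    """Placeholder for the real (expensive) function of one grid point."""
    return sum(xi * xi for xi in x)


# Assumed small axes; itertools.product yields grid points lazily, one at a time
axes = [range(4), range(3), range(2)]
path = os.path.join(tempfile.mkdtemp(), "grid_rows.csv")

rows_written = 0
with open(path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["x0", "x1", "x2", "f"])
    for x in itertools.product(*axes):
        writer.writerow([*x, f(x)])  # only this one row is built in memory
        rows_written += 1
```

Because `itertools.product` is a generator and each row is written as soon as it is computed, memory use stays roughly constant no matter how many grid points there are.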
 
Of course that does indicate that you are going to have problems reading it all back in again, at least in Google Colab. Options include processing in batches and transferring to a local workstation with a bit more oomph.
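One way the batch processing could look, sketched with the standard `csv` module and NumPy. The generator name `read_batches`, the batch size and the demo file layout (columns x0, x1, f) are assumptions for illustration only.

```python
import csv
import itertools
import os
import tempfile

import numpy as np


def read_batches(path, batch_size):
    """Yield the CSV body as float arrays of at most `batch_size` rows each."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        while True:
            chunk = list(itertools.islice(reader, batch_size))
            if not chunk:
                break
            yield np.array(chunk, dtype=float)


# Build a small demo file (assumed layout: columns x0, x1, f)
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["x0", "x1", "f"])
    for i in range(10):
        writer.writerow([i, 2 * i, i * i])

# Running reduction: the maximum of f, without ever holding all rows at once
f_max = -np.inf
batch_sizes = []
for batch in read_batches(path, batch_size=4):
    batch_sizes.append(len(batch))
    f_max = max(f_max, batch[:, -1].max())
```

This only helps when the computation can be expressed as a reduction or a per-batch operation; anything that needs random access to the whole grid still requires either enough RAM or a format that supports partial reads.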
 
Yeah, I think I’m going to give up on Colab. I was only using it because it’s convenient to share progress with the rest of the team but that’s less important…
 
  • #10
If your data can use sparse arrays, the size problem will disappear.

Alternatively, maintain contact only with the rows and columns of the disk file that are needed at that time. Do that with two smaller arrays in memory, keep them smaller than the CPU data cache.
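A sketch of the second idea using NumPy's memory-mapped `.npy` files, which let you touch only the part of the on-disk array you currently need. This swaps the CSV file for NumPy's binary format; the grid shape, the fill values and the file name `grid.npy` are assumptions for the demo.

```python
import os
import tempfile

import numpy as np

shape = (200, 300, 40)  # assumed grid shape, kept small for the demo
path = os.path.join(tempfile.mkdtemp(), "grid.npy")

# Create the .npy file on disk and fill it one slab at a time
grid = np.lib.format.open_memmap(path, mode="w+", dtype=np.float64, shape=shape)
for i in range(shape[0]):
    grid[i] = i  # stand-in for the real f(X) on slab i
grid.flush()
del grid

# Later: map the file read-only and pull out just one section
grid = np.load(path, mmap_mode="r")
section = np.array(grid[10:12, 100:103, 5])  # copies only these 6 values into RAM
```

The operating system pages data in on demand, so the full array never has to fit in RAM. Slices along the first axis are contiguous on disk in the default C order and are therefore the cheapest to read.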
 
  • #11
Is there a reason why you need to use CSV? It is very, very inefficient.
Many different file formats have been developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by Matlab and Python), but there are others.
CSV is very convenient for small datasets, but should in my view not be used for anything larger than a few hundred rows/columns.
 
  • #13
f95toli said:
Is there a reason why you need to use CSV? It is very, very inefficient.
Many different file formats have been developed for handling large datasets; the above-mentioned HDF5 is what I am used to (we use it because it is really well supported by Matlab and Python), but there are others.
CSV is very convenient for small datasets, but should in my view not be used for anything larger than a few hundred rows/columns.
CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
 
  • #14
pbuk said:
If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have certainly used CSV and reading the entire file into memory for datasets with much more than a few hundred rows (I don't know that my column count has ever gotten that high). However, the computations I was doing on the data were fairly simple and didn't require a lot of additional objects that would take up a lot of additional memory. The latter might not be the case for the kinds of computations the OP is doing.
 
  • #15
pbuk said:
CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have also used CSV for larger datasets. However, these days we try to avoid using it as a format.
There are a couple of reasons. Firstly, if you are working with a dataset consisting of hundreds of rows and columns, you are probably doing something that involves processing a relatively large amount of data, and even if CSV works fine for the dataset you are using initially, it might not work if you suddenly find that you need to scale up to thousands of rows and columns. It is often better to just use a more flexible format to start with. Once you are used to them, HDF5 and similar formats are no harder to use than CSV.
Secondly, formats such as HDF5 are much more flexible when it comes to structuring the data. This is especially important if you, e.g., find that you suddenly need to add another "dimension" and run calculations (or in our case measurements) vs. another variable. This can be done using CSV as well, simply by saving multiple files, but processing thousands of CSV files, even if each one is relatively small (say, a few MB in size), gets very slow. Moreover, HDF5 is way better for metadata, so used right it is much easier to understand the content of an old file even if the documentation has been lost.
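To give a flavour of what this looks like in Python, here is a minimal sketch using the `h5py` package (assumed to be installed). The dataset names `f` and `axes/x0`, the attribute text, the axes and the stand-in for the computed values are all invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed parameter axes for a 3-parameter grid
axes = [np.linspace(0, 1, 50), np.linspace(0, 2, 40), np.linspace(0, 3, 30)]
shape = tuple(len(a) for a in axes)
path = os.path.join(tempfile.mkdtemp(), "grid.h5")

with h5py.File(path, "w") as h5:
    # Chunked, compressed n-dimensional dataset for the values of f
    dset = h5.create_dataset(
        "f", shape=shape, dtype="f8", chunks=True, compression="gzip"
    )
    dset.attrs["description"] = "f(X) sampled on a regular grid"  # metadata

    # Store each axis alongside the data so the file is self-describing
    for n, axis in enumerate(axes):
        h5.create_dataset(f"axes/x{n}", data=axis)

    # Fill the dataset one slab at a time, so only one slab is in memory
    for i, x0 in enumerate(axes[0]):
        dset[i, :, :] = x0 + np.add.outer(axes[1], axes[2])  # stand-in for f

with h5py.File(path, "r") as h5:
    section = h5["f"][10:12, 5:8, 0]  # reads only this section from disk
    description = h5["f"].attrs["description"]
    x1 = h5["axes/x1"][:]
```

The file carries its own axes and description, and slicing the dataset reads only the requested section from disk, which covers both the metadata point and the original question about reading parts of the grid efficiently.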
 
