How to store a large data-set in a "grid" in Python/NumPy

  • Context: Python 
  • Thread starter: ergospherical
  • Tags: Grid, Numpy, Python

Discussion Overview

The discussion revolves around the best methods for storing large datasets in a grid format using Python and NumPy. Participants explore various approaches to efficiently generate, store, and retrieve data from potentially large n-dimensional matrices, considering both memory limitations and file formats.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants propose using a numpy.array to store the grid if it fits in memory, suggesting CSV for portability.
  • Others argue that CSV is inefficient for large datasets and recommend HDF5 for better performance and flexibility.
  • Concerns are raised about memory usage when writing to large numpy arrays, with suggestions to write data sequentially to a file to avoid crashes in environments with limited RAM.
  • Some participants mention the potential for using sparse arrays to mitigate size issues.
  • There is a discussion about the trade-offs between using CSV and other formats, with some emphasizing CSV's resilience against crashes while others highlight its limitations for larger datasets.
  • Participants note that the choice of file format may depend on the nature of the data and the computations being performed.

Areas of Agreement / Disagreement

Participants express differing views on the efficiency and appropriateness of CSV versus HDF5 for large datasets. There is no consensus on a single best approach, as opinions vary based on specific use cases and experiences.

Contextual Notes

Some participants highlight the importance of considering the sparsity of data and the potential need for batch processing when dealing with large datasets. Limitations regarding memory and processing capabilities in specific environments, such as Google Colab, are also mentioned.

ergospherical
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.

Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
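For concreteness, a minimal sketch of the structure being asked about, assuming each parameter is sampled on its own 1-D axis (the axis sizes and the function f below are placeholders):

```python
import numpy as np

# Hypothetical 1-D sample points for two of the parameters (placeholder sizes).
x1 = np.linspace(0.0, 1.0, 1000)
x2 = np.linspace(0.0, 1.0, 1000)

def f(a, b):
    # Placeholder function of the parameters.
    return np.sin(a) * np.cos(b)

# Broadcasting evaluates f at every grid-point without explicit loops;
# the result is a 2-D "matrix" with one value per (x1, x2) combination.
grid = f(x1[:, None], x2[None, :])   # shape (1000, 1000)
```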
 
An array of one thousand by one thousand is a mega-element array; stack overflow is probable.
ergospherical said:
What's the best apparatus to achieve this in Python/NumPy? To generate the grid, and then be able to read sections of the grid efficiently?
That will depend on how sparse the data is in the grid. Maybe you do not need to store all that zero data.

Can you arrange the records needed together, to be read together from storage?
 
ergospherical said:
Say I have n parameters X1, X2, ..., Xn, and to each "grid-point" X = (X1, X2, ..., Xn) I want to associate a value f(X).
So in the end I will have a structure like an n-dimensional matrix.
Unless the values of each ## X_n ## are restricted to a finite set of integers, I don't see the analogy to a grid or an nD matrix. Instead I see a 2D matrix with k rows (k is the number of observations) and n + 1 columns [X0, X1, ..., Xn-1, f(X)] (note I have renumbered the Xi to start at zero, as is the convention in almost all computing apart from Fortran).

ergospherical said:
Assume this data-set is going to be very large, i.e. each parameter Xi ranges over hundreds or thousands of values. I suppose I'd want to store this grid in a separate file once it is generated.

What's the best apparatus to achieve this in Python/NumPy?

To generate the grid, and then be able to read sections of the grid efficiently?
Assuming it will all fit in memory, almost certainly a numpy.array. This can be stored on disk as a CSV file, which is probably better for portability than anything else.

If ##8k(n + 1)## bytes > available memory then you could try an SQL database (with indexes on all the Xi columns), but if that is too slow it's going to be difficult.
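A sketch of that layout, assuming the k observations are already available as NumPy arrays (the names and sizes here are placeholders):

```python
import numpy as np

k, n = 10_000, 3                      # placeholder sizes
rng = np.random.default_rng(0)
X = rng.random((k, n))                # k observations of the n parameters
fX = X.sum(axis=1)                    # placeholder for f(X)

# One row per observation: [X0, X1, ..., Xn-1, f(X)], shape (k, n + 1).
table = np.column_stack([X, fX])

# CSV for portability, as suggested above.
np.savetxt("grid.csv", table, delimiter=",")
table_back = np.loadtxt("grid.csv", delimiter=",")
```

With float64 values each entry takes 8 bytes, which is where the ##8k(n + 1)## memory estimate above comes from.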
 
Thousands of values in each direction is, by modern standards, not large at all. A NumPy array can easily handle that.
If you want a more efficient file format than CSV, have a look at HDF5: not quite as portable, but widely supported and much more efficient.
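A sketch of the HDF5 route using h5py (one common Python binding; the file and dataset names are placeholders). Chunked datasets let you read back just a slice without loading the whole grid:

```python
import numpy as np
import h5py

data = np.random.random((2000, 2000))          # placeholder grid

# Write once; chunking (and optional compression) helps with partial reads.
with h5py.File("grid.h5", "w") as fh:
    fh.create_dataset("f", data=data, chunks=True, compression="gzip")

# Later, read only the section that is needed.
with h5py.File("grid.h5", "r") as fh:
    block = fh["f"][100:200, 500:600]          # loads just this slice
```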
 
Is there a good way of reducing the RAM usage if I'm writing to a large numpy array? I've followed @pbuk's suggestion so far, writing to a 2D numpy array of size ##k \times (n+1)##. But I'm also using Google Colab, and there's only 12.7 GB of system RAM, which is quickly used up, causing it to crash. Is there a way to write each row to a CSV file sequentially, instead of waiting for the entire numpy array to populate first?
 
Where is the data coming from? Can you create it a row at a time and append to a file?
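A sketch of that row-at-a-time approach, so the full array never has to exist in memory (the row generator below is a placeholder):

```python
import csv
import numpy as np

def make_row(i):
    # Placeholder: compute [X0, ..., Xn-1, f(X)] for the i-th grid-point.
    x = np.random.default_rng(i).random(3)
    return [*x, x.sum()]

with open("grid.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    for i in range(1_000_000):
        writer.writerow(make_row(i))   # only one row is held in RAM at a time
```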
 
Yeah, that works actually.
 
Of course that does indicate that you are going to have problems reading it all back in again, at least in Google Colab. Options include processing in batches and transferring to a local workstation with a bit more oomph.
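A sketch of the batch-processing option, reading the CSV back in chunks (pandas is not mentioned above, but its chunked reader keeps each batch small):

```python
import pandas as pd

# Process the file in batches of 100,000 rows instead of loading it all at once.
for chunk in pd.read_csv("grid.csv", header=None, chunksize=100_000):
    block = chunk.to_numpy()          # shape (<= 100000, n + 1)
    # ... run the per-batch computation here (placeholder) ...
```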
 
Yeah, I think I’m going to give up on Colab. I was only using it because it’s convenient to share progress with the rest of the team but that’s less important…
 
If your data can use sparse arrays, the size problem will disappear.

Alternatively, keep in memory only the rows and columns of the disk file that are needed at the time. Do that with two smaller arrays in memory, kept smaller than the CPU data cache.
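A sketch of the sparse-array idea using scipy.sparse for a 2-D grid (higher-dimensional sparse arrays need other packages); the coordinates and values below are placeholders:

```python
from scipy import sparse

# Build the matrix directly from (row, col, value) triplets so the dense
# grid never has to exist in memory.
rows = [42, 7, 999]
cols = [7, 42, 0]
vals = [3.14, 2.71, 1.62]
s = sparse.coo_matrix((vals, (rows, cols)), shape=(10_000, 10_000))

sparse.save_npz("grid_sparse.npz", s)             # stores only the non-zeros
s_back = sparse.load_npz("grid_sparse.npz").tocsr()
value = s_back[42, 7]                             # 3.14
```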
 
Is there a reason why you need to use CSV? It is very, very inefficient.
There are many different file formats that were developed for handling large datasets; the above-mentioned HDF5 is the one I am used to (we use it because it is really well supported by Matlab and Python), but there are others.
CSV is very convenient for small datasets, but in my view it should not be used for anything larger than a few hundred rows/columns.
 
f95toli said:
Is there a reason why you need to use CSV? It is very, very inefficient.
There are many different file formats that were developed for handling large datasets; the above-mentioned HDF5 is the one I am used to (we use it because it is really well supported by Matlab and Python), but there are others.
CSV is very convenient for small datasets, but in my view it should not be used for anything larger than a few hundred rows/columns.
CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
 
pbuk said:
If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have certainly used CSV and reading the entire file into memory for datasets with much more than a few hundred rows (I don't know that my column count has ever gotten that high). However, the computations I was doing on the data were fairly simple and didn't require a lot of additional objects that would take up a lot of additional memory. The latter might not be the case for the kinds of computations the OP is doing.
 
pbuk said:
CSV has a significant advantage over more structured data formats: it doesn't break when you crash. If you are doing all your computations in memory you don't need an efficient way of retrieving data from disk and so it is IMHO suitable for anything that is intended to fit in memory, much more than even a few hundred rows and columns!
I have also used CSV for larger datasets. However, these days we try to avoid using it as a format.
There are a couple of reasons. Firstly, if you are working with a dataset consisting of hundreds of rows and columns you are probably doing something that involves processing relatively large amounts of data, and even if CSV works fine for the dataset you start with, it might not work if you suddenly find that you need to scale to thousands of rows and columns. It is often better to just use a more flexible format from the start. Once you are used to it, HDF5 (and similar formats) are no harder to use than CSV.
Secondly, formats such as HDF5 are much more flexible when it comes to structuring the data. This is especially important if you, say, find that you suddenly need to add another "dimension" and run calculations (or, in our case, measurements) against another variable. This can be done with CSV as well, simply by saving multiple files; but processing thousands of CSV files, even if each one is relatively small (say a few MB in size), gets very slow. Moreover, HDF5 is much better for metadata, so used right it is much easier to understand the contents of an old file even if the documentation has been lost.
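A sketch of the metadata point using h5py attributes (all names here are placeholders); the description travels inside the file with the data:

```python
import numpy as np
import h5py

with h5py.File("results.h5", "w") as fh:
    dset = fh.create_dataset("f", data=np.random.random((100, 100)))
    # Attach metadata directly to the dataset so the file is self-describing.
    dset.attrs["parameters"] = "X0, X1"
    dset.attrs["units"] = "arbitrary"
    dset.attrs["description"] = "f(X) evaluated on the parameter grid"
```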
 
