Can I calculate the covariance matrix of a large set of data?

Discussion Overview

The discussion revolves around the calculation of a covariance matrix for a large dataset in Matlab, specifically addressing the challenges posed by memory limitations when dealing with a stochastic process consisting of 211,302 observations and 50 variables. Participants explore potential approaches to circumvent these limitations and clarify the dimensions involved in the covariance matrix calculation.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant encounters a memory error when attempting to compute a 211,302x211,302 covariance matrix and seeks alternatives.
  • Another participant questions the necessity of the full covariance matrix and suggests that estimating specific elements may be more efficient.
  • A participant points out that a 50x50 covariance matrix for 211,302 realizations would be more reasonable than the proposed dimensions.
  • Clarification is provided that the original poster has 50 realizations of a stochastic process with 200,000 observations each, indicating a need to adjust the approach.
  • One participant expresses concern over the terminology used, suggesting that "50 realizations" may be misleading and that the dimensions of the covariance matrix should indeed be 50x50.
  • A participant shares their understanding of the data structure, emphasizing that each wind prediction is a realization measured at multiple locations.

Areas of Agreement / Disagreement

Participants generally agree that the dimensions of the covariance matrix may have been misinterpreted, leaning towards a 50x50 matrix rather than the initially proposed dimensions. However, there remains disagreement on the necessity of calculating the full covariance matrix versus estimating specific elements.

Contextual Notes

There are unresolved assumptions regarding the data structure and the implications of the covariance matrix's dimensions on the analysis. The discussion does not clarify the specific end goals of the covariance matrix calculation, which may influence the approach taken.

Frank Einstein
TL;DR
I want to calculate the covariance matrix of a large set of data. However, I get an error telling me that the matrix would be too big and therefore it cannot be computed.
Hello everyone. I want to calculate the covariance matrix of a stochastic process using Matlab as

[CODE lang="matlab" title="Covariance matrix"]cov(listOfUVValues)

[/CODE]

where listOfUVValues has dimensions 211302×50. I get the following error:

[CODE title="Error"]Requested 211302x211302 (332.7GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become
unresponsive. See array size limit or preference panel for more information.

Error in cov (line 156)
c = (xc' * xc) ./ denom;
[/CODE]

Is there a way to get around this limitation, or is it impossible?

Any answer is appreciated.

Best regards.
 
There is no getting around the fact that you are asking for a matrix that will need a ton of memory. What do you need the covariance matrix for?

If there is some other end goal then there may be a better approach that bypasses the need to compute the covariance matrix at all. I'm not sure how much information you would even get from a 211302x211302 matrix that has a rank of at most 50.

If you just need a few elements of the covariance matrix, then you can estimate those directly.
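As a minimal sketch of estimating individual elements directly: the array `X`, its reduced size, and the column indices below are hypothetical stand-ins, not the original poster's data. The sketch assumes rows are observations and columns are variables, which is the layout `cov` expects.

```matlab
% Hypothetical stand-in for the data: 50 rows (observations) by 2000 columns
% (variables). The array in the thread would have 211302 columns instead.
rng(0);
X = randn(50, 2000);

% Pick the pair of columns (variables) whose covariance is wanted.
i = 3;
j = 17;

% Sample covariance of columns i and j, normalised by N-1 as cov does.
n   = size(X, 1);
xi  = X(:, i) - mean(X(:, i));
xj  = X(:, j) - mean(X(:, j));
cij = (xi' * xj) / (n - 1);

% A block of the matrix for a small subset of columns is just as cheap.
idx  = [3 17 42];
Csub = cov(X(:, idx));   % 3x3 result instead of 2000x2000
```

Each element costs only one pass over two columns, so memory use stays proportional to the data rather than to the square of the number of variables.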

jason
 
Are you using the correct dimensions to represent each realisation? A 50x50 covariance matrix for 211k realisations sounds much more reasonable than vice versa.
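A quick way to check this, sketched under assumptions: `cov(A)` treats the rows of `A` as observations and the columns as variables, and returns a square matrix with one row per column of `A`. The 211302x211302 size in the error message therefore suggests the array passed to `cov` had 211302 columns. The random data and the reduced column count below are placeholders.

```matlab
% Hypothetical stand-in with the layout implied by the error message:
% 50 rows and many columns (211302 in the thread, 1000 here).
rng(0);
listOfUVValues = randn(50, 1000);

% Inspect the layout before calling cov.
[nRows, nCols] = size(listOfUVValues);
fprintf('%d rows (observations), %d columns (variables)\n', nRows, nCols);

% cov(listOfUVValues) would be nCols-by-nCols: 1000x1000 here,
% 211302x211302 for the real data.
% Transposing makes the 50 series the variables, giving a 50x50 result.
C = cov(listOfUVValues.');
disp(size(C));
```

Whether the transposed call yields the matrix that is actually wanted depends on which quantity is regarded as the variable, which the thread leaves open.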
 
Yes, I have 50 realizations of a stochastic process, 50 variables and 200k observations of each. I am trying to calculate the covariance between the wind speed in the X and Y directions using data from the ECMWF. I guess I will have to limit the region or the resolution.

Thanks anyway for your comments.
 
IMO, "50 realizations" is a misleading phrase. I interpret that phrase as 50 observations, each with a certain number of attributes (variables) recorded.
I think that you have your dimensions switched and, as @Orodruin suggested, your covariance matrix should be 50x50.
 
I am following this thread. I have 50 wind predictions, each measured at 200k places; thus, each wind prediction is a realization of a random variable. I don't know if that helps.
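If the covariance across locations really is the goal, the earlier remark about the rank suggests one possible workaround: keep the matrix in factored form and never build it in full. The following is only a sketch under assumptions. `X` is hypothetical random data with 50 rows (predictions) and a reduced number of columns (places), and the approach relies on the standard fact that `Xc'*Xc` and `Xc*Xc'` share their nonzero eigenvalues.

```matlab
% Hypothetical stand-in: n = 50 predictions (rows) at p = 2000 places (columns).
rng(0);
n = 50;
p = 2000;
X = randn(n, p);

% Centre each column (implicit expansion needs R2016b or later).
Xc = X - mean(X, 1);

% Small n-by-n Gram matrix; its nonzero eigenvalues match those of the
% p-by-p covariance matrix Xc'*Xc/(n-1).
G = (Xc * Xc') / (n - 1);
G = (G + G') / 2;                      % remove round-off asymmetry
[U, D] = eig(G);
[lambda, order] = sort(diag(D), 'descend');
U = U(:, order);

% Centring removes one degree of freedom, so at most n-1 eigenvalues are nonzero.
k      = n - 1;
lambda = lambda(1:k);
U      = U(:, 1:k);

% Map to unit-norm eigenvectors of the full covariance matrix (p-by-k).
V = (Xc' * U) ./ sqrt((n - 1) * lambda');

% Any single element can be rebuilt on demand, here element (3, 17).
c_3_17 = V(3, :) * (lambda .* V(17, :)');
```

Storage is p-by-k for `V` plus k eigenvalues, so it grows linearly with the number of places instead of quadratically.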
 
