Dimensions of Covariance matrix (multiple observations)

SUMMARY

The discussion centers on the computation of the sample covariance matrix using MATLAB's cov function. Participants clarify that the covariance matrix is defined as the expected value of the product of deviations from the mean, represented mathematically as \(\mathrm{E}[(\vec{X_i} - \vec{\mu_i})^t*(\vec{X_j} - \vec{\mu_j})]\). It is established that the sample covariance matrix differs from the population covariance matrix, particularly in the divisor used (m vs. m-1). The conversation emphasizes the importance of distinguishing between sample and population statistics when discussing covariance.

PREREQUISITES
  • Understanding of covariance matrices and their definitions
  • Familiarity with MATLAB and its cov function
  • Knowledge of statistical concepts such as sample vs. population statistics
  • Basic linear algebra, particularly vector operations
NEXT STEPS
  • Review MATLAB's documentation on the cov function for detailed implementation
  • Study the differences between sample covariance and population covariance
  • Learn about the mathematical derivation of covariance matrices
  • Explore statistical resources on estimators for variance and covariance
USEFUL FOR

Statisticians, data analysts, and MATLAB users who are working with covariance matrices and require clarity on the differences between sample and population statistics.

Tilde90
Suppose we have an m×n matrix, where each row is an observation and each column is a variable. The (i,j)-element of its covariance matrix is \mathrm{E}\!\left[(\vec{X_i} - \vec{\mu_i})^t (\vec{X_j} - \vec{\mu_j})\right], where \vec{X_i} is the column vector of observations of the i-th variable and \vec{\mu_i} is the corresponding mean vector, i.e. a vector whose every element is the mean of that variable computed from its observations. Hence, the covariance matrix is a symmetric n×n matrix.

Is the argument correct? Thank you for answering me.

Another question: could we just replace the expectation "E" with division by the total number of observations, m?
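
For concreteness, this is roughly what I have in mind in MATLAB (just a sketch with made-up data; the variable names and the divisor m come from my second question and may well be wrong):

```matlab
% Sketch of the (i,j)-element described above. The divisor m is my
% assumption from the second question; it may need to be m-1 instead.
X = randn(100, 3);             % m = 100 observations of n = 3 variables
[m, n] = size(X);
i = 1; j = 2;
Xi = X(:, i);                  % observations of variable i, as a column vector
Xj = X(:, j);
mui = repmat(mean(Xi), m, 1);  % mean vector: the mean of variable i, repeated m times
muj = repmat(mean(Xj), m, 1);
cij = (Xi - mui)' * (Xj - muj) / m   % proposed (i,j)-element
```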
 
Tilde90 said:
Is the argument correct?

The argument isn't clear. You don't state which of your claims are justified by a definition. An experienced person would understand that your final step is appealing to the commutative law of multiplication, but you didn't say this, and you didn't really mention the (j,i) element of the matrix at all.
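
Spelled out element-wise (my notation, with x_{ki} denoting the k-th observation of the i-th variable), the step you seem to be relying on is

\sum_{k=1}^{m} (x_{ki} - \mu_i)(x_{kj} - \mu_j) \;=\; \sum_{k=1}^{m} (x_{kj} - \mu_j)(x_{ki} - \mu_i),

which is what makes the (i,j) and (j,i) entries equal.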

Another question: could we just replace the expectation "E" with division by the total number of observations, m?

In the first place, a "sample covariance matrix" and "a covariance matrix for n random variables" are two different things. Things that are "sample...whatevers" are computed from data. Things that are "population...whatevers" or "whatevers...of a random variable" are computed from the formulae of the distributions of the random variables that are involved. To add to the confusion, sometimes the sample values are considered as random variables themselves, so the "sample...whatever" becomes a random variable in its own right. You must explain which of these scenarios you are dealing with.

As an example, the "variance of a distribution" is not defined to be the same thing as "the sample variance" even if the sample is taken from that distribution. And the "sample variance" is not necessarily the same as an "estimator of the population variance" or "an estimator of the variance of the distribution".
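
To make the distinction concrete (standard formulas, with one common convention chosen for the sample variance):

\sigma^2 = \mathrm{E}\!\left[(X - \mu)^2\right] \quad \text{(variance of the distribution)}, \qquad s^2 = \frac{1}{m-1}\sum_{k=1}^{m} (x_k - \bar{x})^2 \quad \text{(sample variance, one common convention)}.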

The "E" symbol is generally used when we are dealing with random variables, not with a fixed set of numbers. The "E" symbol by iself is often ambiguous unless it is clear to the reader what probability distribution is being used to compute "E" with. Often you see "E" with a subscripted symbol attached to it to clarify this. If you are thinking of your data as random variables, then I think using "E" makes sense. If you are thinking of your data as a set of fixed numbers, it doesn't.

My guess is that you are trying to prove something about the covariance matrix "of n random variables" by imagining that you have a fixed set of data and phrasing your argument in terms of the "sample covariance matrix". If that's what you are attempting, it isn't a valid way of doing the proof.
 
Thank you Stephen for your answer. I will try to make my point clear.

My problem is with the command "cov" in MATLAB, which allows a user to obtain a covariance matrix given a sample matrix of observations and variables. According to the documentation (http://www.mathworks.ch/ch/help/matlab/ref/cov.html), "for matrix input X, where each row is an observation, and each column is a variable, cov(X) is the covariance matrix."

In order to understand this, I looked at Wikipedia (http://en.wikipedia.org/wiki/Covariance_matrix#Definition). Avoiding the use of the "E" symbol, which is actually incorrect in this context, as we are dealing with sample matrices (I apologise for my previous message), could we say that the (i,j)-element of the sample covariance matrix cov(X) is [(\vec{X_i} - \vec{\mu_i})^t (\vec{X_j} - \vec{\mu_j})]/m? And, if this is not correct, what is the definition of the covariance matrix, given a matrix of sample observations of different variables?
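
In matrix form, this is what I would compute in MATLAB (a sketch with placeholder data; whether the right divisor is m is exactly my question):

```matlab
% Proposed sample covariance matrix, written out explicitly.
% The divisor m is my assumption; the question is whether cov uses it.
X = randn(50, 4);                  % rows = observations, columns = variables
[m, n] = size(X);
Xc = X - repmat(mean(X), m, 1);    % subtract each column's mean
C = (Xc' * Xc) / m;                % (i,j)-element is the formula above
```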
 
Only MATLAB documentation can tell you exactly what MATLAB does! You need some sort of summation signs to express your thoughts in mathematics. (Perhaps the summations are implicit in the language of MATLAB?)

I think you generally have the right idea, but there are some ambiguities in what people call a "sample covariance". You can also find threads on the forum where people argue about the definition of "sample variance".

The situation is this: there are various formulae for "estimators" of the variance of a distribution as a function of sample data. The situation is similar for covariances. The commonly used formulae differ in the divisor they use (m vs m-1 vs m+1). (I scanned the article http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices and see it discusses n vs n-1.)

People who defend a particular estimator as "best" (and there are objective ways of defining this) often think this "proves" the same formula should be used for the sample covariance. However, from the point of view of "descriptive statistics", the goal of "sample covariance" is merely an attempt to describe data, not to estimate something. So you may find some books use one formula and some use another. I don't know what MATLAB decided to do.
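
Written out (my notation, using m where that Wikipedia article uses n, with \vec{x}_k the k-th observation as a column vector and \bar{x} the vector of column means), the two most common conventions are

\hat{\Sigma}_{m} = \frac{1}{m}\sum_{k=1}^{m} (\vec{x}_k - \bar{x})(\vec{x}_k - \bar{x})^t, \qquad \hat{\Sigma}_{m-1} = \frac{1}{m-1}\sum_{k=1}^{m} (\vec{x}_k - \bar{x})(\vec{x}_k - \bar{x})^t.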
 
Thank you for your help, Stephen. Following your link, I found this page (http://en.wikipedia.org/wiki/Sample_covariance_matrix#Sample_mean_and_covariance): the third equation, rewritten in vector form, confirms my supposition, apart from the denominator, which is (m-1) as you suggested.

I also do not know what MATLAB decided to do, and I also do not know how to find it. :) However, it is good, at least, to have an idea regarding the computation of a sample covariance matrix.
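
One way to check empirically, in case it helps someone else (just a sketch comparing cov(X) against both divisors, using random placeholder data):

```matlab
% Empirical check of which divisor MATLAB's cov uses (a sketch).
X = randn(40, 3);                     % rows = observations, columns = variables
[m, n] = size(X);
Xc = X - repmat(mean(X), m, 1);       % centred data
C_m1 = (Xc' * Xc) / (m - 1);          % divisor m-1
C_m  = (Xc' * Xc) / m;                % divisor m
disp(norm(C_m1 - cov(X)))             % ~0 if cov(X) divides by m-1
disp(norm(C_m  - cov(X)))             % ~0 if cov(X) divides by m
```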
 
