# Dimensions of Covariance matrix (multiple observations)

1. Sep 28, 2012

### Tilde90

Suppose we have an $m \times n$ matrix, where each row is an observation and each column is a variable. The $(i,j)$-element of its covariance matrix is $\mathrm{E}\left[(\vec{X_i} - \vec{\mu_i})^t (\vec{X_j} - \vec{\mu_j})\right]$, where $\vec{X_i}$ is the column vector corresponding to a variable (its elements are the observations) and $\vec{\mu_i}$ is the corresponding mean vector, formed by repeating a single element: the mean value of the variable, calculated from its observations. Hence, the covariance matrix is a symmetric $n \times n$ matrix.

Is the argument correct? Thank you for answering me.

Another question: could we just replace the "$E$ sign" with division by the total number of observations, $m$?

Last edited: Sep 28, 2012
2. Sep 29, 2012

### Stephen Tashi

The argument isn't clear. You don't state which of your claims are justified by a definition. An experienced person would understand that your final step is appealing to the commutative law of multiplication, but you didn't say this and you didn't really mention the (j,i) element of the matrix at all.

In the first place, a "sample covariance matrix" and "a covariance matrix for n random variables" are two different things. Things that are "sample....whatevers" are computed from data. Things that are "population...whatever" or "whatever ... of a random variable" are computed from the formulae of the distributions of the random variables that are involved. To add to the confusion, sometimes the sample values are considered as random variables themselves, so the "sample...whatever" becomes a random variable in its own right. You must explain which of these scenarios you are dealing with.

As an example, the "variance of a distribution" is not defined to be the same thing as "the sample variance" even if the sample is taken from that distribution. And the "sample variance" is not necessarily the same as an "estimator of the population variance" or "an estimator of the variance of the distribution".

The "E" symbol is generally used when we are dealing with random variables, not with a fixed set of numbers. The "E" symbol by itself is often ambiguous unless it is clear to the reader which probability distribution is being used to compute "E". Often you see "E" with a subscripted symbol attached to it to clarify this. If you are thinking of your data as random variables, then I think using "E" makes sense. If you are thinking of your data as a set of fixed numbers, it doesn't.

My guess is that you are trying to prove something about the covariance matrix "of n random variables" by imagining that you have a fixed set of data and phrasing your argument in terms of the "sample covariance matrix". If that's what you are attempting, it isn't a valid way of doing the proof.

3. Sep 30, 2012

### Tilde90

Thank you Stephen for your answer. I will try to make my point clear.

My problem is with the command "cov" in MATLAB, which allows a user to obtain a covariance matrix given a sample matrix of observations and variables. According to the documentation (http://www.mathworks.ch/ch/help/matlab/ref/cov.html), "for matrix input X, where each row is an observation, and each column is a variable, $cov(X)$ is the covariance matrix."

In order to understand this, I looked at Wikipedia (http://en.wikipedia.org/wiki/Covariance_matrix#Definition). Avoiding the use of the "$E$" symbol, which is actually incorrect in this context, as we are dealing with sample matrices (I apologise for my previous message), could we say that the $(i,j)$-element of the sample covariance matrix $\mathrm{cov}(X)$ is $\frac{(\vec{X_i} - \vec{\mu_i})^t (\vec{X_j} - \vec{\mu_j})}{m}$? And, if this is not correct, what is the definition of the covariance matrix, given a matrix of sample observations of different variables?
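
To make the proposed formula concrete, here is a minimal sketch in plain Python (not MATLAB, and not a statement of what `cov` does internally). The function name `sample_covariance` and the toy data are made up for illustration; the divisor is $m$, exactly as in the formula asked about above.

```python
def sample_covariance(rows):
    """Covariance matrix of data given as m rows (observations),
    each holding n values (variables). Uses the divisor m."""
    m = len(rows)
    n = len(rows[0])
    # One mean per variable (column).
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    # Centered columns: the vectors X_j - mu_j from the formula.
    centered = [[row[j] - means[j] for row in rows] for j in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Inner product of two centered columns, divided by m.
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            cov[i][j] = dot / m
    return cov


# Hypothetical data: m = 3 observations of n = 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
C = sample_covariance(data)
```

With $m = 3$ rows and $n = 2$ columns the result `C` is a $2 \times 2$ list of lists, and `C[0][1]` equals `C[1][0]` because the inner product does not depend on the order of its two arguments.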

Last edited: Sep 30, 2012
4. Sep 30, 2012

### Stephen Tashi

Only MATLAB documentation can tell you exactly what MATLAB does! You need some sort of summation signs to express your thoughts in mathematics. (Perhaps the summations are implicit in the language of MATLAB?)
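
For reference, the product of a row vector and a column vector already contains a summation over the observations. Writing $x_{ki}$ for the entry in row $k$ and column $i$ of the data matrix and $\bar{x}_i$ for the mean of column $i$ (both symbols are notation assumed here, not taken from the posts above), the product in the proposed formula expands as:

$$(\vec{X_i} - \vec{\mu_i})^t (\vec{X_j} - \vec{\mu_j}) = \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad \bar{x}_i = \frac{1}{m} \sum_{k=1}^{m} x_{ki}.$$

Swapping $i$ and $j$ leaves the right-hand side unchanged, which is where the symmetry of the matrix comes from.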

I think you generally have the right idea, but there are some ambiguities in what people call a "sample covariance". You can also find threads on the forum where people argue about the definition of "sample variance".

The situation is this: There are various different formulae for "estimators" of the variance of a distribution as a function of sample data. The situation is similar for covariances. The commonly used formulae differ by what divisor they use (m vs m-1 vs m+1). (I scanned the article http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices and see it discusses n vs n-1.)
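
As a small illustration of the divisor issue, here is a sketch using Python's standard `statistics` module, which exposes both conventions for the variance: `pvariance` divides by $m$ and `variance` divides by $m-1$. The data values are made up, and this says nothing about which convention MATLAB uses.

```python
import statistics

# Hypothetical sample of m = 4 observations of one variable.
x = [1.0, 2.0, 3.0, 4.0]
m = len(x)

var_m = statistics.pvariance(x)   # sum of squared deviations divided by m
var_m1 = statistics.variance(x)   # same sum divided by m - 1

# The two results differ only by the factor m / (m - 1).
ratio = var_m1 / var_m
```

The same factor $m/(m-1)$ relates the two versions of every entry of a sample covariance matrix, so the choice of divisor rescales the whole matrix and does not change its shape or symmetry.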

People who defend a particular estimator as "best" (and there are objective ways of defining this) often think this "proves" the same formula should be used for the sample covariance. However, from the point of view of "descriptive statistics", the goal of "sample covariance" is merely an attempt to describe data, not to estimate something. So you may find some books use one formula and some use another. I don't know what MATLAB decided to do.

5. Sep 30, 2012

### Tilde90

Thank you for your help, Stephen. Following your link, I found this page (http://en.wikipedia.org/wiki/Sample_covariance_matrix#Sample_mean_and_covariance): the third equation, written again with vector formalism, proves my supposition, apart from the denominator, which is $(m-1)$ as you suggested.

I also do not know what MATLAB decided to do, and I also do not know how to find it. :) However, it is good, at least, to have an idea regarding the computation of a sample covariance matrix.