# Mean centering of the covariance matrix in PCA

## Main Question or Discussion Point

Hi all,
I thought I posted this last night, but I have received no notification of it being moved and I can't find it in my list of started threads.

I was wondering if you could help me understand how PCA (principal component analysis) works a little better. I have often read that to get the best results from PCA you should mean-centre the variables in your data matrix first. However, I thought that one method of calculating the principal components was the covariance matrix method, where the eigenvalues and eigenvectors of the covariance matrix give you the directions of greatest variance in the data. I also assumed that the elements of the covariance matrix are calculated using the following formula:

Cov(X,Y) = sum_i([Xi-Xmean][Yi-Ymean])/(N-1)

If I subtracted the mean from the original data matrix first, would it matter? The calculation above already subtracts the means, so I would get the same covariance matrix either way.
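A quick numerical check of this point, as a minimal sketch: the data below are made up purely for illustration, and `np.cov` is used because it subtracts the column means internally, just like the formula above. It also compares against the raw second-moment matrix X'X/(N-1), which is what you get if no mean is subtracted anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples of 3 variables, deliberately far from the origin.
X = rng.normal(loc=[5.0, -2.0, 10.0], scale=[1.0, 2.0, 0.5], size=(200, 3))

# Mean-centre each column.
Xc = X - X.mean(axis=0)

# np.cov subtracts the means itself, so centring first changes nothing.
cov_raw = np.cov(X, rowvar=False)
cov_centred = np.cov(Xc, rowvar=False)
same_cov = np.allclose(cov_raw, cov_centred)

# The second-moment matrix skips the mean subtraction entirely,
# so it is NOT the covariance matrix unless the data are already centred.
n = X.shape[0]
second_moment = X.T @ X / (n - 1)
differs = not np.allclose(second_moment, cov_raw)

print("covariance unchanged by centring:", same_cov)
print("second-moment matrix differs from covariance:", differs)
```

So centring only matters when the eigen-decomposition is applied to something that does not already remove the mean, such as X'X or an SVD of the raw data matrix.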

I hope someone can help.

Thanks

Well, I've been thinking about it all day. In the transformation of the original data, where you multiply by the eigenvectors to carry out a linear transform, the uncentred case uses vectors coming from the origin (zero in all dimensions). In the mean-centred case the vectors come from the centroid of the data, which is now at zero. I can only imagine that this affects the transformation in such a way that it better explains the variance within the dataset, since the covariance matrix won't change and hence neither will the eigenvectors. Do you think I am getting close?
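One way to test this idea numerically, as a rough sketch with made-up 2-d data: take the eigenvectors of the covariance matrix and project both the raw and the mean-centred data onto them. Since the eigenvectors are the same in both cases, the two sets of scores can only differ by a constant shift (the projection of the mean vector), so the variance along each component is identical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated 2-d data with a centroid away from the origin.
mixing = np.array([[2.0, 0.0], [1.0, 0.5]])
X = rng.normal(size=(100, 2)) @ mixing + np.array([10.0, 4.0])
mean = X.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the raw and the centred data onto the same eigenvectors.
scores_raw = X @ eigvecs
scores_centred = (X - mean) @ eigvecs

# The two score sets differ by the same row vector everywhere.
offset = scores_raw - scores_centred
constant_offset = np.allclose(offset, mean @ eigvecs)

# Variance along each component is the same and equals the eigenvalue.
same_variance = np.allclose(scores_raw.var(axis=0, ddof=1), eigvals)

print(constant_offset, same_variance)
```

In this setup centring just moves the scores so that they are centred on zero; it does not change the directions or the variance explained.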

I think mean-subtraction is recommended because Var[X]<=E[X^2], i.e. the spread about the mean is never larger than the spread about the origin. For the n-d case I find it easier to understand PCA in more general terms using the (reduced) singular value decomposition, where we write the data matrix X as
X = Y+PDQ'
where Y is some predefined matrix (e.g. row-repeated column means), D is square diagonal and P and Q are rectangular matrices with P'P=I=Q'Q. The columns of P and Q are the eigenvectors corresponding to the non-zero eigenvalues of (X-Y)(X-Y)' and (X-Y)'(X-Y) respectively (these are XX' and X'X when Y=0), and the diagonal of D has the square roots of the non-zero eigenvalues. This representation works for both low-dimensional many-data and high-dimensional few-data problems.
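Here is a small sketch of that decomposition using `np.linalg.svd` on made-up data, with Y chosen as the row-repeated column means. The variable names mirror the symbols above; the data and sizes are assumptions for illustration only. The SVD is applied to X-Y, so the eigenvalues being checked are those of (X-Y)'(X-Y), which is X'X in the Y=0 case.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 50 samples, 4 variables, non-zero column means.
X = rng.normal(size=(50, 4)) + np.array([3.0, -1.0, 0.0, 7.0])

# Y = row-repeated column means (set Y to zeros for the uncentred case).
Y = np.tile(X.mean(axis=0), (X.shape[0], 1))

# Reduced SVD of X - Y: P is 50x4, d holds the singular values, Qt is Q'.
P, d, Qt = np.linalg.svd(X - Y, full_matrices=False)
D = np.diag(d)
Q = Qt.T

# X = Y + P D Q'
reconstructs = np.allclose(X, Y + P @ D @ Q.T)

# P'P = I = Q'Q
orthonormal = np.allclose(P.T @ P, np.eye(4)) and np.allclose(Q.T @ Q, np.eye(4))

# Squared singular values are the eigenvalues of (X-Y)'(X-Y).
eigvals = np.linalg.eigvalsh((X - Y).T @ (X - Y))[::-1]  # descending order
matches_eigvals = np.allclose(d**2, eigvals)

print(reconstructs, orthonormal, matches_eigvals)
```

Dividing `d**2` by N-1 gives the usual PCA variances when Y holds the column means.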

I'm not sure if there is a simple relation between the eigenvectors for Y=0 vs Y=Xbar, but it should be possible to show that the largest eigenvalue decreases (or at least does not increase) when going from Y=0 to Y=Xbar.
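The eigenvalue claim can be checked numerically. With Xc the centred data, N the number of rows and xbar the vector of column means, X'X = Xc'Xc + N xbar xbar'. The extra term is positive semidefinite, so no eigenvalue of X'X can be smaller than the corresponding eigenvalue of Xc'Xc. The sketch below verifies this on made-up data; it is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 80 samples, 3 variables, non-zero column means.
X = rng.normal(size=(80, 3)) + np.array([4.0, 4.0, -2.0])
n = X.shape[0]
xbar = X.mean(axis=0)
Xc = X - xbar

# Identity: X'X = Xc'Xc + n * xbar xbar'
identity_holds = np.allclose(X.T @ X, Xc.T @ Xc + n * np.outer(xbar, xbar))

# eigvalsh returns eigenvalues in ascending order.
eig_raw = np.linalg.eigvalsh(X.T @ X)
eig_centred = np.linalg.eigvalsh(Xc.T @ Xc)

# Largest eigenvalue for Y=0 vs Y=Xbar.
top_raw = eig_raw[-1]
top_centred = eig_centred[-1]

# Every ordered eigenvalue is at least as large without centring.
all_dominated = bool(np.all(eig_centred <= eig_raw + 1e-9))

print(top_raw, top_centred, all_dominated)
```

When the mean is far from the origin, the leading uncentred eigenvector tends to point towards the mean rather than along the direction of greatest spread, which is one common argument for centring first.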