WWGD said:
Sorry, I don't mean to hijack the thread, it is just that I am curious about a related issue: the interpretation of eigenvalues in correlation/covariance matrices. These supposedly describe directions of maximal variability of the data, but I just cannot see it at this point. I thought since the OP seems satisfied with the answers given, it may make sense to extend the thread beyond the original scope.
I'm not totally sure I understand your question, as there are a lot of possible interpretations here. In all cases I assume we're working with centered (read: zero mean, by column) data. I also assume we're working over the reals.
If you have your data in a matrix ##\mathbf A## and some arbitrary vector ##\mathbf x## with ##\big \vert \big \vert \mathbf x \big \vert \big \vert_2^{2} = 1##, then to maximize ##\big \vert \big \vert \mathbf{Ax} \big \vert \big \vert_2^{2}## you'd allocate entirely to ##(\lambda_1, \mathbf v_1)##, the largest eigenpair of ##\mathbf A^T \mathbf A## (equivalently, the largest singular value (squared) of ##\mathbf A## and its associated right singular vector). This is the quadratic form interpretation. People typically prove it with a diagonalization argument or a Lagrange multiplier argument. I assume the eigenvalues of this symmetric positive (semi)definite matrix (proportional to the covariance matrix, since the data is centered) are ordered, so ##\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n \geq 0##, where
##\mathbf A = \bigg[\begin{array}{c|c|c|c} \mathbf a_1 & \mathbf a_2 & \cdots & \mathbf a_{n} \end{array}\bigg]##
That is, ##\mathbf a_j## refers to the ##j##th feature column of ##\mathbf A##. Using the interpretation of matrix-vector multiplication as a weighted sum of the columns of the matrix, we see that ##\mathbf{Ax} = x_1 \mathbf a_1 + x_2 \mathbf a_2 + \dots + x_n \mathbf a_n##.
Thus when someone asks for a constrained maximization of ##\big \vert \big \vert \mathbf{Ax} \big \vert \big \vert_2^{2}##, what they are asking for is the linear combination of the feature columns of ##\mathbf A## with maximal squared length, subject to the constraint ##x_1^2 + x_2^2 + \dots + x_n^2 = 1## (or some other positive constant, but we use one for simplicity). Since all features are zero mean (i.e. you centered your data), what you have done is extract the direction with the highest second moment / variance from your features -- again subject to that constraint.
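If it helps to make this concrete, here is a quick numpy sketch (my own illustration, not something from the thread -- the data and variable names are made up): center a random data matrix, take the top eigenpair of ##\mathbf A^T \mathbf A##, and check that ##\big \vert \big \vert \mathbf A \mathbf v_1 \big \vert \big \vert_2^{2} = \lambda_1## while a random unit vector doesn't do better.

```python
import numpy as np

# Made-up example data: 200 observations of 4 features, then center each column
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 4))
A = A - A.mean(axis=0)

# Eigendecomposition of the symmetric PSD matrix A^T A; eigh returns ascending eigenvalues
evals, evecs = np.linalg.eigh(A.T @ A)
lam1, v1 = evals[-1], evecs[:, -1]          # largest eigenpair

# ||A v1||^2 hits lambda_1 exactly ...
print(np.isclose(np.linalg.norm(A @ v1) ** 2, lam1))        # True

# ... and a random unit vector never exceeds it
x = rng.normal(size=4)
x /= np.linalg.norm(x)
print(np.linalg.norm(A @ x) ** 2 <= lam1 + 1e-9)            # True
```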
Here is another interpretation:
If you wanted a low rank approximation -- say rank one -- of your matrix ##\mathbf A##, and you were using ##\big \vert \big \vert \mathbf A \big \vert \big \vert_F^{2}## as your ruler (i.e. the sum of the squared value of everything in ##\mathbf A##, which is a generalization of the L2 norm on vectors), you'd also allocate entirely to ##\lambda_1##. Here ##\big \vert \big \vert \mathbf A \big \vert \big \vert_F^{2} = \text{trace}\big(\mathbf A^T \mathbf A\big) = \lambda_1 + \lambda_2 + \dots + \lambda_n##, and since the associated eigenvectors are mutually orthonormal we have a clean partition: each eigenvalue you allocate to increases the rank of your approximation by one, so for a rank 2 approximation you'd allocate to ##\lambda_1## and ##\lambda_2##, and so forth.
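Same caveat that this is just a sketch of my own, but the low rank picture is easy to see numerically too: numpy's SVD gives singular values whose squares are the ##\lambda_i## above, a rank-##k## approximation keeps the top ##k## of them, and the squared Frobenius error is exactly the sum of the discarded ones (the Eckart-Young result).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 4))
A = A - A.mean(axis=0)                              # centered data, as above

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # s**2 are the eigenvalues of A^T A

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]                # rank-2 approximation

# ||A||_F^2 = lambda_1 + ... + lambda_n, and the rank-k error is the tail sum
print(np.isclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2)))              # True
print(np.isclose(np.linalg.norm(A - A_k, 'fro') ** 2, np.sum(s[k:] ** 2)))    # True
```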