Principal component analysis and data compression in Machine Learning

SUMMARY

This discussion focuses on the implementation of Principal Component Analysis (PCA) for data compression on an m x n matrix X, where m represents data points and n represents features. Two approaches are compared: using the singular value decomposition (SVD) directly on the data matrix X and on the covariance matrix of X. The consensus is that while both methods are valid, using the covariance matrix retains variability information but may lose the original data values. The importance of data whitening in PCA is emphasized, particularly when dealing with features of different scales.

PREREQUISITES
  • Understanding of Principal Component Analysis (PCA)
  • Familiarity with Singular Value Decomposition (SVD)
  • Knowledge of covariance matrices and their properties
  • Experience with data preprocessing techniques, including data whitening
NEXT STEPS
  • Explore the implementation of PCA using MATLAB's PCA function and PCACOV
  • Learn about data whitening techniques in PCA
  • Investigate the implications of using covariance matrices versus raw data in PCA
  • Study the relationship between variance, covariance, and information retention in datasets
USEFUL FOR

Data scientists, machine learning practitioners, and statisticians interested in dimensionality reduction and data compression techniques using PCA.

Wille
TL;DR
I wonder how to accurately perform data compression on the m x n matrix X using PCA. I have seen both X_k=X*W_k (k = new number of dimensions < n) where W comes from [U,S,W]=svd(X), but I have also seen X_compress=X*W_k with W from [U,S,W]=svd((1/m)*X^T*X), i.e. the svd performed on the covariance matrix of X. Which is correct? When I do these two techniques I do not get the same W.
I wonder how to accurately perform data compression on the m x n matrix X using PCA. Each row is a data point, and each column is a feature.
So m data points with n features. If I want to go down to k < n dimensions, what is the correct way of doing so? How do I correctly create the matrix W_k, which consists of the first k columns of the matrix W, and then form the compressed data X_k = X*W_k?

I have seen two approaches:
One is X_k = X*W_k, where W comes from [U,S,W] = svd(X)
(this is from Wikipedia https://en.wikipedia.org/wiki/Principal_component_analysis )

but I have also seen
X_compress = X*W_k with W from [U,S,W] = svd((1/m)*X^T*X), i.e. the SVD performed on the covariance matrix of X.
(seen in an online Machine Learning course)

Which is correct? When I do these two techniques I do not get the same W, i.e. not the same result for X_k=X*W_k.
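For concreteness, here is roughly what I am comparing, written as a MATLAB sketch (made-up data; the variable names are just for illustration):

% m x n data matrix: rows are data points, columns are features
X = randn(100, 5);            % made-up example data
Xc = X - mean(X, 1);          % mean-centre each feature (column)
m = size(X, 1);
k = 2;                        % target dimension, k < n

% Approach 1: SVD of the (centred) data matrix itself
[~, ~, W1] = svd(Xc, 'econ');

% Approach 2: SVD of the covariance matrix
C = (1/m) * (Xc' * Xc);       % covariance matrix (1/m normalisation)
[~, ~, W2] = svd(C);

% Compress to k dimensions with each W
Xk1 = Xc * W1(:, 1:k);
Xk2 = Xc * W2(:, 1:k);

% Sanity check: with the data mean-centred, the columns of W1 and W2 should
% agree up to sign flips (both are eigenvectors of Xc'*Xc)
disp(max(abs(abs(W1) - abs(W2)), [], 'all'));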

Thanks.
 
Wouldn't using the covariance matrix lose the information about the actual X values?
 
In Matlab, the PCA function whitens the data to have unit variance, whereas PCACOV (by acting on the covariance matrix) preserves the original scaling - could this be the issue?
 
BWV said:
In Matlab, the PCA function whitens the data to have unit variance, whereas PCACOV (by acting on the covariance matrix) preserves the original scaling - could this be the issue?
I don't know. Do you mean the svd function? Matlab or not, which of the two approaches is correct to use?
 
Wille said:
I don't know. Do you mean the svd function? Matlab or not, which of the two approaches is correct to use?
Neither is wrong - it just depends on the context - typically PCA on raw data is whitened first, as differences in scale (say, different units) can create problems. PCA on the cov matrix is of course not whitened (otherwise you would be doing it on the correlation matrix). Not sure why you would use the cov matrix for data compression, as you would lose the original data and just retain a diagonalized cov matrix.
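Roughly, in MATLAB terms (a sketch only; pca is from the Statistics and Machine Learning Toolbox, and the data here is made up):

% Whitening (z-scoring): give each feature zero mean and unit variance so that
% features measured in very different units contribute comparably
X = [randn(100, 1), 1000*randn(100, 1)];   % two features on very different scales
Z = (X - mean(X, 1)) ./ std(X, 0, 1);      % whitened data

[coeffRaw, ~, latentRaw] = pca(X);   % PCA on the raw data (covariance scaling)
[coeffZ,   ~, latentZ]   = pca(Z);   % PCA on whitened data (correlation scaling)

latentRaw   % dominated by the large-scale feature
latentZ     % both features contribute comparably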
 
BWV said:
Not sure why you would use the cov matrix for data compression, as you would lose the original data and just retain a diagonalized cov matrix.
Exactly. You would lose the information of the actual values of the X variables. You would only have information about how they vary from their means. In other words, you would have information about the shape of the scattered data, but not know where the data is centered. I guess you could add back in a matrix of means, but I think that would unnecessarily complicate things.
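For concreteness, something along these lines (a MATLAB sketch with made-up data):

% Compress to k dimensions, then reconstruct by adding the column means back in
X  = randn(100, 5);            % made-up data
mu = mean(X, 1);               % per-feature means
Xc = X - mu;                   % centred data
k  = 2;

[~, ~, W] = svd(Xc, 'econ');
Wk   = W(:, 1:k);              % first k principal directions
Xk   = Xc * Wk;                % compressed, k-dimensional representation
Xhat = Xk * Wk' + mu;          % approximate reconstruction; without "+ mu" you only
                               % recover the shape about the mean, not where it sits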
 
BWV said:
Neither is wrong - it just depends on the context - typically PCA on raw data is whitened first, as differences in scale (say, different units) can create problems. PCA on the cov matrix is of course not whitened (otherwise you would be doing it on the correlation matrix). Not sure why you would use the cov matrix for data compression, as you would lose the original data and just retain a diagonalized cov matrix.
Ok. I found this explanation:
https://www.quora.com/Why-does-PCA-...to-get-the-principal-components-of-features-X

It says:
"This is because covariance matrix accounts for variability in the dataset, and variability of the dataset is a way to summarize how much information we have in the data (Imagine a variable with all same values as its observations, then the variance is 0, and intuitively speaking, there’s not too much information from this variable because every observation is the same). The diagonal elements of the covariance matrix stand for variability of each variable itself, and off-diagonal elements in covariance matrix represents how variables are correlated with each other.

Ultimately we want our transformed variables to contain as much as information (or equivalently, account for as much variability as possible)."

I.e., the author suggests that using the covariance matrix is the way to do PCA.
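As a small illustration of the zero-variance point in that quote (MATLAB sketch, made-up data):

% A feature with zero variance carries no information for PCA to preserve
X = [randn(200, 2), 5*ones(200, 1)];   % third feature is constant
C = cov(X);
diag(C)    % per-feature variances: the third entry is zero
eig(C)     % eigenvalues of the covariance matrix: one of them is (numerically) zero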
 
