Using PCA for variable reduction

In summary, Jolliffe suggests a "cluster criterion" for variable reduction in PCA: identify clusters of variables and retain one variable from each cluster. However, the textbook gives no worked example of how to implement this. In the thread below, the poster tries to apply the technique to a problem of differentiating two substances based on their spectra. The resulting plot shows that the differentiation is driven mainly by PC2, with high values of PC2 associated with one substance and low values with the other. To select a subset of variables, the poster could retain the wavelengths with the largest loadings on PC2, which would be the wavelengths around 550 nm and 700 nm.
  • #1
roam
In the textbook “Principal Component Analysis” Jolliffe (§9.2) suggests the following method for variable reduction:

“When the variables fall into well-defined clusters, there will be one high-variance PC and, except in the case of 'single-variable' clusters, one or more low-variance PCs associated with each cluster of variables. Thus, PCA will identify the presence of clusters among the variables, and can be thought of as a competitor to standard cluster analysis of variables. [...] Identifying clusters of variables may be of general interest in investigating the structure of a data set but, more specifically, if we wish to reduce the number of variables without sacrificing too much information, then we could retain one variable from each cluster.”

I am trying to apply this idea of variable reduction to a problem which involves differentiating two substances based on their spectra, each of which is composed of 1350 variables (i.e. wavelengths):

[Image: apmXOkU.png — the measured spectra of the two substances]


And here is the resulting differentiation between the species:

[Image: DDQEa6r.jpg — PC1 vs PC2 scores showing the separation between the species]


I want to be able to perform the same differentiation, but using fewer variables. Unfortunately, the textbook does not give any examples of how to do this. The aim is to divide variables, rather than observations, into groups. So, how should we plot the variables in order to see the clusters? :confused:

For instance, I have tried plotting the loadings of each variable on the first two PCs and obtained:

[Image: cvbtkkS.png — loadings of each variable on the first two PCs]


In this plot there don't seem to be any well-defined clusters to choose variables from. Am I following Jolliffe's method correctly?

Any explanation would be greatly appreciated.

P.S. Plotting the actual PCs gives the following (the ordinate should represent the correlation coefficients between a given variable and the PC):

[Image: ooZZM91.png — the PCs plotted against wavelength]
 

  • #2
roam said:
In the textbook “Principal Component Analysis” Jolliffe (§9.2) suggests the following method for variable reduction:

When the variables fall into well-defined clusters, there will be one high-variance PC and, except in the case of 'single-variable' clusters, one or more low-variance PCs associated with each cluster of variables.

It's unclear to me what the quoted passage means by "the variables". On the one hand, "the variables" might mean the variables in the raw data. On the other hand, "the variables" might be the coefficients for the principal components that are associated with a leaf.

A principal component is a vector and it is a constant vector. So the terminology "high variance PC" is imprecise. I interpret it to mean that the coefficient associated with that vector (considered as a random variable varying over the population of items in the data) has high variance.
 
  • #3
roam said:
And here is the resulting differentiation between the species:

[Image: ddqea6r-jpg.jpg — the PC1 vs PC2 score plot from post #1]

In the graph above, it seems you have already achieved variable reduction. Instead of each spectrum being represented by a vector of 1350 values, the above plot represents each spectrum with just two variables (PC1 and PC2). Furthermore, the difference between the red and blue spectra seems to lie along PC2, with high values of PC2 associated with your red spectra and low values of PC2 associated with the blue spectra.
 

  • #4
Stephen Tashi said:
It's unclear to me what the quoted passage means by "the variables". On the one hand, "the variables" might mean the variables in the raw data. On the other hand, "the variables" might be the coefficients for the principal components that are associated with a leaf.

It is referring to the variables in the raw data.

Here is a paper by the same author that also mentions this "cluster criterion" technique for selecting a subset of the original variables in the raw data. But the problem is that they don't give a clear worked example, so I am not sure how to plot the data to see clusters in the variables.

Are they suggesting making a dendrogram based on some kind of cluster analysis of the variables? :confused:
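In case it helps clarify what I mean, here is a rough sketch of the kind of procedure I imagine (toy data, a 1 − |correlation| distance, and an arbitrary threshold as a crude stand-in for a proper hierarchical clustering / dendrogram cut):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: 9 variables built as 3 correlated blocks of 3.
z = rng.normal(size=(200, 3))
X = np.repeat(z, 3, axis=1) + 0.2 * rng.normal(size=(200, 9))

# Distance between variables: 1 - |correlation|.
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))

# Greedy one-pass grouping with an arbitrary threshold (a crude
# stand-in for cutting a dendrogram from hierarchical clustering).
threshold = 0.5
clusters = []
for j in range(X.shape[1]):
    for cl in clusters:
        if dist[j, cl[0]] < threshold:
            cl.append(j)
            break
    else:
        clusters.append([j])

# Jolliffe's criterion: retain one variable from each cluster.
retained = [cl[0] for cl in clusters]
print(clusters, retained)
```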

Ygggdrasil said:
In the graph above, it seems you have already achieved variable reduction. Instead of each spectrum being represented by a vector of 1350 values, the above plot represents each spectrum with just two variables (PC1 and PC2). Furthermore, the difference between the red and blue spectra seems to lie along PC2, with high values of PC2 associated with your red spectra and low values of PC2 associated with the blue spectra.

Hi Ygggdrasil,

This is true. But I have used all 1350 variables (i.e., wavelengths) to conduct PCA and differentiate the species. And in my application, 1350 variables are too many. So I want to select a subset of these variables/wavelengths to retain. The technique suggested above by Jolliffe is meant to be a criterion for selecting the best subset of variables. Unfortunately, I couldn't find a guide on how to implement this technique.
 
  • #5
roam said:
This is true. But I have used all 1350 variables (i.e., wavelengths) to conduct PCA and differentiate the species. And in my application, 1350 variables are too many. So I want to select a subset of these variables/wavelengths to retain. The technique suggested above by Jolliffe is meant to be a criterion for selecting the best subset of variables. Unfortunately, I couldn't find a guide on how to implement this technique.

In this case, I'd select the wavelengths with the largest loading in PC2 (which would be the wavelengths around 550 nm and 700 nm). Looking at the spectra, it does seem to be the case that the red spectra tend to have higher reflectance at those wavelengths than the blue spectra.
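As a sketch of that selection (with synthetic spectra standing in for the real data — the 400–800 nm grid, band positions, and group structure below are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
wavelengths = np.linspace(400, 800, 1350)   # nm grid, 1350 variables

# Synthetic stand-in for the spectra: a shared broad band, plus two
# narrow bands (near 550 nm and 700 nm) present in one group only.
def band(centre, width):
    return np.exp(-0.5 * ((wavelengths - centre) / width) ** 2)

base = band(600, 80)
group_b = base + 0.3 * band(550, 15) + 0.3 * band(700, 15)
spectra = np.array([base] * 20 + [group_b] * 20)

scale = 1 + 0.3 * rng.normal(size=(40, 1))  # PC1: overall intensity
X = scale * spectra + 0.02 * rng.normal(size=(40, 1350))

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc2 = eigvecs[:, -2]                        # eigh is ascending, so -2 is PC2

# Retain the k wavelengths with the largest |loading| on PC2.
k = 10
keep = np.sort(np.argsort(np.abs(pc2))[-k:])
print(np.round(wavelengths[keep], 1))
```

On this toy example, the retained wavelengths cluster around the two narrow bands, which is the behaviour being suggested above.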
 
  • #6
Ygggdrasil said:
In this case, I'd select the wavelengths with the largest loading in PC2 (which would be the wavelengths around 550 nm and 700 nm). Looking at the spectra, it does seem to be the case that the red spectra tend to have higher reflectance at those wavelengths than the blue spectra.

So, what if the separation is mainly in terms of PC2, but some of the peak regions of that PC curve have negative values? Would you use wavelengths that have negative loadings? :confused:

For instance, in the above, the PCA was done with the covariance matrix of the variables. If I recompute the PCA using the correlation matrix, the separation is still in terms of PC2, but I get totally different values for the loadings. Some of the peaks of PC2 will have negative values. How should we treat these regions?
 
  • #7
roam said:
So, what if the separation is mainly in terms of PC2, but some of the peak regions of that PC curve have negative values? Would you use wavelengths that have negative loadings? :confused:

For instance, in the above, the PCA was done with the covariance matrix of the variables. If I recompute the PCA using the correlation matrix, the separation is still in terms of PC2, but I get totally different values for the loadings. Some of the peaks of PC2 will have negative values. How should we treat these regions?
Good point. You would use the variables where the absolute value of the loadings is greatest, as these variables have the greatest effect on the value of PC2.
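For what it's worth, a quick numerical check of both points on toy data — that correlation-matrix PCA is just covariance-matrix PCA on standardised variables, and that ranking by |loading| is unaffected by the arbitrary overall sign of a PC:

```python
import numpy as np

rng = np.random.default_rng(3)
# 5 variables on wildly different scales.
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
Xc = X - X.mean(axis=0)

# Correlation-matrix PCA is covariance-matrix PCA on standardised data.
Xs = Xc / Xc.std(axis=0, ddof=1)
corr_eigvals, corr_vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
std_eigvals, _ = np.linalg.eigh(np.cov(Xs, rowvar=False))
print(np.allclose(corr_eigvals, std_eigvals))   # True

# The overall sign of a PC is arbitrary (v and -v span the same axis),
# so rank variables by |loading|, which is insensitive to the sign.
pc = corr_vecs[:, -1]                           # highest-variance PC
ranking = np.argsort(np.abs(pc))[::-1]
print(ranking)
```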
 

1. What is PCA and how does it work?

PCA stands for Principal Component Analysis. It is a statistical technique used to reduce the number of variables in a dataset while retaining the most important information. It works by finding the directions along which the data vary the most and creating new variables (called principal components) that are linear combinations of the original variables, ordered by the amount of variance they capture.
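A minimal sketch of the computation on toy data (eigendecomposition of the covariance matrix, which is one standard way to compute PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of 3 correlated variables.
X = rng.normal(size=(100, 1)) @ np.array([[1.0, 0.8, 0.1]]) \
    + 0.1 * rng.normal(size=(100, 3))

# Centre the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]           # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: each observation projected onto the principal components.
scores = Xc @ eigvecs
print(eigvals / eigvals.sum())              # variance share of each PC
```

Because the three toy variables are strongly correlated, nearly all of the variance ends up in the first component.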

2. When should PCA be used for variable reduction?

PCA is useful when dealing with datasets that have a large number of variables, making it difficult to analyze or visualize the data. It can also be used to identify relationships and patterns in the data that may not be apparent with the original variables. However, it should not be used as a substitute for proper feature selection or when the variables have a clear and direct relationship with the outcome.

3. How many principal components should be retained?

The number of principal components to retain depends on the amount of variance explained by each component. A commonly used rule of thumb is to retain enough components to explain at least 70-80% of the total variance in the data. However, the specific number of components to retain may vary depending on the dataset and the goals of the analysis.
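For example, the rule of thumb can be applied directly to the eigenvalues (toy data; the 80% cut-off is just the figure mentioned above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 200 observations of 10 correlated variables.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
cum = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of PCs explaining at least 80% of the total variance.
k = int(np.searchsorted(cum, 0.80) + 1)
print(k, np.round(cum[:k], 3))
```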

4. Can PCA be used for categorical variables?

No, PCA is typically used for continuous variables. Categorical variables can be converted to numerical values and included in the analysis, but this may not always be appropriate. It is important to carefully consider the nature of the data and the goals of the analysis before using PCA for variable reduction.

5. Are there any limitations to using PCA for variable reduction?

Yes, there are some limitations to using PCA for variable reduction. One limitation is that the new variables created by PCA may be difficult to interpret and may not have a direct relationship with the original variables. Additionally, PCA only captures linear relationships among variables, so it may not be appropriate for data with strongly nonlinear structure. It is important to carefully evaluate the data and consider other methods of variable reduction before using PCA.
