Principal Component Analysis with dependent variable?

Discussion Overview

The discussion revolves around the application of Principal Component Analysis (PCA) in the context of a study on "Quantum Yield," which is treated as a dependent variable influenced by various independent factors (rate constants). Participants explore the appropriateness of including both dependent and independent variables in PCA and the implications for understanding the relationships between these variables.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Conceptual clarification

Main Points Raised

  • One participant references a paper that uses PCA on both independent variables and the dependent variable "Quantum Yield," questioning the validity of this approach.
  • Another participant suggests that in PCA, all variables can be considered 'dependent' as the goal is to extract orthogonal factors that explain variation across the dataset.
  • A different viewpoint emphasizes that PCA minimizes sum-squared perpendicular distances rather than directly predicting any original variable, indicating a distinction between PCA and regression analysis.
  • A participant clarifies that "Quantum Yield" can be expressed as a function of rate constants and questions how PCA can elucidate the contribution of these rate constants to the quantum yield when both are included in the analysis.
  • Concerns are raised about the interpretation of PCA loading plots, particularly regarding the significance of the quantum yield's position in the principal components and its implications for understanding the relationship with rate constants.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of including dependent variables in PCA, with some asserting that it is valid while others question its utility. The discussion remains unresolved regarding the implications of PCA results when both types of variables are included.

Contextual Notes

Participants note potential limitations in understanding how PCA results relate to the contributions of independent variables to the dependent variable, particularly in the context of the specific study referenced.

HAYAO
TL;DR
I found a paper that applies PCA to a multivariable system in which one of the variables is a dependent variable. I thought the convention was to leave the dependent variable out.
So I found a paper,
https://www.sciencedirect.com/science/article/pii/S0022231314006048?via=ihub
which concerns a property of a compound called "Quantum Yield", which is a result of various factors (independent variables). The authors are trying to figure out what factors affect the quantum yield.

The authors freely vary these factors and calculate the quantum yield (by numerically solving a system of differential equations). Obviously, this makes the quantum yield a dependent variable. The authors apply PCA to both the various factors and the quantum yield, that is, to the independent and dependent variables together.

Fig. 4 on page 607 shows the loading plot (correlation circle) of the result. "Yield" refers to the quantum yield. The result is meaningful, at least for this figure, because it shows how the independent variables affect the dependent variable. But I learned that you should separate out the dependent variables before performing PCA. However, it could be the case that including them is appropriate for this objective. So, is this a valid way to use PCA for this purpose?

I'm kind of confused here, and I am sure I must be misunderstanding something simple.
I would greatly appreciate it if anyone could help me.

Thank you.
 
Typically with PCA all the variables are 'dependent': the goal is to extract orthogonal factors that will then serve as independent variables to explain variation across the set.
 
The purpose of PCA is to determine the best reduced-variable description of the positions of the data. So it minimizes the sum-squared perpendicular distances from the data to a reduced-dimension space. It has nothing to do with any dependencies. As such, it is not the best predictor of any of the original variables. The best predictor would minimize the sum-squared errors in that variable. That is not the same as minimizing the sum-squared perpendicular distances.
The situation that you are used to is where PCA is used to give a lower-dimension set of variables that best summarize the independent data and are then used in another linear regression to predict the dependent variable.
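To make the distinction between the two criteria concrete, here is a minimal sketch on made-up two-variable data (the data, the seed, and the helper names `sum_sq_perp` and `sum_sq_vert` are assumptions for illustration, not anything from the paper). It fits a line through the centroid two ways: along the first principal component, and by ordinary least squares of y on x, and then scores both lines under both criteria.

```python
import numpy as np

# Synthetic, purely illustrative data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Center the data so both fitted lines pass through the centroid.
Xc = np.column_stack([x, y])
Xc = Xc - Xc.mean(axis=0)

# First principal component from the SVD of the centered data.
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = vt[0]
pca_slope = pc1[1] / pc1[0]

# Ordinary least-squares slope of y on x.
ols_slope = (Xc[:, 0] @ Xc[:, 1]) / (Xc[:, 0] @ Xc[:, 0])


def sum_sq_perp(slope):
    """Sum of squared perpendicular distances to the line with this slope."""
    d = np.array([1.0, slope]) / np.hypot(1.0, slope)  # unit direction
    resid = Xc - np.outer(Xc @ d, d)                   # part orthogonal to the line
    return float((resid ** 2).sum())


def sum_sq_vert(slope):
    """Sum of squared errors in y (vertical distances) for this slope."""
    return float(((Xc[:, 1] - slope * Xc[:, 0]) ** 2).sum())


print("PCA slope:", pca_slope, "OLS slope:", ols_slope)
print("perpendicular SS (PCA, OLS):", sum_sq_perp(pca_slope), sum_sq_perp(ols_slope))
print("vertical SS      (PCA, OLS):", sum_sq_vert(pca_slope), sum_sq_vert(ols_slope))
```

The PCA line should come out no worse on the perpendicular criterion and the regression line no worse on the error in y, which is the sense in which PCA is not the best predictor of any single original variable.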
 
I'm sorry, I don't think I phrased my question well.

Yes, indeed all the variables could be "dependent" in PCA, and the purpose of PCA is to remove redundant and meaningless variables and extract a new set of orthogonal axes that explains the variation well.

"Quantum Yield" can be explicitly expressed as a function of the "rate constants" (such that QY = k12 + k2*k3 + ...), and the authors let these "rate constants" vary within a certain range to calculate the "quantum yield" for each set of "rate constants".

My question is: how would applying PCA to data that include both the "quantum yield" and the "rate constants" help us understand which "rate constants" contribute most to the "quantum yield"? I thought that a PCA loading plot shows how the "rate constants" explain the variation of the data along the selected PC, not the "quantum yield". The loading plot (correlation circle) shows that the "quantum yield" is always at -1.0 (or 1.0 in a different experiment shown in the Supplementary Information) on PC1, and 0.0 on PC2. I do not believe this is a coincidence. The authors then explain that some "rate constants" lie on the negative side of the loading plot, which shows that those "rate constants" contribute negatively to the "quantum yield". Why does the PCA the authors use allow us to understand how the "rate constants" contribute to the "quantum yield"? Or, to rephrase the question: how does including both the "quantum yield" and the "rate constants" in the data set and applying PCA to them allow us to understand how the "rate constants" explain the "quantum yield"?

Although the article does not explicitly state the procedures, I do not think the authors used any Principal Component Regression since "quantum yield" is plotted on the loading plot explicitly.
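One way to probe this is with a toy simulation. The sketch below is not the paper's model: the yield formula `k_r / (k_r + k_nr)`, the sampling ranges, and the names `k_r`, `k_nr`, `k_x` are all assumptions chosen for illustration. It samples the inputs independently, computes a yield from them, runs PCA on the standardized data with the yield included, and compares the PC1 loadings (the coordinates on a correlation circle) with the plain correlation of each input with the yield.

```python
import numpy as np

# Hypothetical toy model, NOT the paper's: two rate constants drive the
# yield, and a third (k_x) has no effect on it at all.
rng = np.random.default_rng(1)
n = 2000
k_r = rng.uniform(0.5, 1.5, n)
k_nr = rng.uniform(0.5, 1.5, n)
k_x = rng.uniform(0.5, 1.5, n)
qy = k_r / (k_r + k_nr)  # assumed stand-in for the computed quantum yield

names = ["k_r", "k_nr", "k_x", "QY"]
X = np.column_stack([k_r, k_nr, k_x, qy])

# Standardize, so that PCA works on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = (Z.T @ Z) / n

# Eigen-decomposition, sorted by decreasing explained variance.
eigval, eigvec = np.linalg.eigh(corr)
order = np.argsort(eigval)[::-1]
eigval = np.clip(eigval[order], 0.0, None)
eigvec = eigvec[:, order]

# Loadings: correlation between each original variable and each PC score.
loadings = eigvec * np.sqrt(eigval)

# The sign of a PC is arbitrary; orient PC1 so that QY loads positively.
if loadings[3, 0] < 0:
    loadings[:, 0] *= -1

# Plain correlation of each input with the yield, for comparison.
direct = corr[:3, 3]

for i, name in enumerate(names):
    print(name, "PC1:", round(loadings[i, 0], 3), "PC2:", round(loadings[i, 1], 3))
print("corr with QY:", dict(zip(names[:3], np.round(direct, 3))))
```

In this toy setup the inputs are nearly uncorrelated with each other, so the yield is the only variable that shares variance with several others. PC1 therefore lines up with the yield, and the sign of each input's PC1 loading follows the sign of its correlation with the yield. Whether the same reasoning carries over to the paper's data depends on how its rate constants were sampled, which this sketch does not try to reproduce.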
 
