Undergrad Principal Component Analysis with dependent variable?

The discussion revolves around the application of Principal Component Analysis (PCA) to a dataset that includes both independent variables (rate constants) and a dependent variable (quantum yield). The authors of the referenced paper utilize PCA to analyze how these factors affect quantum yield, raising questions about the validity of including a dependent variable in PCA. While PCA typically aims to reduce dimensionality and summarize independent data, the loading plot indicates a strong correlation between the rate constants and quantum yield. The confusion lies in understanding how PCA can elucidate the relationship between these variables when the dependent variable is included. Ultimately, the discussion seeks clarity on the appropriateness of this PCA approach in revealing the contributions of rate constants to quantum yield.
HAYAO
TL;DR
I found a paper applying PCA to a multivariable system in which one of the variables is a dependent variable. I thought the convention was to leave the dependent variable out.
So I found a paper,
https://www.sciencedirect.com/science/article/pii/S0022231314006048?via=ihub
which concerns a property of a compound called "Quantum Yield", which is a result of various factors (independent variables). The authors are trying to figure out what factors affect the quantum yield.

The authors freely allow these factors to change and calculate the Quantum Yield (by numerically solving a system of differential equations). Obviously, this makes "Quantum Yield" a dependent variable. The authors apply PCA to both the various factors and the quantum yield, that is, to both the independent and the dependent variables.

Fig. 4 on page 607 shows the loading plot (correlation circle) of the result. "Yield" refers to the "Quantum Yield". The result is meaningful, at least for this figure, because it shows how the independent variables affect the dependent variable. But I learned that you should separate the dependent variables out before performing PCA. However, it could be the case that including them is appropriate for this objective. So, is this a valid way of using PCA for this purpose?

I'm kind of confused here, and I am sure I must be misunderstanding something simple.
I would greatly appreciate it if anyone could help me.

Thank you.
 
Typically with PCA all the variables are 'dependent'; the goal is to extract orthogonal factors that will then serve as independent variables to explain variation across the set.
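To illustrate the point about orthogonal factors, here is a minimal NumPy sketch. The data, the mixing matrix and every variable name are made up for illustration and do not come from the paper. It centers three correlated variables, takes the principal axes from an SVD, and projects the data onto them; the resulting score columns should be mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: three correlated variables built from two hidden sources.
source = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 0.5, -0.3],
                   [0.2, 1.0, 0.8]])
data = source @ mixing + 0.1 * rng.normal(size=(1000, 3))

# PCA via SVD of the centered data; rows of vt are the principal axes.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Scores: coordinates of each sample along the principal axes.
scores = centered @ vt.T

# Covariance of the scores; off-diagonal entries should vanish.
score_cov = np.cov(scores, rowvar=False)

# Fraction of total variance carried by each component.
explained = singular_values**2 / np.sum(singular_values**2)

print(np.round(score_cov, 6))
print(np.round(explained, 3))
```

No variable is treated as a response here: all three enter the decomposition on the same footing.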
 
The purpose of PCA is to determine the best reduced-variable description of the positions of the data. So it minimizes the sum-squared perpendicular distances from the data to a reduced-dimension space. It has nothing to do with any dependencies. As such, it is not the best predictor of any of the original variables. The best predictor would minimize the sum-squared errors in that variable. That is not the same as minimizing the sum-squared perpendicular distances.
The situation that you are used to is where PCA is used to give a lower-dimension set of variables that best summarize the independent data and are then used in another linear regression to predict the dependent variable.
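The difference between the two criteria can be checked numerically. The sketch below assumes made-up two-variable data (nothing in it comes from the paper) and uses two hypothetical helper functions, `perpendicular_sse` and `vertical_sse`. It compares the line along the first principal component with the ordinary least-squares line of y on x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x plus noise.
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)  # center both variables

# PCA line: direction of the first right-singular vector.
_, _, vt = np.linalg.svd(data, full_matrices=False)
pc1 = vt[0]
pca_slope = pc1[1] / pc1[0]

# OLS line of y on x (no intercept needed after centering).
ols_slope = (data[:, 0] @ data[:, 1]) / (data[:, 0] @ data[:, 0])


def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    direction = np.array([1.0, slope]) / np.hypot(1.0, slope)
    residual = data - np.outer(data @ direction, direction)
    return float((residual**2).sum())


def vertical_sse(slope):
    """Sum of squared errors in y when predicting y = slope * x."""
    return float(((data[:, 1] - slope * data[:, 0]) ** 2).sum())


print("slopes (PCA, OLS):", pca_slope, ols_slope)
print("perpendicular SSE (PCA, OLS):",
      perpendicular_sse(pca_slope), perpendicular_sse(ols_slope))
print("vertical SSE (PCA, OLS):",
      vertical_sse(pca_slope), vertical_sse(ols_slope))
```

The PCA line should give the smaller perpendicular sum and the OLS line the smaller vertical sum, which is why the first principal component is not the best predictor of y.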
 
I'm sorry, I don't think I phrased my question well.

Yes, indeed all the variables could be "dependent" in PCA, and the purpose of PCA is to remove redundant and meaningless variables and extract a new set of orthogonal axes that explains the variation well.

"Quantum Yield" can be explicitly expressed as a function of the "rate constants" (such that QY = k12 + k2*k3 + ...), and the authors let these "rate constants" vary within a certain range to calculate the "quantum yield" for each set of "rate constants".

My question was: how would applying PCA to data that include both the "quantum yield" and the "rate constants" help us understand the main contributions of the "rate constants" to the "quantum yield"? I thought that a PCA loading plot shows how the "rate constants" explain the variation of the data along the selected PC, not the "quantum yield". The loading plot (correlation circle) shows that the "quantum yield" is always at -1.0 (or 1.0 in a different experiment shown in the Supplementary Information) on PC1 and at 0.0 on PC2. I do not believe this is a coincidence. The authors then explain that some "rate constants" lie on the negative side of the loading plot, showing that those "rate constants" contribute negatively to the "quantum yield". Why does the PCA the authors use allow us to understand how the "rate constants" contribute to the "quantum yield"? Or should I rephrase the question: how does including both the "quantum yield" and the "rate constants" in the data set and applying PCA to them allow us to understand how the "rate constants" explain the "quantum yield"?

Although the article does not explicitly state the procedure, I do not think the authors used Principal Component Regression, since the "quantum yield" is plotted explicitly on the loading plot.
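The question of what a loading plot shows when the yield is included can be explored with a toy simulation. Everything in this sketch is an assumption made for illustration: three independently and uniformly sampled "rate constants", and a hypothetical yield function `k1 / (k1 + k2 + k3)` standing in for the paper's numerical solution of the rate equations. It standardizes all four variables, diagonalizes their correlation matrix, and prints the loadings (the correlation of each variable with each PC) on PC1 and PC2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical toy model: three independently sampled "rate constants".
k = rng.uniform(0.5, 1.5, size=(n, 3))
# Toy yield (an assumption): rises with k1, falls with k2 and k3.
qy = k[:, 0] / k.sum(axis=1)

names = ["k1", "k2", "k3", "yield"]
data = np.column_stack([k, qy])

# Standardize so that PCA works on the correlation matrix.
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = (z.T @ z) / n

# Eigen-decomposition, sorted from largest to smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: correlation between each variable (row) and each PC (column).
loadings = eigvecs * np.sqrt(eigvals)

# The sign of a PC is arbitrary; orient PC1 so the yield loads positively.
if loadings[3, 0] < 0:
    loadings[:, 0] *= -1

for name, row in zip(names, loadings):
    print(f"{name:>5}: PC1 = {row[0]:+.2f}, PC2 = {row[1]:+.2f}")
```

If the rate constants are sampled independently (an assumption here; the paper's sampling scheme is not described in this thread), the correlation matrix is close to an identity block bordered by the vector r of correlations between each rate constant and the yield. Its leading eigenvector is then proportional to (r/|r|, 1), so the PC1 loadings of the rate constants inherit the signs of their correlations with the yield, and the components with eigenvalue near 1 carry almost none of the yield. That would be consistent with the yield sitting near ±1.0 on PC1 and near 0.0 on PC2, but it is a property of this toy setup, not a confirmed description of the authors' procedure.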
 