Principal Component Analysis with dependent variable?

HAYAO · Oct 22, 2020

So I found a paper,
https://www.sciencedirect.com/science/article/pii/S0022231314006048?via=ihub
which concerns a property of a compound called "Quantum Yield", which is a result of various factors (independent variables). The authors are trying to figure out what factors affect the quantum yield.

The authors let these factors vary freely and calculate the quantum yield (by numerically solving a system of differential equations). Obviously, this makes the quantum yield a dependent variable. The authors then apply PCA to the various factors and the quantum yield together, so PCA is applied to both the independent and the dependent variables.

Fig. 4 on page 607 shows the loading plot (correlation circle) of the result. "Yield" refers to the quantum yield. The result is meaningful, at least for this figure, because it shows how the independent variables affect the dependent variable. But I learned that you should separate the dependent variables out before performing PCA. However, it could be the case that including them is appropriate for this objective. So, is this a valid way to use PCA for this purpose?

I'm kind of confused here, and I am sure I must be misunderstanding something simple.
I would greatly appreciate it if anyone could help me.

Thank you.

BWV · Oct 22, 2020

Typically with PCA all the variables are 'dependent'; the goal is to extract orthogonal factors that will then serve as independent variables to explain variation across the set.

FactChecker · Oct 22, 2020

The purpose of PCA is to determine the best reduced-variable description of the positions of the data. So it minimizes the sum-squared perpendicular distances from the data to a reduced-dimension space. It has nothing to do with any dependencies. As such, it is not the best predictor of any of the original variables. The best predictor would minimize the sum-squared errors in that variable. That is not the same as minimizing the sum-squared perpendicular distances.
The situation that you are used to is where PCA is used to give a lower-dimension set of variables that best summarize the independent data and are then used in another linear regression to predict the dependent variable.
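A minimal sketch of that difference, assuming synthetic two-variable data (the data, the seed and all names here are made up for illustration). It fits one line by ordinary least squares, which minimizes vertical errors in `y`, and one along the first principal component, which minimizes perpendicular distances, and then scores both lines under both criteria:

```python
import numpy as np

# Synthetic, illustrative data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Center both variables so every fitted line passes through the origin.
X = np.column_stack([x, y])
X = X - X.mean(axis=0)

# Ordinary least squares of y on x: minimizes squared vertical errors.
ols_slope = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

# First principal component: minimizes squared perpendicular distances.
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = vt[0]
pca_slope = pc1[1] / pc1[0]


def vertical_sse(slope):
    """Sum of squared errors in y for the line y = slope * x."""
    return np.sum((X[:, 1] - slope * X[:, 0]) ** 2)


def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return vertical_sse(slope) / (1.0 + slope**2)


print("OLS slope:", ols_slope, "PCA slope:", pca_slope)
print("vertical SSE      (OLS, PCA):", vertical_sse(ols_slope), vertical_sse(pca_slope))
print("perpendicular SSE (OLS, PCA):", perpendicular_sse(ols_slope), perpendicular_sse(pca_slope))
```

Each line should win only under its own criterion: the regression line has the smaller error in `y`, and the principal-component line has the smaller perpendicular distance.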

HAYAO · Oct 22, 2020

I'm sorry, I don't think I phrased my question well.

Yes, indeed all the variables could be "dependent" in PCA, and the purpose of PCA is to remove redundant and meaningless variables and extract a new set of orthogonal axes that explains the variation well.

"Quantum Yield" can be explicitly expressed as a function of "rate constants" (such that QY = k1² + k2*k3 + ...), and the authors let these "rate constants" vary within a certain range to calculate the "quantum yield" for each set of "rate constants".

My question is: how does applying PCA to data that include both the "quantum yield" and the "rate constants" help us understand which "rate constants" contribute most to the "quantum yield"? I thought that a PCA loading plot shows how the "rate constants" explain the variation of the data along the selected PC, not the "quantum yield". The loading plot (correlation circle) shows that the "quantum yield" is always at -1.0 on PC1 (or 1.0 in a different experiment shown in the Supplementary Information) and 0.0 on PC2. I do not believe this is a coincidence. The authors then explain that some "rate constants" lie on the negative side of the loading plot, which shows that those "rate constants" contribute negatively to the "quantum yield". Why does the PCA the authors use allow us to understand how the "rate constants" contribute to the "quantum yield"? Or, to put the question differently: how does including both the "quantum yield" and the "rate constants" in the data set and applying PCA to them allow us to understand how the "rate constants" explain the "quantum yield"?

Although the article does not explicitly state the procedure, I do not think the authors used Principal Component Regression, since the "quantum yield" is plotted explicitly on the loading plot.
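A small numerical experiment may help frame the question. This is only a sketch under loud assumptions: it is not the paper's model. The yield formula `k_r / (k_r + k_nr)`, the variable `k_other` (which deliberately does not enter the yield), the sampling ranges and the seed are all hypothetical stand-ins. The inputs are sampled independently, the yield is computed from them, and a correlation-based PCA is run on all four columns together to get correlation-circle coordinates.

```python
import numpy as np

# Hypothetical toy setup, NOT the model from the paper.
rng = np.random.default_rng(1)
n = 2000
k_r = rng.uniform(0.5, 1.5, n)      # assumed "radiative" rate constant
k_nr = rng.uniform(0.5, 1.5, n)     # assumed "non-radiative" rate constant
k_other = rng.uniform(0.5, 1.5, n)  # assumed input that does not affect the yield
qy = k_r / (k_r + k_nr)             # toy yield, computed from the inputs

names = ["k_r", "k_nr", "k_other", "yield"]
data = np.column_stack([k_r, k_nr, k_other, qy])

# Standardize every column so the PCA works on the correlation matrix.
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA via SVD: rows of vt are the principal directions.
_, s, vt = np.linalg.svd(Z, full_matrices=False)
eigvals = s**2 / n  # eigenvalues of the correlation matrix

# Correlation-circle coordinates: correlation of each variable with each PC.
# Rows are variables, columns are principal components.
loadings = vt.T * np.sqrt(eigvals)

for name, row in zip(names, loadings):
    print(f"{name:8s} PC1 = {row[0]: .3f}  PC2 = {row[1]: .3f}")
```

In this toy setup the inputs are mutually uncorrelated, so the yield is the only variable correlated with several others and it tends to sit close to ±1 on PC1 and near 0 on PC2, with `k_r` on the same side as the yield and `k_nr` on the opposite side. The overall sign of each PC is arbitrary. Whether the same mechanism explains the figure in the paper is something I cannot tell from this sketch alone.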
