Principal Component Analysis with dependent variable?

In summary: PCA is used to reduce redundant and irrelevant variables. In this case, the authors of the paper were interested in understanding why a particular "rate constant" contributed negatively to the "quantum yield".
  • #1
HAYAO
TL;DR Summary
I found a paper that applies PCA to a multivariable system in which one of the variables is a dependent variable. I thought the convention was to leave the dependent variable out.
So I found a paper,
https://www.sciencedirect.com/science/article/pii/S0022231314006048?via=ihub
which concerns a property of a compound called "Quantum Yield", which is a result of various factors (independent variables). The authors are trying to figure out what factors affect the quantum yield.

The authors freely allow these factors to change and calculate the quantum yield (by numerically solving a system of differential equations). Obviously, this makes the quantum yield a dependent variable. The authors apply PCA to both the various factors and the quantum yield, that is, to both the independent and the dependent variables.

Fig. 4 on page 607 shows the loading plot (correlation circle) of the result. "Yield" refers to the quantum yield. The result is meaningful, at least for this figure, because it shows how the independent variables affect the dependent variable. But I learned that you should separate the dependent variables out before you perform PCA. However, it could be the case that including them is appropriate for this objective. So, is this a valid approach to using PCA for this purpose?

I'm kind of confused here, and I am sure I must be making some silly misunderstanding.
I would greatly appreciate it if anyone could help me.

Thank you.
 
  • #2
Typically with PCA all the variables are 'dependent' - the goal being to extract orthogonal factors that will then serve as independent variables to explain variation across the set.
 
  • #3
The purpose of PCA is to determine the best reduced-variable description of the positions of the data. So it minimizes the sum-squared perpendicular distances from the data to a reduced-dimension space. It has nothing to do with any dependencies. As such, it is not the best predictor of any of the original variables. The best predictor would minimize the sum-squared errors in that variable. That is not the same as minimizing the sum-squared perpendicular distances.
The situation that you are used to is where PCA is used to give a lower-dimension set of variables that best summarize the independent data and are then used in another linear regression to predict the dependent variable.
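The difference between the two criteria described above can be checked numerically. The following is a minimal sketch on synthetic two-variable data (the data, seed, and variable names are assumptions made for illustration): it fits the first principal direction and an ordinary least-squares line, then compares the vertical and perpendicular sums of squares for each.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data (illustrative assumption): y depends on x plus noise.
n = 2000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Center both variables so that both fitted lines pass through the origin.
xc = x - x.mean()
yc = y - y.mean()
data = np.column_stack([xc, yc])

# OLS slope: minimizes the sum of squared vertical errors in y.
ols_slope = (xc @ yc) / (xc @ xc)

# PC1 direction: leading right singular vector of the centered data.
# It minimizes the sum of squared perpendicular distances to the line.
_, _, vt = np.linalg.svd(data, full_matrices=False)
pc1 = vt[0]
pca_slope = pc1[1] / pc1[0]


def vertical_sse(slope):
    """Sum of squared errors in y for the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2)


def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2) / (1.0 + slope ** 2)


print("OLS slope:", ols_slope, "PC1 slope:", pca_slope)
print("vertical SSE      (OLS, PC1):", vertical_sse(ols_slope), vertical_sse(pca_slope))
print("perpendicular SSE (OLS, PC1):", perpendicular_sse(ols_slope), perpendicular_sse(pca_slope))
```

The two slopes generally differ: the OLS line has the smaller vertical error, while the PC1 line has the smaller perpendicular error, which is why PC1 is not the best predictor of y.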
 
  • #4
I'm sorry, I don't think I phrased my question well.

Yes, indeed all the variables could be "dependent" in PCA, and the purpose of PCA is to reduce redundant and meaningless variables and to extract a new set of orthogonal axes that explains the variation well.

"Quantum Yield" can be explicitly expressed as a function of "rate constants" (such that QY = k12 + k2*k3 + ...), and the authors let these "rate constants" vary within a certain range to calculate the "quantum yield" for each set of "rate constants".

My question was: how would applying PCA to data that include both the "quantum yield" and the "rate constants" help us understand the main contributions of the "rate constants" to the "quantum yield"? I thought that a PCA loading plot shows how the "rate constants" explain the "variation of the data" along the selected PC, not the "quantum yield". The loading plot (correlation circle) shows that the "quantum yield" is always at -1.0 (or 1.0 in a different experiment shown in the Supplementary Information) on PC1, and at 0.0 on PC2. I do not believe this is a coincidence. The authors then explain that the loading plot shows some "rate constants" on the negative side, meaning that those "rate constants" contribute negatively to the "quantum yield". Why does the PCA the authors use allow us to understand how the "rate constants" contribute to the "quantum yield"? Or, to put the question differently: how does including both the "quantum yield" and the "rate constants" in the data set and applying PCA to them allow us to understand how the "rate constants" explain the "quantum yield"?

Although the article does not explicitly state the procedures, I do not think the authors used any Principal Component Regression since "quantum yield" is plotted on the loading plot explicitly.
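One way to explore this question is with a toy simulation. The sketch below assumes a hypothetical two-rate model, yield = k_r / (k_r + k_nr), which is not taken from the paper; the names `k_r` and `k_nr`, the sampling ranges, and the seed are all illustrative assumptions. It samples the two rate constants independently, computes the yield from them, runs PCA on the correlation matrix of all three columns, and prints the loadings (the correlation of each variable with each PC).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical inputs: two independently sampled rate constants (arbitrary units).
k_r = rng.uniform(0.5, 1.5, size=n)
k_nr = rng.uniform(0.5, 1.5, size=n)

# Hypothetical dependent variable, computed from the inputs (toy model only).
qy = k_r / (k_r + k_nr)

names = ["k_r", "k_nr", "yield"]
X = np.column_stack([k_r, k_nr, qy])

# PCA on standardized data is an eigendecomposition of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; sort them in descending order.
order = np.argsort(eigvals)[::-1]
eigvals = np.clip(eigvals[order], 0.0, None)
eigvecs = eigvecs[:, order]

# Loadings: correlation between each original variable and each PC.
# The overall sign of each PC is arbitrary.
loadings = eigvecs * np.sqrt(eigvals)

for name, row in zip(names, loadings):
    print(f"{name:>6}: PC1 = {row[0]:+.3f}, PC2 = {row[1]:+.3f}")
```

In this toy setup the inputs are mutually uncorrelated, so the only correlations in the data run through the yield. That places the yield close to ±1 on PC1 and close to 0 on PC2, and an input that raises the yield lands on the same side of PC1 as the yield while an input that lowers it lands on the opposite side. Whether the same mechanism explains Fig. 4 of the paper is an assumption that would have to be checked against the authors' actual data.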
 

What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining most of the information. It does this by creating new variables, known as principal components, that are linear combinations of the original variables.
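As a minimal sketch of this definition, the code below computes principal components with a singular value decomposition of the centered data. The function name `pca` and the synthetic data are assumptions made for illustration, not a reference implementation.

```python
import numpy as np


def pca(X, n_components):
    """Return (components, scores, explained_variance_ratio) for data X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # rows are unit-length directions
    scores = Xc @ components.T               # linear combinations of the columns
    explained = S ** 2 / np.sum(S ** 2)      # share of total variance per PC
    return components, scores, explained[:n_components]


# Synthetic example: three columns driven by one latent factor plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(500, 3))

components, scores, ratio = pca(X, n_components=2)
print("components:\n", components)
print("explained variance ratio:", ratio)
```

Because the three columns here share one underlying factor, the first component is expected to carry almost all of the variance, which is the sense in which PCA "retains most of the information" in fewer variables.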

How does PCA handle datasets with a dependent variable?

PCA itself makes no distinction between dependent and independent variables; it treats every column of the data in the same way. In the common predictive workflow, the dependent variable is left out, PCA is applied to the independent variables only, and the resulting principal components are then used as inputs for prediction or classification tasks.
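The predictive workflow just mentioned is often called principal component regression. The following is a rough sketch on synthetic data; the data-generating model, the choice of two components, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Hypothetical predictors: four correlated columns driven by two latent factors.
latent = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.5, -1.0, 0.0],
                   [0.0, 1.0, 0.5, -1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(n, 4))

# Hypothetical dependent variable; it is NOT included in the PCA step.
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

# Step 1: PCA on the predictors only.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]
scores = Xc @ components.T

# Step 2: ordinary least squares of y on the PC scores (with an intercept).
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef

r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("coefficients:", coef)
print("R^2:", r_squared)
```

Note that the components are chosen without looking at y, so a component with small variance that happens to predict y well could be discarded; this is a known limitation of the approach.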

What are the assumptions of PCA with a dependent variable?

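PCA has no separate set of assumptions for a dependent variable. It captures only linear relationships among the variables, and it is sensitive to the scale of each variable, so the variables are commonly standardized first. Normality and homoscedasticity are not required to compute principal components. If the dependent variable is included, a strong correlation with the other variables is not a defect in itself; it determines how strongly that variable loads on the leading components.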

How is PCA with a dependent variable different from traditional PCA?

The algorithm is the same in both cases; the only difference is which columns enter the data matrix. When the dependent variable is included, its variance and its correlations with the other variables contribute to the principal components, so it appears in the loading plot alongside them. The components are still chosen to explain the variation of the whole data set, not to predict the dependent variable.

What are the benefits of using PCA with a dependent variable?

Including the dependent variable in a PCA gives an exploratory view of how it correlates with the other variables, for example in a loading plot. It does not by itself produce the best predictor of the dependent variable. For prediction, applying PCA to the independent variables first can reduce dimensionality and collinearity, which may reduce the risk of overfitting.
