# I Regression on extracted factors

1. Jul 31, 2017

My goal is to regress a dependent variable $y$ on a given set of independent variables $x_1$, $x_2$, ..., $x_m$. I have a data set of $n$ samples. I found that the variables are correlated, so I decided to use factor analysis to represent them with a smaller number $k<m$ of uncorrelated factors $v$.
I would like to know how to regress $y$ on $v_1$, $v_2$, ..., $v_k$ for each data sample, so that the model takes the form $y=b_0+b_1 v_1+...+b_k v_k$.
I know the factor loading matrix expresses the variables $x$ as a linear combination of the factors $v$, in the form $x=Fv$, where $F$ is the factor loading matrix, but how does this help in my case? I assume I need the opposite, which is to express $v$ in terms of the given $x$. I thought of obtaining $v$ from $x$ by the inverse transformation, but $F$ is not a square matrix, so it cannot be inverted.
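
For reference, one common way around a non-square $F$ is to use the Moore-Penrose pseudoinverse, which gives the least-squares estimate $\hat v = (F^T F)^{-1} F^T x$ when $F$ has full column rank. The sketch below is only an illustration under stated assumptions: the loading matrix `F`, the data `X`, and the response `y` are all simulated, and the unique (error) variances that a full factor-analysis score estimator would weight by are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 6, 2                      # samples, variables, factors (illustrative sizes)

# Simulated stand-ins: a hypothetical loading matrix and latent factors.
F = rng.normal(size=(m, k))              # loading matrix, m x k (not square)
V_true = rng.normal(size=(n, k))         # latent factor values, one row per sample
X = V_true @ F.T + 0.01 * rng.normal(size=(n, m))   # x = F v + small noise
y = 1.0 + V_true @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=n)

# Least-squares factor scores: v_hat = pinv(F) x, applied to every sample (row of X).
V_hat = X @ np.linalg.pinv(F).T          # shape (n, k)

# Ordinary least squares of y on [1, v_1, ..., v_k].
A = np.column_stack([np.ones(n), V_hat])
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # b[0] = b_0, b[1:] = b_1..b_k
```

Each row of `V_hat` holds the estimated factor values for one sample, so the regression $y=b_0+b_1 v_1+...+b_k v_k$ is fitted on those rows.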

Last edited: Jul 31, 2017
2. Jul 31, 2017

I also thought about the following: instead of doing factor analysis, I could do an SVD (singular value decomposition) of the original $m \times n$ data matrix. I would then truncate it to the reduced form $USV^T$, where $V$ is an $n \times k$ matrix, and do the regression directly on the pairs $(v_i,y_i)$, $i=1...n$, where $v_i$ is the $i$-th row of $V$. I am not sure whether this would be a suitable method. And even if I do that, will the number of factors extracted in factor analysis correspond to the number of eigenvalues kept in the reduced form of the data matrix here?
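
A minimal sketch of this SVD route (often called principal component regression) is given here for concreteness. It keeps the layout described above, with variables in rows and samples in columns, so the data matrix is $m$ by $n$. The data, the value of `k`, and all variable names are assumptions made for illustration. The last line computes the eigenvalues of the sample covariance matrix, which equal the squared singular values of the centered data divided by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 5, 300, 2                       # variables, samples, kept components (illustrative)

# Simulated data: rows are variables, columns are samples.
X = rng.normal(size=(m, n))
X[3] = X[0] + 0.1 * rng.normal(size=n)    # make two variables strongly correlated
y = 0.5 + 2.0 * X[0] - X[1] + 0.1 * rng.normal(size=n)

# Center each variable, then take the thin SVD: Xc = U diag(s) Vt.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the first k right singular vectors: V_k is n x k, one row per sample.
V_k = Vt[:k].T

# Regress y on [1, v_1, ..., v_k].
A = np.column_stack([np.ones(n), V_k])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# Eigenvalues of the sample covariance matrix, largest first.
eig = np.sort(np.linalg.eigvalsh(np.cov(Xc)))[::-1]
```

The columns of `V_k` are orthonormal, so the regressors are uncorrelated by construction. How many components to keep is a separate choice that the decomposition itself does not settle.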

Last edited: Jul 31, 2017
3. Aug 1, 2017

### FactChecker

You may want to reconsider your decision to use FA or SVD just because the $x_i$s are correlated. Independent variables are almost always correlated to some extent, yet stepwise regression can still be used. The disadvantage of FA and SVD is that you end up with abstract factors that are combinations of all your $x_i$s and that are hard to interpret. I think it is better to use those techniques only when your goal is to formulate abstract factors and general concepts from data.

The advantage of stepwise linear regression over FA is that the final model is in terms of a limited number of $x_i$s, all of which are understandable. Forward stepwise linear regression first introduces the most statistically significant variable. It then removes the influence of that variable from all the other variables, leaving residuals. Next it considers the variable with the most statistically significant residuals and includes it in the model only if that is statistically justified. It continues in that manner until there are no more statistically significant residuals to include in the model. That process keeps correlated variables out of the model unless something remains that the later variable is needed to explain. There are algorithms for forward, backward, and bidirectional stepwise regression. I recommend bidirectional.
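
The forward procedure described above can be sketched roughly as follows. This is an illustration, not a reference implementation: it scores each candidate with a partial F statistic (the drop in residual sum of squares from adding that variable), which has the same effect as testing the candidate after the variables already in the model have been accounted for. The function name `forward_stepwise`, the fixed entry threshold `f_enter=4.0` (close to the 5% point of an F distribution with 1 and many degrees of freedom), and the simulated data are all assumptions.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of the least-squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_stepwise(X, y, f_enter=4.0):
    """Forward selection: X is n x m (samples in rows); returns chosen column indices."""
    n, m = X.shape
    selected = []
    A = np.ones((n, 1))                   # start with the intercept only
    rss_cur = rss(A, y)
    while len(selected) < m:
        best = None
        for j in range(m):
            if j in selected:
                continue
            A_j = np.column_stack([A, X[:, j]])
            rss_j = rss(A_j, y)
            df = n - A_j.shape[1]         # residual degrees of freedom with x_j added
            f_stat = (rss_cur - rss_j) / (rss_j / df)   # partial F for adding x_j
            if best is None or f_stat > best[0]:
                best = (f_stat, j, rss_j)
        if best[0] < f_enter:             # no remaining variable is justified
            break
        selected.append(best[1])
        A = np.column_stack([A, X[:, best[1]]])
        rss_cur = best[2]
    return selected

# Simulated example: column 3 is correlated with column 0 but adds nothing new.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
X[:, 3] = X[:, 0] + 0.5 * rng.normal(size=n)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.normal(size=n)
selected = forward_stepwise(X, y)
```

A bidirectional version would add a removal step after each entry, dropping any variable already in the model whose partial F has fallen below a second threshold.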

Last edited: Aug 1, 2017