# Factor analysis versus Bayes estimates?

1. May 19, 2015

I am doing retrospective research on breast cancer using multivariate estimates. The aim of the research is to calculate the probability of finding breast cancer given multiple independent variables (IVs), so the outcome is binary in nature (cancer versus benign). Some of the IVs are correlated. My question: should I do factor analysis (FA) on the IVs before starting my logistic regression? I know that FA and PCA are used to reduce noisy data and get better estimates by exploiting the higher variance along the directions of the principal components. But intuitively, I suspect that even if the variables are correlated, together they can still strongly differentiate the cancer from the benign cases. I infer this second point from Bayes' equation, where the posterior odds for cancer are proportional to the product of the per-variable likelihood ratios, which gives credit to keeping the correlated IVs. So, how can these two ideas be reconciled?
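The Bayes intuition in the question can be sketched as a naive-Bayes toy calculation, assuming conditional independence of the IVs given the class; all the numbers below are invented for illustration:

```python
# Sketch: posterior odds for cancer as a product of per-variable
# likelihood ratios (naive Bayes, i.e. assuming the IVs are
# conditionally independent given the class -- exactly the assumption
# the question is probing). All numbers are made up.

def posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds * product of per-variable LRs."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# Hypothetical likelihood ratios p(x_i | cancer) / p(x_i | benign)
lrs = [2.0, 1.5, 3.0]
prior = 0.1 / 0.9            # prior odds of cancer (10% prevalence)

odds = posterior_odds(prior, lrs)
prob = odds / (1.0 + odds)   # convert odds back to a probability
print(round(prob, 4))        # -> 0.5
```

If the IVs are strongly correlated, multiplying their individual likelihood ratios double-counts the shared evidence, which is why the independence question matters.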

2. May 22, 2015

### FactChecker

A factor analysis or principal component analysis will tell you which linear combination of independent variables gives the best single-dimension approximation of the independent-variable data pattern. There is no reason to think that will also give the best predictor of cancer. In fact, it may cause you to emphasize variables that have very little influence on cancer. You would be better off using something like stepwise multivariate regression, which explicitly searches for the best predictor of cancer. Stepwise regression accounts for any correlations among the independent variables in a very logical and methodical way.

3. May 23, 2015

My previous suggestion to use FA was for the following reasons:
1) If the correlation between the variables I suspect are related is above 0.5 (according to some references), then FA or PCA can reduce them to a single variable that explains most of their shared variance.
2) Reducing the number of variables also reduces the number of parameters (degrees of freedom) the model has to estimate.
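As a sketch of point 1, assuming NumPy and synthetic data invented for illustration, PCA via the SVD reduces two highly correlated variables to a single component that carries nearly all of their variance:

```python
import numpy as np

# Minimal PCA sketch: reduce two strongly correlated IVs to one
# component that captures most of their shared variance.
# The data are synthetic, for illustration only.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # correlation well above 0.5

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                      # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)              # variance explained per PC
pc1 = Xc @ Vt[0]                             # the single combined variable
print(explained[0] > 0.95)                   # first PC carries almost all variance
```

The catch, as discussed below, is that "most of their variance" is not the same thing as "most of the variance relevant to the outcome."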

4. May 23, 2015

### Stephen Tashi

I'm curious what the definition of "best predictor" would be in this situation. If certain combinations of variable values are rare in the population, then the task of making a prediction for such cases would rarely arise. Does measuring a predictor's error involve weighting its error on particular cases by the frequency of those cases?

5. May 23, 2015

### FactChecker

Yes. Exactly. The problem is that you want to find the variance that is most strongly correlated with cancer occurrence, and that is not what FA or PCA optimizes. You run the risk that FA or PCA will hide the variance you are really looking for.
That can be bad. Suppose you had data that measured the size of people in many ways (height, weight, arm span, waist size, etc.) but only one independent variable that mattered (the amount of cigarette smoking). Then your principal factor would be a combination of the size measurements and would ignore the smoking. So your results from there on would have very little correlation with lung cancer occurrence.
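This scenario is easy to simulate. The sketch below uses synthetic data (variable names and numbers are invented) to show the first principal component loading almost entirely on the correlated size measurements while nearly ignoring the independent smoking variable:

```python
import numpy as np

# Several correlated "size" variables plus one independent "smoking"
# variable: PC1 is dominated by the size cluster, so PCA would discard
# the only variable that matters for the outcome. Synthetic data.
rng = np.random.default_rng(1)
n = 500
size = rng.normal(size=n)                    # latent body-size factor
height   = size + 0.1 * rng.normal(size=n)
weight   = size + 0.1 * rng.normal(size=n)
arm_span = size + 0.1 * rng.normal(size=n)
smoking  = rng.normal(size=n)                # independent of body size

X = np.column_stack([height, weight, arm_span, smoking])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = np.abs(Vt[0])                     # |weights| of the first PC
print(loadings.round(2))                     # smoking's weight is near 0
```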

It is much better to use a statistical approach that is designed to explain the dependent variable and to include only the independent variables that contribute the most toward that goal. Stepwise linear regression does exactly that. You don't have to worry about correlated independent variables -- stepwise regression takes care of that. First it puts in the independent variable most highly correlated with cancer. Then it looks for the independent variable that would correlate best with the remaining, unexplained cancer. If it is statistically justified, it adds that variable to the regression. (Otherwise, it stops.) Then it looks for a third independent variable that correlates best with the cancer not accounted for by the first two variables. It proceeds logically that way until there is no statistical justification for adding any of the remaining variables. So you see that a variable which is highly correlated with one already included would not contribute much toward explaining the remaining cancer, so it would be unlikely to be included in the regression. If it is, it would be at a low level.
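A minimal sketch of this forward stepwise procedure, with NumPy and invented toy data; a simple R²-improvement threshold stands in here for the F-test a real analysis would use as the stopping rule:

```python
import numpy as np

# Forward stepwise selection: repeatedly add the variable that best
# explains the part of y the current model has not yet explained,
# stopping when the improvement in R^2 is negligible.

def forward_stepwise(X, y, min_gain=0.01):
    n, p = X.shape
    selected, r2 = [], 0.0
    while True:
        best_j, best_r2 = None, r2
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            cand_r2 = 1.0 - resid.var() / y.var()
            if cand_r2 > best_r2:
                best_j, best_r2 = j, cand_r2
        if best_j is None or best_r2 - r2 < min_gain:
            return selected
        selected.append(best_j)
        r2 = best_r2

# Toy data: x0 drives y, x1 is a near-copy of x0, x2 is pure noise.
rng = np.random.default_rng(2)
x0 = rng.normal(size=300)
x1 = x0 + 0.05 * rng.normal(size=300)        # highly correlated with x0
x2 = rng.normal(size=300)
y = 2.0 * x0 + 0.1 * rng.normal(size=300)
X = np.column_stack([x0, x1, x2])
print(forward_stepwise(X, y))                # typically just [0]
```

Note how the near-duplicate variable x1 is not added: once x0 is in the model, x1 explains almost none of the remaining variation, which is the correlation-handling behavior described above.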

Last edited: May 23, 2015
6. May 23, 2015

Thanks, stepwise linear regression was new to me.

But there is one catch in the methodology in general which I have only just noticed. How do I infer the probability of the occurrence of cancer from a retrospective (case-control) study, which is only supposed to infer the occurrence of the exposure? My aim in the study is to calculate the probability of cancer given all the explanatory variables. Is there a transformation that yields the odds of cancer from the odds of exposure (the presence of the explanatory variables)?

7. May 23, 2015

Yes, the error should be weighted by the frequency of the cases. For example, if the presence of macro-calcification in the tumor does not correlate significantly with cancer because of the large variance of the estimate, then the weight given to this variable in the prediction will likewise be insignificant.

8. May 24, 2015

To reiterate my concern about deriving the odds of the disease given the variable(s) from the odds of the variable given the disease: as I mentioned, I am doing a retrospective study, where the odds of the variable given the disease can be calculated. However, my real concern is calculating the odds of the disease given the variable(s), because that is what matters to the doctor and to the patient who presents with those variable(s).

I am thinking of doing the following: conduct the retrospective study as usual and derive the odds of the variable(s) given the disease as expected, then calculate the probability of having the variable given the disease, p(variable | disease), from the logistic regression. Finally, convert that probability to the probability of having the disease given the variable via Bayes' equation: p(disease | variable) = π(disease) p(variable | disease) / [ π(disease) p(variable | disease) + π(normal) p(variable | normal) ], where "normal" stands for the normal cases without the disease, π(disease) is the prevalence of the disease in the population, and π(normal) is the prevalence of normal cases in the population.
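This proposed conversion can be checked numerically. The sketch below plugs invented values for the prevalence and the conditional probabilities into the Bayes formula above:

```python
# Sketch of the proposed conversion, with made-up numbers: turn
# p(variable | disease) from the retrospective data into
# p(disease | variable) using the population prevalence.

def bayes_posterior(prevalence, p_var_given_disease, p_var_given_normal):
    """p(disease | variable) via Bayes' rule, with pi(normal) = 1 - prevalence."""
    num = prevalence * p_var_given_disease
    den = num + (1.0 - prevalence) * p_var_given_normal
    return num / den

# Hypothetical values: 1% prevalence, 80% sensitivity, 10% false-positive rate
p = bayes_posterior(0.01, 0.80, 0.10)
print(round(p, 4))   # -> 0.0748
```

The example also shows why the prevalence term matters: with a rare disease, even a sensitive marker yields a modest posterior probability.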