Factor analysis versus Bayes estimates?

In summary, the conversation discusses the use of factor analysis or principal component analysis in retrospective research on breast cancer using multivariate estimates. These techniques are used to reduce noisy data and obtain better estimates, but it is argued that they will not necessarily yield the best predictor of cancer. Instead, stepwise multivariate regression is suggested, which accounts for the correlations among the independent variables in a methodical way and includes only those that contribute the most toward explaining the dependent variable. It is also noted that reducing the number of variables decreases the degrees of freedom, which can be problematic if important variables are overlooked.
I am doing retrospective research on breast cancer using multivariate estimates. The aim of the research is to calculate the probability of finding breast cancer given multiple independent variables (IVs), so the outcome is binary in nature (cancer versus benign). Some of the IVs are correlated. My question: should I do factor analysis on the IVs before starting my logistic regression? I know that FA or PCA is used to reduce noisy data and obtain better estimates using the higher variance along the directions of the principal components. But intuitively, I suspect that even if multiple variables are correlated, they may still be able to strongly differentiate the cancer from the benign cases. I infer this second point from Bayes' equation, where the posterior odds for cancer are proportional to the product of the sensitivities of the individual variables, which gives credit to the presence of correlated IVs. So, how can both ideas be reconciled with each other?

A factor analysis or principal component analysis will tell you what linear combination of independent variables gives the best single-dimension approximation of the independent variable data pattern. There is no reason to think that will also give the best predictor of cancer. In fact, it may cause you to emphasize variables that have very little influence on cancer. You would be better off using something like stepwise multivariate regression that is explicitly looking for the best predictor of cancer. Stepwise regression will account for any correlations of the independent variables in a very logical and methodical way.

FactChecker said:
A factor analysis or principal component analysis will tell you what linear combination of independent variables gives the best single-dimension approximation of the independent variable data pattern. There is no reason to think that will also give the best predictor of cancer. In fact, it may cause you to emphasize variables that have very little influence on cancer. You would be better off using something like stepwise multivariate regression that is explicitly looking for the best predictor of cancer. Stepwise regression will account for any correlations of the independent variables in a very logical and methodical way.
My previous suggestion to use FA is due to the following reasons:
1) If the correlation between the variables that I suspect are related is more than 0.5, then according to some references, FA or PCA would be better for reducing them into a single best variable that explains most of their variance.
2) Reducing the number of variables will also reduce the degrees of freedom.

I'm curious what the definition of "best predictor" would be in this situation. If certain combinations of the values of the variables are rare in the population, then the task of making a prediction for such cases would also be rare. Does measuring the error of a predictor involve weighting its error on particular cases by the frequency of those cases?

FA or PCA will be better to reduce them into a single best variable that explains most of their variance.
Yes. Exactly. The problem is that you want to find the variance that is most strongly correlated with cancer occurrence. That is not the same thing. You run the risk that FA or PCA will hide the variance that you really are looking for.
2) Reducing the number of variables will also reduce the degrees of freedom.
That can be bad. Suppose you had data that measured the size of people in many ways (height, weight, arm span, waist size, etc.) but only one independent variable that mattered (the amount of cigarette smoking). Then your principal factor would be a combination of size measurements and would ignore the smoking, so your results from there on would have very little correlation with lung cancer occurrence.
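This hiding effect can be seen in a small simulation (all numbers made up, not from any study): three highly correlated "size" measurements dominate the total variance, so the top principal component loads on size and carries almost no weight on the small-variance variable that actually matters.

```python
# Made-up data: three correlated "size" measurements plus one
# small-variance variable (smoking). PCA picks the direction of maximal
# variance, so its top component loads on size and nearly ignores smoking.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
size = rng.normal(size=n)                    # latent body-size factor
height = size + 0.1 * rng.normal(size=n)
weight = size + 0.1 * rng.normal(size=n)
arm_span = size + 0.1 * rng.normal(size=n)
smoking = 0.3 * rng.normal(size=n)           # small variance, but predictive

X = np.column_stack([height, weight, arm_span, smoking])
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                             # loadings of the top component
print(np.round(pc1, 2))   # smoking's loading is close to zero
```

The three size loadings come out roughly equal while the smoking loading is near zero, which is exactly the hiding problem described above.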

It is much better to use a statistical approach that is designed to explain the dependent variable and to include only the independent variables that contribute the most toward that goal. Stepwise linear regression does exactly that. You don't have to worry about correlated independent variables -- stepwise regression takes care of that. First it puts in the independent variable most highly correlated with cancer. Then it looks for the independent variable that would correlate best with the remaining, unexplained cancer. If it is statistically justified, it adds that variable to the regression. (Otherwise, it stops.) Then it looks for a third independent variable that correlates best with the cancer that was not accounted for by the first two variables. It proceeds logically that way until there is no statistical justification for adding any of the remaining variables. So you see that a variable which is highly correlated with one that was already included would not contribute much toward explaining the remaining cancer. It would therefore not be likely to be included in the regression, and if it is, it would be at a low level.
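The forward-selection loop just described can be sketched in a few lines of numpy. This is a toy version: the 0.1 correlation threshold is an arbitrary stand-in for a proper significance test (e.g. a partial F-test), and the data are made up.

```python
# Toy forward-stepwise selection: repeatedly add the variable most
# correlated with the part of y not yet explained; stop when the best
# remaining correlation falls below an (arbitrary) threshold.
import numpy as np

def forward_stepwise(X, y, min_corr=0.1):
    n, p = X.shape
    selected = []
    residual = y - y.mean()
    remaining = list(range(p))
    while remaining:
        corrs = [abs(np.corrcoef(X[:, j], residual)[0, 1]) for j in remaining]
        best = int(np.argmax(corrs))
        if corrs[best] < min_corr:
            break
        selected.append(remaining.pop(best))
        # refit on all selected variables and recompute the residual
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = y - Xs @ beta
    return selected

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)    # nearly a copy of x1
x3 = rng.normal(size=500)
y = 2 * x1 + x3 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
print(forward_stepwise(X, y))  # the near-duplicate adds little once its twin is in
```

One of the highly correlated pair enters first, x3 enters next, and the near-duplicate is unlikely to be selected at all, matching the behavior described above.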

FactChecker said:
Yes. Exactly. The problem is that you want to find the variance that is most strongly correlated with cancer occurrence. That is not the same thing. You run the risk that FA or PCA will hide the variance that you really are looking for.
That can be bad. Suppose you had data that measured the size of people in many ways (height, weight, arm span, waist size, etc.) but only one independent variable that mattered (the amount of cigarette smoking). Then your principal factor would be a combination of size measurements and would ignore the smoking, so your results from there on would have very little correlation with lung cancer occurrence.

It is much better to use a statistical approach that is designed to explain the dependent variable and to include only the independent variables that contribute the most toward that goal. Stepwise linear regression does exactly that. You don't have to worry about correlated independent variables -- stepwise regression takes care of that. First it puts in the independent variable most highly correlated with cancer. Then it looks for the independent variable that would correlate best with the remaining, unexplained cancer. If it is statistically justified, it adds that variable to the regression. (Otherwise, it stops.) Then it looks for a third independent variable that correlates best with the cancer that was not accounted for by the first two variables. It proceeds logically that way until there is no statistical justification for adding any of the remaining variables. So you see that a variable which is highly correlated with one that was already included would not contribute much toward explaining the remaining cancer. It would therefore not be likely to be included in the regression, and if it is, it would be at a low level.
Thanks, stepwise linear regression was new to me.

But there is one catch related to the methodology in general which I have just noticed. How do I infer the probability of the occurrence of the cancer from a retrospective (case-control) study, which is supposed to infer the occurrence of the exposure? My aim in the study is to calculate the probability of cancer given all explanatory variables. Is there a transformation that yields the odds of the cancer from the odds of the exposure (the presence of the explanatory variables)?

Stephen Tashi said:
I'm curious what the definition of "best predictor" would be in this situation. If certain combinations of the values of the variables are rare in the population, then the task of making a prediction for such cases would also be rare. Does measuring the error of a predictor involve weighting its error on particular cases by the frequency of those cases?
Yes, the error should be weighted by the number of cases. For example, if we find that the presence of macro-calcification in the tumor does not correlate statistically significantly with the prediction of cancer because of the large variance of the error, then the weight given to this variable in the prediction will also be insignificant.

To reiterate my concern about deriving the odds of the disease given the variable(s) from the odds of the variable given the disease: as I mentioned, I am doing a retrospective study, where the odds of the variable given the disease can be calculated. However, my real concern is calculating the odds of the disease given the variable(s), because this is what matters for the doctor and for the patient who presents with some variable(s).

I am thinking of doing the following: I can conduct a retrospective study as usual and derive the odds of the variable(s) given the disease as expected. Then I calculate the probability of having the variable given the disease, p(variable | disease), from the logistic regression. Finally, I convert that to the probability of having the disease given the variable via Bayes' equation: p(disease | variable) = π(disease) p(variable | disease) / [π(disease) p(variable | disease) + π(normal) p(variable | normal)], where "normal" stands for the normal cases without the disease, π(disease) is the prevalence of the disease in the population, and π(normal) is the prevalence of the normal cases.
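Plugging hypothetical numbers into that equation (prevalence π(disease) = 0.01, sensitivity p(variable | disease) = 0.9, false-positive rate p(variable | normal) = 0.1, all made up for illustration):

```python
# Bayes conversion from p(variable | disease) to p(disease | variable),
# using hypothetical prevalence and sensitivity values.
prior_disease = 0.01            # π(disease), made-up prevalence
prior_normal = 1 - prior_disease
p_var_given_disease = 0.9       # made-up sensitivity
p_var_given_normal = 0.1        # made-up false-positive rate

posterior = (prior_disease * p_var_given_disease) / (
    prior_disease * p_var_given_disease + prior_normal * p_var_given_normal
)
print(round(posterior, 4))  # 0.0833
```

Even with a sensitive variable, the low prevalence keeps the posterior probability of disease modest, which is why the population prevalence term matters for converting case-control results into clinically useful probabilities.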

Attachments

• case control.png

What is the difference between factor analysis and Bayes estimates?

Factor analysis is a statistical method used to reduce a large number of variables into a smaller set of factors that explain the relationships among the variables. Bayes estimates, on the other hand, use Bayes' theorem to update prior knowledge about a probability as more evidence or data is collected.
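As a minimal sketch of that updating with arbitrary numbers: a Beta(2, 2) prior on a proportion, updated after observing 7 positives in 10 trials (the standard conjugate Beta-Binomial update).

```python
# Conjugate Beta-Binomial update: add observed counts to the prior's
# pseudo-counts. All numbers here are arbitrary for illustration.
prior_a, prior_b = 2, 2          # Beta(2, 2) prior on the proportion
positives, negatives = 7, 3      # observed data

post_a, post_b = prior_a + positives, prior_b + negatives
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)  # 9/14 ≈ 0.643
```

The posterior mean sits between the prior mean (0.5) and the observed frequency (0.7), illustrating how prior belief is updated by evidence.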

Which one is better for data analysis, factor analysis or Bayes estimates?

It depends on the type of data you have and the research question you are trying to answer. Factor analysis is useful for identifying underlying factors and relationships among variables, while Bayes estimates are useful for making predictions or updating prior beliefs.

Can factor analysis and Bayes estimates be used together?

Yes, factor analysis and Bayes estimates can be used together. For example, factor analysis can be used to identify the most important variables, which can then be used in Bayes estimates to make predictions or update beliefs.

What are the assumptions of factor analysis and Bayes estimates?

The main assumption of factor analysis is that the observed variables are influenced by a smaller number of underlying factors. Bayes estimates assume a prior belief, a likelihood for the data, and the ability to update beliefs based on new evidence.

How do I choose between factor analysis and Bayes estimates for my research?

Choosing between factor analysis and Bayes estimates depends on the research question and the type of data you have. If you are interested in understanding the underlying factors or relationships among variables, factor analysis would be more appropriate. If your goal is to make predictions or update prior beliefs, then Bayes estimates would be more suitable.
