# What is the reasoning for this prediction?

## Main Question or Discussion Point

I am looking at a solution. The problem is to predict the number of medical doctors in a county. The data set has a few variables, such as the observed number of crimes in that county, the poverty level, etc.

Since the problem is just to predict, I assumed that a data-driven method would be necessary.

It turned out that a data-driven method was used, but in combination with logical reasoning about the problem (i.e. which variables might logically interact with one another).

The starting model is a full linear regression model with all the variables included but no interaction terms.
Then AIC, Mallows' Cp, and best subset selection were used, so we get three models, one from each of the three algorithms. The AIC model was picked because it was the most parsimonious, and because it kept all of the indicators for the regions of the US (i.e. east, south, etc.). I suppose the reason for the latter is that the boxplot of doctors by region showed that the number of doctors might differ by region.
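To make the selection step concrete, here is a minimal sketch that scores every additive subset of three predictors by AIC and by Mallows' Cp. It is not the code from the solution: the data are simulated, the variable names `a`, `b`, `c` and the helper `ols_rss` are made up for illustration, and AIC is computed only up to an additive constant (which does not change the ranking).

```python
import itertools
import math
import random

def ols_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

# Simulated data (an assumption for illustration): only a and b affect y.
random.seed(0)
n = 100
data = {v: [random.gauss(0, 1) for _ in range(n)] for v in ("a", "b", "c")}
y = [2 + 3 * data["a"][i] + 2 * data["b"][i] + random.gauss(0, 1) for i in range(n)]

def design(names):
    """Design matrix with an intercept column plus the named predictors."""
    return [[1.0] + [data[v][i] for v in names] for i in range(n)]

full = ("a", "b", "c")
# Error variance estimated from the full additive model, as Cp requires.
sigma2 = ols_rss(design(full), y) / (n - len(full) - 1)

results = {}
for size in range(len(full) + 1):
    for subset in itertools.combinations(full, size):
        p = len(subset) + 1                      # coefficients incl. intercept
        rss = ols_rss(design(subset), y)
        aic = n * math.log(rss / n) + 2 * p      # AIC up to a constant
        cp = rss / sigma2 - n + 2 * p            # Mallows' Cp
        results[subset] = (aic, cp)

best_aic = min(results, key=lambda s: results[s][0])
best_cp = min(results, key=lambda s: results[s][1])
print("best by AIC:", best_aic, "| best by Cp:", best_cp)
```

With only three predictors, scoring every subset is itself a best subset search; the criteria differ only in how they penalise model size, which is one reason different criteria can return different models.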

After the additive model was chosen, an ANOVA F-test was used to compare it with updated models that add interaction terms.

So if the model from AIC was y~a+b+c, the comparisons were Anova(y~a+b+c, y~a+b+c+b:c) and Anova(y~a+b+c, y~a+b+c+a:c), where b:c and a:c are the suspected interactions. These came mostly from logical reasoning (e.g. the income of a county might have a differential effect on the number of doctors across different levels of crime).

So in summary: 1) selection algorithms were applied to an additive model with the variables in the data set; 2) the interaction terms were added to the model from part 1), and an F-test compared each interaction model with the additive model to see whether including the interaction better predicts the number of physicians in a county.
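The comparison described here is a nested-model F-test. A minimal sketch of the statistic follows, on simulated data where only the b:c interaction is real; the data, the names, and the helpers `ols_rss` and `nested_f` are assumptions for illustration, not the solution's code. With residual sums of squares RSS0 (additive) and RSS1 (with interaction), the statistic is F = ((RSS0 - RSS1) / q) / (RSS1 / (n - p1)), where q is the number of added terms and p1 the number of coefficients in the larger model.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def nested_f(X_small, X_big, y):
    """F statistic for comparing a model with a larger model that contains it."""
    rss0, rss1 = ols_rss(X_small, y), ols_rss(X_big, y)
    q = len(X_big[0]) - len(X_small[0])      # number of added terms
    df_resid = len(y) - len(X_big[0])        # residual df of the larger model
    return ((rss0 - rss1) / q) / (rss1 / df_resid)

# Simulated data (an assumption for illustration): a true b:c interaction, no a:c.
random.seed(0)
n = 200
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * a[i] + b[i] + c[i] + 1.5 * b[i] * c[i] + random.gauss(0, 1)
     for i in range(n)]

base = [[1.0, a[i], b[i], c[i]] for i in range(n)]       # y ~ a + b + c
with_bc = [row + [row[2] * row[3]] for row in base]      # ... + b:c
with_ac = [row + [row[1] * row[3]] for row in base]      # ... + a:c

f_bc = nested_f(base, with_bc, y)
f_ac = nested_f(base, with_ac, y)
print("F for adding b:c:", round(f_bc, 2))
print("F for adding a:c:", round(f_ac, 2))
```

Each statistic would be compared against an F distribution with 1 and n - 5 degrees of freedom; at the 5% level the critical value is roughly 3.9 for a sample of this size.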

Now my questions are:

1) Why didn't they just start from the full model with the interaction terms in it and then apply the algorithms? That way, there would be no need to follow up with a model comparison using the F-test.

2) And why use AIC together with the other methods? Why not just use one of them?

Buzz Bloom
Gold Member
Hi Apple:

I recommend the following non-technical book.
Weapons of Math Destruction by Cathy O'Neil (2016)
It is about the dangers of certain frequently used methods for implementing models used to make predictions.

Regards,
Buzz

MarneMath
There are a myriad of reasons why a person may do this. Usually it's because the full model with interactions would contain too many parameters and may obscure the main effects. It could be that backwards selection from the full model is too computationally intensive. It could be that the person only cared to investigate interactions where the main effects were meaningful (i.e. if you started from the full model you might include interactions that are uninteresting, but those interactions basically force you to add those variables' main effects to the model).

• FallenApple
Thanks. That makes sense. So if we know from the beginning that certain interactions are impossible, then we should not include them.

What if we want to investigate what might give rise to interactions? Even here we would not want a starting model that includes all possible interaction pairs. The issue is not that we might discover that no interaction exists for many of the pairs, but that the results would not even be that useful, since we would have started with an overly complex model, which makes the final model invalid. In that case, it would be better to build a separate model for every single interaction, and then apply a stepwise procedure to each model to see whether the interaction is meaningful or not. If some of the models reduce down to something without the interaction term of interest, then there wouldn't be evidence for that interaction, as the final model is a better fit without the interaction term. Is that also valid?

Or do I have it wrong? A better fit only means that a model works better for prediction.
So if I only care about predicting y, and model1: y~a+b is a better fit than model2: y~a+b+a:b, then it is only sound to pick model1 over model2 when all we care about is prediction of y.

If we care about whether there is an interaction or not, then we pick model2, so long as its fit isn't terrible in comparison to model1, because if the fit is terrible, then the inferences would be invalid anyway.

But there's also some overlap between the two, right? That is, if an interaction can help explain, then it should be able to help predict as well. So the situation I came up with of model1 vs model2 would not be common?
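One way to look at the prediction side of this question is to compare out-of-sample error instead of in-sample fit. Below is a minimal sketch using 5-fold cross-validation on simulated data in which the a:b interaction is real; the data, the fold scheme, and the helpers `ols_fit` and `cv_mse` are assumptions for illustration only.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients for y on the columns of X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]

def cv_mse(rows, y, design, k=5):
    """k-fold cross-validated mean squared prediction error."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        beta = ols_fit([design(rows[i]) for i in train], [y[i] for i in train])
        for i in test:
            pred = sum(b * x for b, x in zip(beta, design(rows[i])))
            total += (y[i] - pred) ** 2
    return total / n

# Simulated data (an assumption for illustration): a real a:b interaction.
random.seed(1)
n = 200
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
y = [1 + 2 * a + b + 1.5 * a * b + random.gauss(0, 1) for a, b in rows]

def model1(r):
    return [1.0, r[0], r[1]]                 # y ~ a + b

def model2(r):
    return [1.0, r[0], r[1], r[0] * r[1]]    # y ~ a + b + a:b

mse1 = cv_mse(rows, y, model1)
mse2 = cv_mse(rows, y, model2)
print("CV error, additive model:", round(mse1, 3))
print("CV error, interaction model:", round(mse2, 3))
```

In this simulated case the interaction both explains and predicts, so model2 has the lower held-out error. When the true interaction is weak or absent, the extra term can add estimation noise and the additive model can predict as well or better.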

MarneMath