# A What is the reasoning for this prediction?

1. May 1, 2017

### FallenApple

I am looking at a solution. The problem is to predict the number of medical doctors in a county. The data set has a few variables, such as the observed number of crimes in that county, the poverty level, etc.

Since the goal is just to predict, I assume that a data-driven method is necessary.

It turned out that a data-driven method was used, but in combination with some logical reasoning about the problem (i.e., which variables might logically interact with one another, etc.).

The starting model is a full linear regression model with all the variables included but with no interaction terms.
Then AIC, Mallows' Cp, and best subset selection were used, so we get 3 models, one from each of the three algorithms. The AIC model was picked because it was the most parsimonious, and because it had all of the indicators for the regions of the US (i.e., east, south, etc.). I suppose the reason for this is that the boxplot of doctors by region showed that the number of doctors might differ by region.
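To make the AIC step concrete, here is a minimal sketch of scoring every additive subset by AIC. It is not the solution's actual code: the data are synthetic, the variable names (`income`, `crime`, `poverty`) and the helper `ols_rss` are made up for illustration, and the AIC is the Gaussian form n·ln(RSS/n) + 2k, which matches what statistical packages report only up to an additive constant.

```python
import itertools
import math
import random

def ols_rss(X, y):
    """Residual sum of squares from a least-squares fit (normal equations)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

random.seed(0)
n = 200
# Hypothetical, synthetic predictors (not the real county data)
data = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ("income", "crime", "poverty")}
# Assumed truth for the demo: only income and crime matter
y = [2.0 + 1.5 * data["income"][i] - 1.0 * data["crime"][i] + random.gauss(0, 1)
     for i in range(n)]

def aic(names):
    """Gaussian AIC (up to a constant) for the additive model with these predictors."""
    X = [[1.0] + [data[v][i] for v in names] for i in range(n)]
    k = len(names) + 1  # regression coefficients, including the intercept
    return n * math.log(ols_rss(X, y) / n) + 2 * k

# Score every subset of predictors (intercept-only model included)
scores = {names: aic(names)
          for size in range(len(data) + 1)
          for names in itertools.combinations(data, size)}
best = min(scores, key=scores.get)
print("lowest-AIC additive model:", best)
```

With only three candidate predictors an exhaustive search is cheap; with many predictors this is where stepwise or best-subset shortcuts come in.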

Then, after the additive model was chosen, an Anova F test was used to compare it with the updated interaction models.

So if the model from AIC was y~a+b+c, the comparisons were Anova(y~a+b+c, y~a+b+c+b:c) and Anova(y~a+b+c, y~a+b+c+a:c), where b:c and a:c are the suspected interactions, chosen mostly by logical reasoning (i.e., the income of a county might have a differential effect on the number of doctors across different levels of crime, etc.).
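For readers who want to see what such a nested-model comparison computes, here is a rough sketch of the F statistic for y~a+b+c versus y~a+b+c+b:c. Everything in it is an assumption for illustration: the data are simulated with a built-in b:c interaction, and `ols_rss` and `nested_f` are hypothetical helper names, not functions from any library.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares from a least-squares fit (normal equations)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

random.seed(1)
n = 150
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
c = [random.gauss(0, 1) for _ in range(n)]
# Simulated response with a genuine b:c interaction (assumed for the demo)
y = [1.0 + a[i] + b[i] + c[i] + 0.8 * b[i] * c[i] + random.gauss(0, 1)
     for i in range(n)]

X_add = [[1.0, a[i], b[i], c[i]] for i in range(n)]          # y ~ a + b + c
X_int = [row + [b[i] * c[i]] for i, row in enumerate(X_add)]  # ... + b:c

def nested_f(X_small, X_big, y):
    """F statistic comparing a model with a larger model that contains it."""
    rss_small, rss_big = ols_rss(X_small, y), ols_rss(X_big, y)
    q = len(X_big[0]) - len(X_small[0])   # number of extra terms
    df_resid = len(y) - len(X_big[0])     # residual df of the larger model
    f_stat = ((rss_small - rss_big) / q) / (rss_big / df_resid)
    return f_stat, q, df_resid

f_stat, q, df_resid = nested_f(X_add, X_int, y)
print(f"F = {f_stat:.2f} on ({q}, {df_resid}) degrees of freedom")
```

The p-value would come from the F distribution with (q, df_resid) degrees of freedom, which a statistics package supplies; it is left out here to keep the sketch dependency-free.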

So in summary: 1) the algorithms were applied to an additive model with the variables in the data set. 2) The model from part 1) was taken and the interaction terms were added on. Then an F test was done to compare model2 with model1 to see if including an interaction can better predict the number of physicians in a county.

Now my question is,

1) Why didn't they just use the full model with the interaction terms in it to begin with and then apply the algorithms? That way, there is no need to follow up with a model comparison using the F-test.

2) And why AIC and the other methods? Why not just use one of them?

2. May 1, 2017

### Buzz Bloom

Hi Apple:

I recommend the following non-technical book.
Weapons of Math Destruction by Cathy O'Neil (2016)
It is about the dangers of certain frequently used methods for implementing models used to make predictions.

Regards,
Buzz

3. May 1, 2017

### MarneMath

There are a myriad of reasons why a person may do this. Usually it's because the full model would contain too many parameters and may obscure the main effects. It could be that backward selection from the full model is too computationally intensive. It could be that the person only cared to investigate interactions where the main effects were meaningful (i.e., if you started from the full model, you might include interactions that are uninteresting, but those interactions basically force you to keep those variables' main effects in the model).

4. May 1, 2017

### FallenApple

Thanks. That makes sense. So if there are interactions that we know from the beginning are impossible, then we should not include them.

What if we want to investigate what might give rise to interactions? Even here, we would not want a starting model that includes all possible interaction pairs. It's not that we might discover that no interaction exists for many of the pairs, but that the results would not even be that useful, since we would have started with an overly complex model, which makes the final model invalid. In that case, it would be better to build a separate model for every single interaction, and then apply a stepwise procedure to each model to see whether the interaction is meaningful or not. If some of the models reduce down to something without the interaction term of interest, then there wouldn't be evidence for that interaction, as the final model is a better fit without the interaction term. Is that also valid?

Or do I have it wrong? A better fit only means that the model works better for prediction.
So if I only care about predicting y, and model1: y~a+b is a better fit than model2: y~a+b+a*b, then it is sound to pick model1 over model2 only if all we care about is prediction of y.

If we care about whether there is interaction or not, then we pick model2, so long as the fit isn't terrible in comparison to model1, because if the fit is terrible, then the inferences would be invalid anyways.

But there's also some overlap between the two, right? That is, if an interaction can help explain, then it should be able to help predict as well. So the situation I came up with of model1 vs model2 would not be common?
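One way to probe the "explain vs. predict" overlap is to compare the two models on data they were not fitted to. The sketch below uses a simple train/test split, which is a technique swapped in here and not something the solution under discussion used. The data are simulated with a real a:b interaction, and `ols_fit`, `design`, `mse`, and `simulate` are hypothetical helper names.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients via the normal equations."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def design(a, b, interaction):
    """Design matrix for y~a+b, optionally with the a:b product column."""
    return [[1.0, ai, bi] + ([ai * bi] if interaction else [])
            for ai, bi in zip(a, b)]

def mse(X, y, beta):
    """Mean squared prediction error of coefficients beta on (X, y)."""
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y)) / len(y)

def simulate(n):
    """Synthetic data with a genuine a:b interaction (assumed for the demo)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    y = [1.0 + ai + bi + 0.8 * ai * bi + random.gauss(0, 1)
         for ai, bi in zip(a, b)]
    return a, b, y

random.seed(2)
a_tr, b_tr, y_tr = simulate(200)   # data used for fitting
a_te, b_te, y_te = simulate(200)   # held-out data used only for scoring

results = {}
for label, inter in (("y~a+b", False), ("y~a+b+a:b", True)):
    beta = ols_fit(design(a_tr, b_tr, inter), y_tr)
    results[label] = mse(design(a_te, b_te, inter), y_te, beta)
print(results)
```

In this simulated setting the interaction is real, so the model that includes it also predicts better out of sample; if the 0.8 coefficient were set to zero, the two errors would be expected to come out close.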

Last edited: May 1, 2017
5. May 2, 2017

### MarneMath

First, a stepwise process of building a model is generally not used in practice. Second, investigating what gives rise to an interaction cannot really be done via a data-driven approach. For example, if you were looking at the amount of iron released by different foods cooked in different types of pots, you'd notice that tomatoes and cast iron skillets have an interaction. Nothing in the data will tell you why they interact, just that they do. To figure out why this interaction occurs, you need an experiment and a hypothesis to test.

Actually, as regards model prediction, if model1 and model2 are valid models but selected by different criteria, then it's generally preferred to go with the complex model. However, if you need to interpret the parameters of the model, it's generally better to go with the simple one. Furthermore, in the case you provided, if an interaction exists, then it implies that a+b would not be a sufficient model even for interpretation. Geometrically, the interaction model represents two lines with different slopes and intercepts, while your current additive model forces the lines to have the same slope.
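To spell out the geometric point, here is a short worked form, assuming for illustration that b is a 0/1 indicator (the β symbols are notation introduced here, not taken from the original solution):

$$
\begin{aligned}
y &= \beta_0 + \beta_1 a + \beta_2 b + \beta_3\, a b \\
b = 0:\quad y &= \beta_0 + \beta_1 a \\
b = 1:\quad y &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, a
\end{aligned}
$$

Setting β3 = 0 recovers y~a+b, where the two lines differ only in intercept and are therefore parallel. A nonzero β3 lets the slope in a change with b, which is what an interaction means.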

The thing is, it doesn't matter whether we care if there is interaction or not: if there is significant interaction, we should include it in our model. First, because it'll probably help our prediction, and second, because if we attempt to interpret the parameters without it we could be wrong. For example, if an interaction exists in an experiment, then the main effects need to be analyzed at different levels of the factors.