What is the reasoning for this prediction?

In summary, the conversation discusses the use of data-driven methods to predict the number of medical doctors in a county. The starting model is an additive linear regression model, and candidate interaction terms are then screened with F-test comparisons. AIC and related criteria are used to keep the model parsimonious, avoid obscuring main effects, and keep the model interpretable. It is suggested that building separate models for each candidate interaction may be a more effective way to investigate interactions. It is also noted that complex models may be preferred for prediction, while simpler models may be better for interpreting parameters.
  • #1
FallenApple
I am looking at a solution. The problem is to predict the number of medical doctors in a county. The data set has a few variables, such as the observed number of crimes in that county, the poverty level, etc.

Since the problem is just to predict, I assume that a data-driven method is necessary.

It turned out that a data-driven method was used, but in combination with logical reasoning about the problem (i.e., which variable might plausibly interact with another, etc.).
The starting model is a full linear regression model with all the variables included but no interaction terms.
Then AIC, Mallows's Cp, and best subsets were used, giving three models, one from each of the three algorithms. The AIC model was picked because it was the most parsimonious, and because it retained all of the indicators for the regions of the US (i.e., east, south, etc.). I suppose the reason for this is that a boxplot of doctors by region showed that the number of doctors might differ by region.
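For concreteness, here is a minimal sketch in R of what that selection step might look like. The variable names (doctors, crimes, poverty, income, region) and the data frame df are placeholders I am assuming, not the actual data set:

library(MASS)    # stepAIC: AIC-based stepwise search
library(leaps)   # regsubsets: best subsets, reports Mallows's Cp

# full additive model, no interactions
full_additive <- lm(doctors ~ crimes + poverty + income + region, data = df)

# AIC-based stepwise selection starting from the additive model
m_aic <- stepAIC(full_additive, direction = "both", trace = FALSE)

# best subsets; the summary includes Cp for each subset size
subs <- regsubsets(doctors ~ crimes + poverty + income + region, data = df)
summary(subs)$cp

Each criterion can pick a different subset of variables, which is how the procedure above ends up with three candidate models.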

Then, after the additive model was chosen, an ANOVA F-test was used to compare it against the updated interaction models.

So if the model from AIC was y ~ a + b + c, the comparisons were anova(y ~ a + b + c, y ~ a + b + c + b:c) and anova(y ~ a + b + c, y ~ a + b + c + a:c), where those are the suspected interactions, chosen mostly from logical reasoning (i.e., the income of a county might have a differential effect on the number of doctors across different levels of crime, etc.).

So in summary: 1) selection algorithms were applied to an additive model with the variables in the data set; 2) interaction terms were added to the model from step 1, and an F-test was used to compare model 2 with model 1, to see whether including an interaction can better predict the number of physicians in a county. (A sketch of the comparison is below.)
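In R, that nested comparison is done with base anova() on two fitted models (lower-case; the car package's Anova() tests terms within a single fit rather than comparing two fits). A sketch, with a, b, c standing in for the selected predictors:

m1 <- lm(y ~ a + b + c,       data = df)
m2 <- lm(y ~ a + b + c + b:c, data = df)
m3 <- lm(y ~ a + b + c + a:c, data = df)

anova(m1, m2)  # F-test: does adding b:c significantly improve the fit?
anova(m1, m3)  # F-test: does adding a:c significantly improve the fit?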

Now my questions are:

1) Why didn't they just use the full model with the interaction terms in it to begin with, and then apply the algorithms? That way, there would be no need to follow up with an F-test model comparison. 2) And why AIC and the other methods together? Why not just use one of them?
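For reference, starting from the full model with all pairwise interactions, as question 1 proposes, is a one-liner in R (again with df as a placeholder):

# y ~ .^2 expands to all main effects plus every pairwise interaction
full_int <- lm(y ~ .^2, data = df)
m_sel    <- MASS::stepAIC(full_int, direction = "both", trace = FALSE)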
 
  • #2
Hi Apple:

I recommend the following non-technical book:
Weapons of Math Destruction by Cathy O'Neil (2016)
It is about the dangers of certain frequently used methods for implementing models used to make predictions.

Regards,
Buzz
 
  • #3
There are myriad reasons why a person may do this. Usually it's because the full model would contain too many parameters and may obscure the main effects. It could be that backwards selection from the full model is too computationally intensive. It could be that the person only cared to investigate interactions where the main effects were meaningful. (I.e., if you started from the full model, you might include interactions that are uninteresting, but those interactions essentially force you to add those variables' main effects into the model.)
 
  • #4
MarneMath said:
There are myriad reasons why a person may do this. Usually it's because the full model would contain too many parameters and may obscure the main effects. It could be that backwards selection from the full model is too computationally intensive. It could be that the person only cared to investigate interactions where the main effects were meaningful. (I.e., if you started from the full model, you might include interactions that are uninteresting, but those interactions essentially force you to add those variables' main effects into the model.)

Thanks. That makes sense. So if there are interactions that we know from the beginning are impossible, then we should not include them.

What if we want to investigate what might give rise to interactions? Even here we would not want a starting model that includes all possible interaction pairs. It's not just that we might discover that no interaction exists for many of the pairs, but that the results would not even be that useful, since we would have started with an overly complex model, which makes the final model invalid. In that case, it would be better to build a separate model for every single interaction, and then apply a stepwise procedure to each model to see whether the interaction is meaningful or not. If some of the models reduce down to something without the interaction term of interest, then there wouldn't be evidence for interaction, since the final model fits better without the interaction term. Is that also valid? (A sketch of this idea follows below.)
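A rough sketch in R of that "one model per candidate interaction" idea, with placeholder predictor names:

m_base <- lm(doctors ~ crimes + poverty + income, data = df)

for (int in c("crimes:poverty", "crimes:income", "poverty:income")) {
  f_int <- as.formula(paste("doctors ~ crimes + poverty + income +", int))
  m_int <- lm(f_int, data = df)
  print(anova(m_base, m_int))  # F-test for that single interaction term
}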

Or do I have it wrong? A better fit only means the model works better for prediction.
So if I only care about predicting y, and model 1, y ~ a + b, is a better fit than model 2, y ~ a + b + a:b, then it is only sound to pick model 1 over model 2 if all we care about is predicting y.

If we care about whether there is an interaction or not, then we pick model 2, so long as its fit isn't terrible in comparison to model 1, because if the fit is terrible, then the inferences would be invalid anyway.

But there's also some overlap between the two, right? That is, if an interaction can help explain, then it should be able to help predict as well. So the situation I came up with of model 1 vs. model 2 would not be common?
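If prediction really is the goal, the fit comparison should arguably be made out of sample rather than in sample; this holdout idea is my own addition, not something from the solution being discussed. A minimal sketch in R (all names are placeholders):

set.seed(1)
idx   <- sample(nrow(df), floor(0.8 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

m1 <- lm(y ~ a + b,       data = train)
m2 <- lm(y ~ a + b + a:b, data = train)

# root-mean-squared prediction error on the held-out counties
rmse <- function(m) sqrt(mean((test$y - predict(m, test))^2))
c(additive = rmse(m1), interaction = rmse(m2))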
 
  • #5
First, the stepwise process of building a model is generally not used in practice. Secondly, investigating what gives rise to an interaction cannot really be done via a data-driven approach. For example, if you were looking at the amount of iron released by different foods cooked in different types of pots, you'd notice that tomatoes and cast-iron skillets have an interaction. Nothing in the data will tell you why they interact, just that they do. To figure out why this interaction occurs, you need an experiment and a hypothesis to test.

Actually, as regards model prediction, if model 1 and model 2 are valid models but selected by different criteria, then it's generally preferred to go with the complex model. However, if you need to interpret the parameters of the model, it's generally better to go with the simple one. Furthermore, in the case you provided, if an interaction exists, then it implies that a + b would not be a sufficient model even for interpretation. Geometrically, the interaction model represents two lines with different slopes and intercepts, while your current additive model forces the two lines to have the same slope.
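To make the geometry concrete: with a continuous predictor a and a binary b, the interaction model y = b0 + b1*a + b2*b + b3*(a*b) splits into

b = 0:  y = b0 + b1*a
b = 1:  y = (b0 + b2) + (b1 + b3)*a

so the two groups' slopes differ by b3; setting b3 = 0 (the additive model) forces both groups onto the same slope b1, with only the intercepts differing.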

The thing is, it doesn't matter whether we care if there is an interaction or not: if there is a significant interaction, we should include it in our model. First, because it'll probably help our prediction, and secondly, because if we attempt to interpret the parameters without it, we could be wrong. For example, if an interaction exists in an experiment, then the main effects need to be analyzed at the different levels of the factors. (A sketch of one way to do this is below.)
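One common way to do that in R is to estimate the simple slopes of one predictor within each level of a factor. The emmeans package here is my assumption for illustration, not something used in the thread:

library(emmeans)

m2 <- lm(y ~ a * b, data = df)   # b a factor; a * b = a + b + a:b
emtrends(m2, ~ b, var = "a")     # estimated slope of a within each level of b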
 

