# Prediction vs Explanation


#### FallenApple

OK, so these are two different goals. But mathematically, I don't see how one can explain well without also being able to predict well. After all, regression is about function estimation regardless of the goal. If we can infer well using the coefficients, then we should be able to predict well also. Say our regression is done well, in accordance with sound scientific and statistical procedure, to answer a question. If we then get new data and can't make good predictions, how good were our original regression estimates? So I can't see how they don't go hand in hand.

## Answers and Replies

I think that good prediction must follow from good explanation, but not the other way around.

If we have an intuitive, satisfying explanation of why something happens, which is consistent with all our other accepted theories, but the predictions that follow from that explanation are not borne out by observation, there is probably something wrong with the explanation. Indeed, it is sometimes failures of prediction for well-understood and highly trusted theories that lead to their replacement by more sophisticated theories with a wider scope of application.

But good prediction does not imply that we have a good explanation. It is much more satisfying when one has an explanation of a relationship, but there are many observed relationships that are widely used, for which we have no explanations. For example there are plenty of approved drugs that have been repeatedly shown to be effective in treating certain ailments, and which are widely prescribed, but for which no mechanism is known about how and why they work.

When building statistical models, the usual approach is to set the bar lower for inclusion of a factor that is called 'intuitive', i.e. one for which we can imagine a reason why it would affect the output in the way it has been observed to do. We might, for instance, require a lower confidence level or a smaller improvement in model score for an intuitive factor than for an unintuitive one. But sometimes the statistical evidence for inclusion of a factor is just too strong, even though we are unable to imagine a reason why it should affect the output in the way it has been observed to do.

One reason unintuitive factors make their way into models is the existence of interaction effects between factors. When a model has several factors there can be many levels of interaction, and the number of possible interactions explodes combinatorially, with many of them hard to conceptualise. But as long as we set a high enough requirement of impact before including an unintuitive factor, it would be counter-productive to rule such factors out.
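To put rough numbers on that combinatorial growth, here is a minimal sketch. It assumes an interaction is simply any subset of two or more factors, which is one common way to count them; the function names are made up for illustration.

```python
from math import comb

def interaction_counts(p):
    """Number of k-way interactions among p factors, for each order k >= 2."""
    return {k: comb(p, k) for k in range(2, p + 1)}

def total_interactions(p):
    # All subsets of the p factors, minus the empty set and the p main effects.
    return 2 ** p - p - 1

for p in (3, 5, 10, 20):
    print(f"{p} factors: {total_interactions(p)} possible interactions")
```

Under this counting, ten factors already allow more than a thousand candidate interaction terms, so screening them all by intuition is not realistic.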

FallenApple

That makes sense. Another example in science is that Newtonian physics predicts well, but is wrong. But it is a limiting case of General Relativity, so the false theory of Newtonian Physics is still at least directly related to the truth.

If certain medications work well the vast majority of the time, then it likely isn't a coincidence.

Is the mathematical/statistical reason for setting the bar lower for "intuitive" variables that, even if a variable isn't significant by itself, it could be once it is included in the model? Is it because, when confounders are left out, the model's error term becomes correlated with the inputs? That error term wouldn't be irreducible, so it would be absorbing some of the confounders' influence?
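One way to read this question is as omitted-variable bias. The sketch below is a simulation under assumed values, not anything taken from the thread: a hypothetical confounder `z` drives both the input `x` and the output `y`, and the true coefficient of `x` is set to 1.0. Leaving `z` out pushes its influence into the error term, which is then correlated with `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process (all coefficients are illustrative).
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # input, partly driven by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # output; true effect of x is 1.0

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_short = ols([x], y)     # confounder omitted
beta_full = ols([x, z], y)   # confounder included

# The error term of the short model is 2*z + noise, which is correlated with x.
omitted_error = y - 1.0 * x
corr_x_error = np.corrcoef(x, omitted_error)[0, 1]

print("coefficient of x, z omitted: ", beta_short[1])
print("coefficient of x, z included:", beta_full[1])
print("corr(x, omitted-model error):", corr_x_error)
```

With the confounder omitted, the estimated coefficient of `x` lands well above 1.0; including `z` brings it back close to the assumed true value.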

So the unintuitive factor has a higher bar because, given that the theory is true, it is unlikely to be a confounder, and adding it will probably just complicate interpretation because of the combinatorial issue you noted. Also, I think it would be pretty bad from a predictive standpoint as well, right? It increases the dimensionality of the input space and will result in higher variance of outcomes after validation.

But wouldn't this result in a tradeoff? Sometimes there are many confounders, so including them is necessary for a good explanation and a better model fit (lower RSS), but doing so will increase the variance of the predicted outcome on a validation set, if we were to obtain one.
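The variance side of that tradeoff can be sketched with a small simulation. As a simplifying assumption, the extra predictors here are pure noise rather than real confounders, so the sketch isolates only the cost of estimating more coefficients from a small training set; sample sizes and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_and_score(n_extra, n_train=40, n_val=1000):
    """Fit OLS with one real predictor plus n_extra pure-noise predictors."""
    def make(n):
        X = rng.normal(size=(n, 1 + n_extra))
        y = 2.0 * X[:, 0] + rng.normal(size=n)   # only column 0 matters
        return np.column_stack([np.ones(n), X]), y

    X_tr, y_tr = make(n_train)
    X_va, y_va = make(n_val)
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    train_mse = np.mean((y_tr - X_tr @ beta) ** 2)
    val_mse = np.mean((y_va - X_va @ beta) ** 2)
    return train_mse, val_mse

def average_scores(n_extra, reps=200):
    """Average (train MSE, validation MSE) over repeated simulated datasets."""
    scores = np.array([fit_and_score(n_extra) for _ in range(reps)])
    return scores.mean(axis=0)

results = {k: average_scores(k) for k in (0, 10, 30)}
for k, (train_mse, val_mse) in results.items():
    print(f"{k:2d} extra predictors: train MSE {train_mse:.2f}, validation MSE {val_mse:.2f}")
```

In this setup the training error keeps falling as predictors are added while the validation error rises. A real confounder would also remove bias, which is the other side of the tradeoff.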

> If we can infer well using the coefficients, then we should be able to predict well also.

This is not true in general. Suppose you have five noisy regression points. You can explain them perfectly with a fourth-order polynomial, but the resulting predictions can be much worse than those of a linear fit, which explains less. Prediction and explanation are substantially different things.
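The five-point example can be tried numerically. This is a minimal sketch under assumptions of my own choosing: the true relationship is taken to be linear, `y = 2x + 1`, with unit-variance Gaussian noise, and prediction is tested at a sixth point just beyond the training range.

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.arange(5.0)   # five training locations: 0, 1, 2, 3, 4
x_new = 5.0                # location of a sixth, held-out point

def true_f(x):
    # Assumed underlying relationship (illustrative only).
    return 2.0 * x + 1.0

def one_trial():
    """Fit degree-1 and degree-4 polynomials to five noisy points."""
    y_train = true_f(x_train) + rng.normal(scale=1.0, size=5)
    y_new = true_f(x_new) + rng.normal(scale=1.0)
    out = {}
    for deg in (1, 4):
        coeffs = np.polyfit(x_train, y_train, deg)
        train_rss = np.sum((np.polyval(coeffs, x_train) - y_train) ** 2)
        pred_err = (np.polyval(coeffs, x_new) - y_new) ** 2
        out[deg] = (train_rss, pred_err)
    return out

trials = [one_trial() for _ in range(500)]
mean_train_rss = {d: np.mean([t[d][0] for t in trials]) for d in (1, 4)}
mean_pred_err = {d: np.mean([t[d][1] for t in trials]) for d in (1, 4)}

for d in (1, 4):
    print(f"degree {d}: mean training RSS {mean_train_rss[d]:.3g}, "
          f"mean squared prediction error {mean_pred_err[d]:.3g}")
```

The quartic passes through all five training points, so its training RSS is essentially zero, yet its error at the sixth point is far larger than the linear fit's because it is fitting the noise.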

> Another example in science is that Newtonian physics predicts well, but is wrong. But it is a limiting case of General Relativity, so the false theory of Newtonian Physics is still at least directly related to the truth.

I disagree completely with this. Newtonian physics is verified in its domain of applicability, as centuries of experimental outcomes confirm. You should read Asimov's "The Relativity of Wrong" and the recent Insights article about classical mechanics.

FallenApple

That makes sense. A fourth-order polynomial has a lot of curvature and can curve away quickly before reaching the x location of a sixth point used for testing prediction, depending on where that point appears.

I just read the Insights article. Yes, Newtonian mechanics being the limiting case of GR implies that it is a subset of the modern GR theory, and hence it is necessarily as correct as GR itself within that domain.

Dale