Would including the true/known y-intercept in my dataset be "overfitting"?

  • Context: Undergrad 
  • Thread starter: fahraynk
SUMMARY

Including a known y-intercept, such as (0, 7), in a dataset for regression analysis is not recommended as it can lead to overfitting. Instead, the correct approach is to subtract the known intercept from all y-values, transforming the regression into one without a constant term, and then add the intercept back after fitting the model. This method avoids the statistical problems associated with fitting a model that artificially constrains the intercept, ensuring that the regression analysis remains valid and interpretable.

PREREQUISITES
  • Understanding of linear regression models and their components
  • Familiarity with statistical concepts such as overfitting and residuals
  • Knowledge of regression analysis techniques, including handling intercepts
  • Basic proficiency in statistical software or programming languages for regression analysis
NEXT STEPS
  • Research the implications of fitting linear models without an intercept parameter
  • Learn about restricted regression techniques and their applications
  • Explore statistical measures affected by model constraints, such as R-squared and residual analysis
  • Watch educational videos on regression analysis, particularly those discussing intercept handling
USEFUL FOR

Data scientists, statisticians, and researchers involved in regression analysis who need to understand the implications of including known intercepts in their models.

  • #31
Below is a simple example to illustrate the pros and cons of the two regression models.
Suppose: The true physics, without any random behavior, is the green (truth) line Y=X, and we know from theory that it goes through (0,0). Two sample data points, ##S_1## and ##S_2##, include some random behavior, which puts them above the truth line and forces the typical regression (blue line, ##regr_1##) to have a nonzero Y-axis intercept at point Int. Suppose the restricted red regression line (##reg_2##, through (0,0)) is calculated because we know theoretically that Y=0 when X=0 if there is no random behavior.

Then: The red line gives better estimates near (0,0) and worse estimates farther away. Also, its slope is worse. But it has one advantage when challenged by skeptical people -- it is correct at the Y-intercept, where a theoretical answer is known. I would prefer to avoid using the blue line as my model when it is undeniably wrong at the Y-intercept. But that depends on how the model is used. If accuracy far from the Y-intercept is more important than accuracy near the Y-intercept, then you may prefer the model that ignores the known theory of the Y-intercept value.
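As a rough numerical illustration of this trade-off, here is a minimal Python sketch. The two sample points are made up for illustration (they are not read off the figure), and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical sample points, both lying above the truth line Y = X.
x = np.array([1.0, 2.0])
y = np.array([1.6, 2.4])

# Unrestricted ("blue") fit: slope and intercept are both free.
slope_free, intercept_free = np.polyfit(x, y, 1)

# Restricted ("red") fit through (0, 0): minimizing sum((m*x - y)**2)
# over m alone gives m = sum(x*y) / sum(x*x).
slope_origin = np.sum(x * y) / np.sum(x * x)

# Compare absolute errors against the truth Y = X at a few X values.
errors = {}
for x0 in (0.0, 1.5, 10.0):
    err_free = abs(slope_free * x0 + intercept_free - x0)
    err_origin = abs(slope_origin * x0 - x0)
    errors[x0] = (err_free, err_origin)
    print(f"X={x0:5.1f}  unrestricted error={err_free:.2f}  "
          f"through-origin error={err_origin:.2f}")
```

With these made-up points the unrestricted fit is Y = 0.8X + 0.8 and the restricted fit is Y = 1.28X, so the restricted line is exact at X=0 but has the larger slope error and the larger error far from the origin. Other sample points could behave differently.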

[Figure: regressionWithoutConstant.png (truth line Y=X in green, unrestricted regression in blue, regression through (0,0) in red)]
  • #32
fahraynk said:
With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?
Yes, but only if you actually measure the intercept. So it should be a real data point treated just like any other real data point.
 
  • #33
FactChecker said:
I would not recommend ignoring the theoretical known value, no matter how it changes the statistical results. I see no reason to prefer accurate statistical measures of a known invalid model. I would prefer modified statistical measures of a known valid model.
That can be done, but it has to be a very explicit and convincing argument. Personally, when I see a statistical model done without an intercept I am instantly highly suspicious. The burden of proof is on the scientist much more stringently, and frankly in some infamous papers where I have seen this done it was done poorly and rendered the conclusion completely unbelievable.

So overall, my opinion is opposite yours. I would rather use a standard and reasonable statistical process and look at the fitted intercept as a check on the quality of the data and the model.
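One way to use the fitted intercept as a check might look like the sketch below. It fits a standard regression and reports how many standard errors the fitted intercept lies from the theoretical value. The data are hypothetical, and the standard-error formula assumes ordinary least squares with independent, equal-variance errors; this is only one possible form of the check.

```python
import numpy as np

KNOWN_INTERCEPT = 7.0  # theoretical value discussed in the thread

# Hypothetical measurements (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.5, 11.0, 13.8, 15.1, 17.9])
n = len(x)

# Standard unrestricted fit.
slope, intercept = np.polyfit(x, y, 1)

# Residual variance with n - 2 degrees of freedom.
residuals = y - (slope * x + intercept)
s2 = np.sum(residuals**2) / (n - 2)

# Standard error of the intercept under the usual OLS assumptions.
sxx = np.sum((x - x.mean())**2)
se_intercept = np.sqrt(s2 * (1.0 / n + x.mean()**2 / sxx))

# Distance of the fitted intercept from theory, in standard errors.
t_stat = (intercept - KNOWN_INTERCEPT) / se_intercept
print(f"fitted intercept = {intercept:.3f} +/- {se_intercept:.3f}")
print(f"distance from theory = {t_stat:.2f} standard errors")
```

A large distance, judged against a t distribution with n - 2 degrees of freedom, would suggest a problem with the data or the model. A small one is consistent with the theory but does not prove it.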
 
  • #34
Dale said:
That can be done, but it has to be a very explicit and convincing argument.
I agree completely with that. The theoretical reason behind the Y-intercept of 7 must be hard, solid science. If there is any doubt, the standard regression intercept should be allowed.
 
  • #35
@fahraynk FYI, both @FactChecker and I also agree on this point:
FactChecker said:
If that is not a measured experimental data point, then it is a theoretical point. If it's a theoretical true value, why only include it once? Why not two or three times? Or 20 times? Pretending that the theoretical intercept is an experimental sample data point is the wrong way to go.
So don’t make a fictitious data point.
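To see why the number of copies is arbitrary, here is a minimal Python sketch that appends the theoretical point (0, 7) to a hypothetical data set k times and refits. The measurements and the choices of k are assumptions made purely for illustration.

```python
import numpy as np

KNOWN_POINT = (0.0, 7.0)  # theoretical intercept, not a measurement

# Hypothetical measurements (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.5, 11.0, 13.8, 15.1, 17.9])

def intercept_with_copies(k):
    """Fitted intercept after appending k copies of the theoretical point."""
    xs = np.concatenate([x, np.full(k, KNOWN_POINT[0])])
    ys = np.concatenate([y, np.full(k, KNOWN_POINT[1])])
    _, intercept = np.polyfit(xs, ys, 1)
    return intercept

copies = [0, 1, 3, 20, 1000]
gaps = [abs(intercept_with_copies(k) - KNOWN_POINT[1]) for k in copies]
for k, gap in zip(copies, gaps):
    print(f"{k:5d} copies -> |fitted intercept - 7| = {gap:.5f}")
```

Each extra copy pulls the fitted intercept closer to 7, so the result depends on a count that has no experimental meaning. In the limit of many copies the fit approaches the regression with the intercept fixed at 7.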
 
  • #36
Dale said:
That can be done, but it has to be a very explicit and convincing argument. Personally, when I see a statistical model done without an intercept I am instantly highly suspicious. The burden of proof is on the scientist much more stringently, and frankly in some infamous papers where I have seen this done it was done poorly and rendered the conclusion completely unbelievable.

So overall, my opinion is opposite yours. I would rather use a standard and reasonable statistical process and look at the fitted intercept as a check on the quality of the data and the model.
I think that I have to concede that your approach is the wiser approach. Forcing the Y-intercept to a particular value would only be appropriate if there is indisputable consensus based on science.
 
  • #37
The problem is stated mathematically as:
$$\varepsilon^2 = \sum_i (m x_i + b - y_i)^2, \qquad b = 7$$
Minimizing the mean square error (##\varepsilon^2##):
$$\frac{d\varepsilon^2}{dm} = 2\sum_i (m x_i + b - y_i)\, x_i = 0$$
Solve the above for m and that’s your fitting equation, y=mx+7. m will be expressed entirely in terms of known quantities with a guaranteed intercept of 7.
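Carrying out that last step, the condition above is linear in m, so solving it gives
$$m = \frac{\sum_i x_i\,(y_i - b)}{\sum_i x_i^2}, \qquad b = 7$$
A minimal Python sketch of this, using hypothetical data and cross-checking against NumPy's least-squares solver applied to the shifted values y - 7 with no constant column:

```python
import numpy as np

B_KNOWN = 7.0  # intercept held fixed, as in the post

# Hypothetical measurements (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.5, 11.0, 13.8, 15.1, 17.9])

def fit_slope_fixed_intercept(x, y, b):
    """Slope m minimizing sum((m*x + b - y)**2) with b held fixed."""
    return np.sum(x * (y - b)) / np.sum(x * x)

m = fit_slope_fixed_intercept(x, y, B_KNOWN)

# Cross-check: regress (y - b) on x without a constant term.
m_lstsq = np.linalg.lstsq(x[:, None], y - B_KNOWN, rcond=None)[0][0]

# The minimization condition should vanish at the solution.
normal_eq = np.sum((m * x + B_KNOWN - y) * x)

print(f"closed-form slope  = {m:.6f}")
print(f"lstsq slope        = {m_lstsq:.6f}")
print(f"condition residual = {normal_eq:.2e}")
```

The two slopes agree because subtracting 7 from every y value turns the problem into a regression through the origin; adding 7 back afterwards recovers y = mx + 7.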
 
