Would including the true/known y-intercept in my dataset be "overfitting"?

  • Context: Undergrad 
  • Thread starter: fahraynk

Discussion Overview

The discussion revolves around the implications of including a known y-intercept in a dataset for regression analysis. Participants explore whether this practice constitutes overfitting and discuss various approaches to fitting models with a known intercept, including statistical justifications and potential pitfalls.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant questions if including the known y-intercept (e.g., at X=0, Y=7) in the dataset leads to overfitting, suggesting it may be "cheating."
  • Another participant proposes adjusting the data by subtracting the known intercept from all Y values and fitting a regression model without a constant term, arguing this method does not lead to overfitting.
  • A different perspective emphasizes the need for a theoretical justification when fitting a model without an intercept, noting that standard statistical measures may not apply.
  • Some participants express confusion about the concept of overfitting in this context, with one suggesting that reducing free parameters should not be considered overfitting.
  • One participant shares a personal experience of needing to add multiple fake observations to achieve a desired fit, questioning the validity of this approach.
  • Another participant discusses the implications of knowing the intercept based on experimental data and considers whether it is better to include this information directly or manipulate the dataset in other ways.
  • Several participants seek clarification on the differences between various methods of handling the known intercept, including adding it to the dataset, subtracting it from other values, or applying restrictions in the regression model.
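The "subtract the known intercept and fit without a constant" approach from the second bullet can be sketched as follows. This is a minimal illustration, not anyone's actual analysis: the `x` and `y` values are made-up stand-ins, and the known intercept of 7 is the value from the thread.

```python
import numpy as np

# Hypothetical data; x, y are illustrative stand-ins, and b0 = 7
# is the known y-intercept discussed in the thread.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.1, 11.2, 12.8, 15.3, 16.9])
b0 = 7.0

# Subtract the known intercept from Y, then fit a model with no
# constant term: the design matrix is just the x column.
A = x[:, np.newaxis]
m, residuals, rank, sv = np.linalg.lstsq(A, y - b0, rcond=None)

# The resulting model is y = m*x + b0, which hits (0, b0) exactly.
print(m[0])
```

Because no constant term is estimated, the usual intercept-based statistics (e.g. the standard ##R^2##) do not apply in their ordinary form, which is one of the caveats raised in the thread.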

Areas of Agreement / Disagreement

Participants express differing views on whether including a known intercept constitutes overfitting, with no consensus reached. Some advocate for including the intercept in the model, while others caution against it due to potential statistical issues.

Contextual Notes

Participants highlight the complexity of fitting models with known parameters and the potential for statistical measures to be misinterpreted when certain assumptions are made. The discussion reflects a variety of approaches and concerns regarding the validity of different methodologies.

  • #31
Below is a simple example to illustrate the pros and cons of the two regression models.
Suppose the true physics without any random behavior is the green (truth) line Y=X, and suppose we know from theory that it goes through (0,0). Two sample data points, ##S_1## and ##S_2##, include some random behavior that puts them above the truth line and forces the typical regression (blue line, ##regr_1##) to have a nonzero Y-axis intercept at the point Int. Suppose the restricted red regression line (##regr_2##, through (0,0)) is calculated because we know theoretically that Y=0 when X=0 if there is no random behavior.

The red line gives better estimates near (0,0) and worse estimates farther away; its slope is also worse. But it has one advantage when challenged by skeptical people: it is correct at the Y-intercept, where the theoretical answer is known. I would prefer to avoid using the blue line as my model when it is undeniably wrong at the Y-intercept. But that depends on how the model is used. If accuracy far from the Y-intercept matters more than accuracy near it, you may prefer the model that ignores the known theoretical Y-intercept value.
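The two fits described above can be reproduced in a few lines. This is a sketch with assumed data: two points deliberately placed above the truth line y = x, mirroring the figure.

```python
import numpy as np

# Illustrative data (an assumption, mirroring the figure): the truth
# is y = x through (0,0), and two noisy samples sit above that line.
x = np.array([1.0, 2.0])
y = np.array([1.4, 2.3])  # both points above y = x

# regr_1: ordinary least squares with an intercept (blue line).
# np.polyfit with degree 1 returns [slope, intercept].
m1, b1 = np.polyfit(x, y, 1)

# regr_2: least squares restricted through the origin (red line):
# minimize sum((m*x_i - y_i)^2)  =>  m = sum(x*y) / sum(x^2)
m2 = np.sum(x * y) / np.sum(x * x)

print(m1, b1)  # unrestricted fit picks up a spurious positive intercept
print(m2)      # restricted fit is exact at (0,0) but its slope differs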

[Attachment: regressionWithoutConstant.png — plot of the truth line Y=X, the unrestricted regression ##regr_1##, and the restricted regression ##regr_2## through (0,0)]
  • #32
fahraynk said:
With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?
Yes, but only if you actually measure the intercept. So it should be a real data point treated just like any other real data point.
 
  • #33
FactChecker said:
I would not recommend ignoring the theoretical known value, no matter how it changes the statistical results. I see no reason to prefer accurate statistical measures of a known invalid model. I would prefer modified statistical measures of a known valid model.
That can be done, but it has to be a very explicit and convincing argument. Personally, when I see a statistical model fit without an intercept I am instantly highly suspicious. The burden of proof on the scientist is much more stringent, and frankly, in some infamous papers where I have seen this done, it was done poorly and rendered the conclusions completely unbelievable.

So overall, my opinion is the opposite of yours. I would rather use a standard, reasonable statistical process and treat the fitted intercept as a check on the quality of the data and the model.
 
  • #34
Dale said:
That can be done, but it has to be a very explicit and convincing argument.
I agree completely with that. The theoretical reason behind the Y-intercept of 7 must be hard, solid science. If there is any doubt, the standard regression intercept should be allowed.
 
  • #35
@fahraynk FYI, both @FactChecker and I agree on this point as well.
FactChecker said:
If that is not a measured experimental data point, then it is a theoretical point. If it's a theoretical true value, why only include it once? Why not two or three times? Or 20 times? Pretending that the theoretical intercept is an experimental sample data point is the wrong way to go.
So don’t make a fictitious data point.
 
  • #36
Dale said:
That can be done, but it has to be a very explicit and convincing argument. Personally, when I see a statistical model fit without an intercept I am instantly highly suspicious. The burden of proof on the scientist is much more stringent, and frankly, in some infamous papers where I have seen this done, it was done poorly and rendered the conclusions completely unbelievable.

So overall, my opinion is the opposite of yours. I would rather use a standard, reasonable statistical process and treat the fitted intercept as a check on the quality of the data and the model.
I think I have to concede that your approach is the wiser one. Forcing the Y-intercept to a particular value would only be appropriate if there is an indisputable scientific consensus behind it.
 
  • #37
The problem is stated mathematically as:
$$\epsilon^2 = \sum_i (m x_i + b - y_i)^2, \qquad b = 7$$
Minimizing the mean square error ##\epsilon^2## with respect to ##m##:
$$\frac{d\epsilon^2}{dm} = 0 = \sum_i (m x_i + b - y_i)\, x_i$$
Solving for ##m## gives
$$m = \frac{\sum_i (y_i - b)\, x_i}{\sum_i x_i^2},$$
and the fitting equation is ##y = m x + 7##. Here ##m## is expressed entirely in terms of known quantities, with a guaranteed intercept of 7.
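The closed-form slope above is a one-liner in code. This is a minimal sketch with assumed data; only the fixed intercept b = 7 comes from the thread.

```python
import numpy as np

# Direct implementation of the closed-form solution above.
# The x and y values are illustrative assumptions; b = 7 is
# the known intercept from the thread.
b = 7.0
x = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([8.2, 10.1, 12.4, 14.0])

# Setting d(eps^2)/dm = 0 gives  m = sum((y_i - b) * x_i) / sum(x_i^2)
m = np.sum((y - b) * x) / np.sum(x * x)

# Fitted line with guaranteed intercept: y = m*x + 7
print(m)
```

Note that this is equivalent to shifting the data by the known intercept and regressing through the origin; the two formulations give the same slope.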
 
