Would including the true/known y-intercept in my dataset be "overfitting"?

fahraynk · May 4, 2018

Suppose I have a set of data points that I want to fit, and suppose I know the data's true y-intercept; for example, I know that at X=0, Y=7. If I include this point in my data, is it overfitting? Since the model needs to find the intercept, it has no choice but to choose x=0 and y=7. Is it cheating if I happen to know that point is the true intercept and include it in my data to be fit?

If so, what should I do in this case?

FactChecker · May 4, 2018

You can adjust your data for a (0,0) intercept by subtracting 7 from all Y values and fit a regression model with no constant term: ##Y = a\cdot X##. Then add the 7 back into your model: ##Y = a\cdot X + 7##. This is not overfitting.
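A minimal sketch of that recipe in Python with NumPy. The data and the helper name `fit_slope_with_known_intercept` are made up for illustration; the closed-form slope is the ordinary least-squares solution for a model with no constant term.

```python
import numpy as np

def fit_slope_with_known_intercept(x, y, b0):
    """Least-squares slope of y = a*x + b0 with the intercept b0 held fixed."""
    x = np.asarray(x, dtype=float)
    y_shifted = np.asarray(y, dtype=float) - b0   # move the known intercept to (0, 0)
    return float(x @ y_shifted / (x @ x))         # minimises sum((y_shifted - a*x)**2)

# Illustrative data only (not from the thread): roughly y = 2x + 7 with small noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.1, 10.9, 13.2, 14.8, 17.1])
a = fit_slope_with_known_intercept(x, y, b0=7.0)
print(f"Y = {a:.3f} * X + 7")
```

The measured point at X=0, if there is one, contributes nothing to the slope here because its X value is zero.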

fahraynk · May 4, 2018

FactChecker said:

You can adjust your data for a (0,0) intercept by subtracting 7 from all Y values and fit a regression model with no constant term: ##Y = a\cdot X##. Then add the 7 back into your model: ##Y = a\cdot X + 7##. This is not overfitting.

So to summarize: do not include this data point in the fit, subtract it from all other Y values, and then add it back after the fit.
Why does it increase overfitting if I include the known data point in the fit? I am really curious.
Do you know something I can read or a term I can google which will explain it to me in more detail?

Dale · May 4, 2018

You need to have a very firm theoretical justification for doing this. Fitting a linear model without an intercept parameter is a very big statistical problem. If you do it, the usual statistical measures don’t mean what they usually do. For example, the ##R^2## is no longer the proportion of the variance explained, your residuals will not have 0 mean, all of your other parameters will be biased, etc.

If your data is truly guaranteed to have a specific intercept then simply fit a model with an intercept parameter as always. If you are right then the parameter will not be significantly different from the theoretical value, and your statistical analysis of the remaining parameter will work better. If you are wrong then the data will show you.
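One way to run that check is sketched below, assuming ordinary least squares with independent, equal-variance errors. The function name `intercept_t_statistic` and the data are hypothetical; the standard error comes from the usual OLS covariance formula.

```python
import numpy as np

def intercept_t_statistic(x, y, b_theory):
    """Fit y = m*x + b by OLS; return (b_hat, se_b, t) for the hypothesis b == b_theory."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    X = np.column_stack([np.ones(n), x])          # design matrix with a constant column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta = [b_hat, m_hat]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance, n - 2 degrees of freedom
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance matrix of the estimates
    se_b = float(np.sqrt(cov[0, 0]))
    return float(beta[0]), se_b, float((beta[0] - b_theory) / se_b)

# Illustrative data only: roughly y = 2x + 7 with small noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.1, 10.9, 13.2, 14.8, 17.1])
b_hat, se_b, t = intercept_t_statistic(x, y, b_theory=7.0)
print(f"intercept = {b_hat:.3f} +/- {se_b:.3f}, t = {t:.2f}")
```

A value of |t| that is small compared with the critical value of a t distribution with n - 2 degrees of freedom means the data do not contradict the theoretical intercept.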

FactChecker · May 4, 2018

I guess that @Dale is right about the statistical problems. If you know that the intercept is at y=7, then it would be strange if the confidence interval of the regression constant did not include y=7. If you want a well-accepted statistical result, you may have to accept the regression intercept.

Dale · May 5, 2018

fahraynk said:

for example I know at X=0, Y=7. If I include this point in my data is it overfitting?

I realized that I may have been misinterpreting what you were saying. I thought that you were saying to take your model, y=mx+b, set b to 7 and fit only m. That is a statistically problematic thing to do, although sometimes it can be done with proper precautions and care.

Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.

If that is your approach then I would ask what makes you believe that at X=0 Y=7? If that is based on actual data then why not replicate that data, or directly sample X=0 and see. In other words, instead of making a fake data point, why not make it real?

FactChecker · May 5, 2018

Dale said:

Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.

Once, in a prior lifetime, I tried something like that. I was surprised at how many (0,7) points I had to add to get the result I wanted. Each one nudges it closer to going through (0,7), but it took a very large number to get reasonably close. In any case, it makes the whole process questionable. I suppose one could estimate how confident one is that (0,7) is valid and how much data it would take to change his mind; then put a corresponding number of (0,7) points in. There may be some crude rationalization possible for that, but it seems very ad hoc to me.
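The nudging effect can be reproduced with a small sketch. The data and the helper name `intercept_with_pseudo_points` are hypothetical; the free-fit intercept of this toy data set is deliberately not 7.

```python
import numpy as np

def intercept_with_pseudo_points(x, y, k, x0=0.0, y0=7.0):
    """OLS intercept after appending k copies of the point (x0, y0) to the data."""
    xs = np.concatenate([np.asarray(x, dtype=float), np.full(k, x0)])
    ys = np.concatenate([np.asarray(y, dtype=float), np.full(k, y0)])
    slope, intercept = np.polyfit(xs, ys, 1)   # degree-1 fit returns [slope, intercept]
    return float(intercept)

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 11.5, 13.5, 15.0, 17.0])
for k in (0, 1, 10, 100, 1000):
    print(k, round(intercept_with_pseudo_points(x, y, k), 4))
```

Each added copy shrinks the gap between the fitted intercept and 7, but the gap only reaches zero in the limit of infinitely many copies, which is the same as forcing the line through (0,7).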

The subject of linear regression given coefficient constraints (aka restricted regression) has been studied (see, for instance, section 5.6 of Searle, "Linear Models") and there are implementations (e.g. the SAS RESTRICT statement). There are some YouTube videos that may be relevant (e.g. https://www.youtube.com/watch?v=K-PsdZDaNDE). I have no experience with it, but I assume that the mathematical/statistical approach is valid.
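For reference, the textbook restricted least-squares estimator can be written as follows. The notation is generic (an assumption, not taken from Searle or from SAS): ##X## is the design matrix, ##\hat\beta = (X^TX)^{-1}X^Ty## is the unrestricted estimate, and the linear restriction is ##R\beta = r##.

$$\hat\beta_R = \hat\beta - (X^TX)^{-1}R^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat\beta - r\right)$$

For a line ##Y = b + aX## with the intercept forced to 7, this corresponds to ##R = (1, 0)## and ##r = 7##.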

fahraynk · May 5, 2018

Dale said:

I realized that I may have been misinterpreting what you were saying. I thought that you were saying to take your model, y=mx+b, set b to 7 and fit only m. That is a statistically problematic thing to do, although sometimes it can be done with proper precautions and care.

Upon re-reading your post, however, it seems like what you actually want to do is to take your data set of N observations of (x,y) and add 1 fake observation (0,7) for a total of N+1 data points.

If that is your approach then I would ask what makes you believe that at X=0 Y=7? If that is based on actual data then why not replicate that data, or directly sample X=0 and see. In other words, instead of making a fake data point, why not make it real?

This comes from mixing two molecules together. In the beginning I have pure molecule 1, and so I know the values of molecule 1 for the first measurement.

As soon as I mix them, I don't know how much is molecule 1, and how much is a combination of molecule 1 and molecule 2. I just get the measurements and I have to interpret them.

So I can use the first measurement, where I know the amount corresponds to 100% molecule 1 and none of molecule 2. It is like an intercept. I use y=mx+b as an example because it's easier to talk about the theory.

Should I add the first value, or subtract it from all the others like FactChecker says and fit that? If so, why? Why is knowing the intercept considered overfitting?

fahraynk · May 5, 2018

I found this. Do you think it would work for my problem? If, instead of setting the first coefficient myself or putting the known point in the dataset, I do a regression analysis like this, would it avoid increasing my overfitting?

FactChecker · May 5, 2018

fahraynk said:

Why is knowing the intercept considered overfitting?

Where have you heard that? I interpret the term "overfitting" as having too many free parameters for the amount of data. In the extreme case, too many free parameters make it possible to develop a line that goes artificially through every data point. That is meaningless. I have never heard the term "overfitting" used when the number of free parameters is reduced.

fahraynk · May 5, 2018

FactChecker said:

I have never heard the term "overfitting" used when the number of free parameters is reduced.

Neither have I! That is why I don't get it!

But apparently it is done in the field in popular software.

It might help to see the model equation:

$$\frac{G}{Gt}\beta_1 + \frac{HG}{Gt}\beta_2 = Y^{calculated}$$
##G## and ##HG## are molecule 1 and 2, respectively.
##Gt## is the total amount of G present in both molecules. Later on, some ##G## changes into ##HG##. I only know the value of ##G## at the first point, and I get a measurement at that point which corresponds to ##\frac{G}{Gt}=1## and ##\frac{HG}{Gt}=0##.

So technically I think that means I know ##\beta_1##, which is the measurement I get at ##\frac{G}{Gt}=1##. I could ignore this and just add that measurement to the total data, I can set a restriction like in my last post, or I can do what you said: subtract the first measurement from the rest of the data and fit the difference.

But what is the difference between the three methods? I have been told adding the first measurement to the data would be overfitting, but I don't know why. And people in the field use your method of subtracting the measurement from all the others. I prefer setting the restriction if it would work, because it is a fancier-looking option.

Dale · May 5, 2018

fahraynk said:

This comes from mixing two molecules together. In the beginning I have pure molecule 1, and so I know the values of molecule 1 for the first measurement.

So just make your usual measurement on a sample of the pure molecule 1. Then it is a completely legitimate data point.

fahraynk said:

Should I add the first value, or subtract it from all the others like FactChecker says and fit that?

Once you have a legitimate data point then just treat it like all of the other data points.

fahraynk said:

Why is knowing the intercept considered overfitting?

It isn’t, as far as I know.

fahraynk said:

I have been told adding the first measurement to the data would be overfitting, but I don't know why.

I think that you need to go to the person who told you that and ask for an explanation. It doesn’t seem right to me, but maybe they are making some subtle point.

fahraynk · May 6, 2018

Dale said:

So just make your usual measurement on a sample of the pure molecule 1. Then it is a completely legitimate data point.

Once you have a legitimate data point then just treat it like all of the other data points.

It isn’t, as far as I know.

I think that you need to go to the person who told you that and ask for an explanation. It doesn’t seem right to me, but maybe they are making some subtle point.

Thanks for the reply,

So, you think it's better to add this data point to the rest of the data, and just let the solver figure it out on its own, rather than set a restriction such that ##\beta_1## is equal to the measured value at the first point?

Do you have any idea where FactChecker's idea to subtract the first value from all the other points comes from? I ask because that is what others are telling me to do. I just want to have solid information as to why before I do something weird like that!

I pretty much agree with you though, that it should just be considered a legit data point, and let the solver do whatever it wants.

FactChecker · May 6, 2018

fahraynk said:

Do you have any idea where FactChecker's idea to subtract the first value from all the other points comes from? I ask because that is what others are telling me to do. I just want to have solid information as to why before I do something weird like that!

I never meant for you to "subtract the first value". The idea is to subtract the theoretically known intercept value of 7 from all Y data values. That changes the regression into one with a new dependent variable, ##Y_2 = Y - 7##, whose regression line should theoretically go through (0,0). That is simpler and there are several articles and Youtube videos on how to handle a regression with no (zero) constant term like that. Once you obtain such a regression line, ##Y_2 = aX##, you have an equation for ##Y = Y_2 + 7 = aX + 7##

Dale · May 6, 2018

fahraynk said:

So, you think it's better to add this data point to the rest of the data, and just let the solver figure it out on its own, rather than set a restriction such that ##\beta_1## is equal to the measured value at the first point?

Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.

fahraynk · May 6, 2018

FactChecker said:

I never meant for you to "subtract the first value". The idea is to subtract the theoretically known intercept value of 7 from all Y data values. That changes the regression into one with a new dependent variable, ##Y_2 = Y - 7##, whose regression line should theoretically go through (0,0). That is simpler and there are several articles and Youtube videos on how to handle a regression with no (zero) constant term like that. Once you obtain such a regression line, ##Y_2 = aX##, you have an equation for ##Y = Y_2 + 7 = aX + 7##

Yeah, sorry for confusion, [0,7] is considered the "first value" in my data set. Interestingly I calculated 40 different results by subtracting the first intercept and by the matrix equations setting a restriction on the first intercept that I posted above, and both produce identical results and residuals! So the subtraction of the known intercept must be the same as setting a restriction on the intercept. Thanks for sending me to the Youtube videos.
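That equivalence can be checked numerically. The sketch below uses hypothetical data and function names, and it assumes the restriction is implemented with the textbook restricted least-squares formula, which may differ in form from the matrix equations used in the thread.

```python
import numpy as np

def fit_by_subtraction(x, y, b0):
    """Fix the intercept at b0 by fitting (y - b0) = a*x with no constant; returns [b0, a]."""
    a = x @ (y - b0) / (x @ x)
    return np.array([b0, a])

def fit_by_restriction(x, y, b0):
    """Restricted least squares for y = b + a*x subject to b == b0; returns [b, a]."""
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_free = XtX_inv @ X.T @ y              # unrestricted OLS estimate
    R = np.array([[1.0, 0.0]])                 # selects the intercept
    r = np.array([b0])                         # value the intercept is forced to
    middle = np.linalg.inv(R @ XtX_inv @ R.T)
    return beta_free - XtX_inv @ R.T @ middle @ (R @ beta_free - r)

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 11.5, 13.5, 15.0, 17.0])
print(fit_by_subtraction(x, y, 7.0))
print(fit_by_restriction(x, y, 7.0))
```

Both routes minimise the same sum of squares over the same one-parameter family of lines through (0,7), so they agree up to rounding.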

Dale said:

Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.

Thanks, I agree with you in not liking the restriction. Is there any way you know of that I can quantify (or understand intuitively if not quantify) the bias caused with and without restrictions?

FactChecker · May 6, 2018

fahraynk said:

Yeah, sorry for confusion, [0,7] is considered the "first value" in my data set. Interestingly I calculated 40 different results by subtracting the first intercept and by the matrix equations setting a restriction on the first intercept that I posted above, and both produce identical results and residuals! So the subtraction of the known intercept must be the same as setting a restriction on the intercept. Thanks for sending me to the Youtube videos.

Ok. There are two possibilities:
1) You only know that the data you collected had a datapoint (0,7), but you have no "indisputable" theoretical reason to say that the average value at X=0 will be Y=7. In that case, you just have a typical linear regression and should do nothing unusual.
2) You have an "indisputable" theoretical reason to know that the average value at X=0 is Y=7 which is solid enough that you do not want any statistical result which disagrees with it. Then you should use a model that forces the linear regression through X=0, Y=7.

The second situation was how I interpreted your original post. All my posts above have been only for that situation.

fahraynk · May 6, 2018

FactChecker said:

Ok. There are two possibilities:
1) You only know that the data you collected had a datapoint (0,7), but you have no "indisputable" theoretical reason to say that the average value at X=0 will be Y=7. In that case, you just have a typical linear regression and should do nothing unusual.
2) You have an "indisputable" theoretical reason to know that the average value at X=0 is Y=7 which is solid enough that you do not want any statistical result which disagrees with it. Then you should use a model that forces the linear regression through X=0, Y=7.

The second situation was how I interpreted your original post. All my posts above have been only for that situation.

Yeah, the second situation is correct. I know at x=0 y=7 for a fact.
So, you think that the results will be worse if I don't force the model to conform?
I want to quantify this somehow.
Most of the answers I compute are really close with or without restricting the intercept, but there is one case (out of 40) where the answer I get with a restricted model is 28, and without a restricted model the answer is 114. I am not sure which is better, or how to explain it. The residual is always better with the restricted model, but I assume that is because it is fitting one fewer data point. (There are about 12 data points in total for each case I am fitting, but the data is true, with very small noise variation, so I guess that fitting 11 points vs 12 might show in the residual.)

FactChecker · May 6, 2018

fahraynk said:

Yeah, the second situation is correct. I know at x=0 y=7 for a fact.
So, you think that the results will be worse if I don't force the model to conform?
I want to quantify this somehow.
Most of the answers I compute are really close with or without restricting the intercept, but there is one case (out of 40) where the answer I get with a restricted model is 28, and without a restricted model the answer is 114. I am not sure which is better, or how to explain it. The residual is always better with the restricted model, but I assume that is because it is fitting one fewer data point. (There are about 12 data points in total for each case I am fitting, but the data is true, with very small noise variation, so I guess that fitting 11 points vs 12 might show in the residual.)

If you know for a fact that the average value of Y at X=0 must be 7, then you have no choice. All else being equal, you must give preference to a model that is correct over one that is wrong.

FactChecker · May 6, 2018

Here is where the term "overfitting" applies to this issue. By allowing the regression model to use an intercept that is wrong, it will provide a "better" fit to the data which is erroneous. That is "overfitting" because it is allowing a better fit due to an erroneous free constant parameter.

fahraynk · May 6, 2018

FactChecker said:

Here is where the term "overfitting" applies to this issue. By allowing the regression model to use an intercept that is wrong, it will provide a "better" fit to the data which is erroneous. That is "overfitting" because it is allowing a better fit due to an erroneous free constant parameter.

Ah!
Nice. Thanks!
But, it can't be that bad if the residual is lower for the case with the restriction! I guess the solutions are close because the model is trying to solve for that parameter anyway, and it has a lot of weight since there are only 12 data points. That explanation is perfect.

Dale said:

Yes, I do. Restrictions can be added, but they often have unintended consequences such as introducing bias in the other regression parameters and producing residuals that are not 0 mean.

When should this trade off be made? Is it quantifiable? Because I have 40 data sets, and 1 out of 40 gives much different results with and without adding the restriction, while all the others produce about the same value with and without the restriction. I have to choose between the models, should I just go with the one with the lower residual?

Dale · May 6, 2018

fahraynk said:

When should this trade off be made? Is it quantifiable?

No, there isn’t a quantitative test you can apply that would justify it. Basically, you would need to have a reason so convincing that people reading it would agree it is necessary.

fahraynk said:

Because I have 40 data sets, and 1 out of 40 gives much different results with and without adding the restriction, while all the others produce about the same value with and without the restriction. I have to choose between the models, should I just go with the one with the lower residual?

Based on this, it seems like you should not use the restriction. It is difficult to justify and it doesn’t make a difference more than 95% of the time. The one time that it makes a difference is very likely to be a statistical outlier since you would expect one or more with 40 data sets.
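The arithmetic behind that expectation can be sketched as follows, under the assumption (not stated in the thread) that the 40 data sets are independent and each has a 5% chance of producing an unusual result purely by chance.

```python
n_datasets = 40
chance_rate = 0.05   # assumed per-dataset probability of an unusual result

expected_unusual = n_datasets * chance_rate
p_at_least_one = 1.0 - (1.0 - chance_rate) ** n_datasets

print(f"expected unusual datasets: {expected_unusual:.1f}")
print(f"P(at least one): {p_at_least_one:.3f}")
```

Under those assumptions about two unusual data sets are expected, and the chance of seeing at least one is roughly 87%.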

FactChecker · May 6, 2018

fahraynk said:

Ah!
Nice. Thanks!
But, it can't be that bad if the residual is lower for the case with the restriction!

That sounds wrong. If the only thing you did was to allow a regression constant in one model and not allow it in another model, the one where the regression constant is allowed must have a smaller sum-squared residual error. It has more freedom to fit the data while ignoring the theoretical x=0,y=7 intercept value. That being said, its smaller sum-squared residual error might give the regression a worse statistical measure of fit because of its added degree of freedom.
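A small numerical sketch of both statements, using hypothetical data and function names. Both models are fitted to exactly the same points; only the number of free parameters differs.

```python
import numpy as np

def sse_free_intercept(x, y):
    """Sum of squared residuals for y = m*x + b with both parameters fitted."""
    m, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (m * x + b)) ** 2))

def sse_fixed_intercept(x, y, b0):
    """Sum of squared residuals for y = a*x + b0 with only the slope fitted."""
    a = x @ (y - b0) / (x @ x)
    return float(np.sum((y - (a * x + b0)) ** 2))

# Illustrative data only: roughly y = 2x + 7 with small noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.1, 10.9, 13.2, 14.8, 17.1])
n = x.size

sse_free = sse_free_intercept(x, y)
sse_fixed = sse_fixed_intercept(x, y, 7.0)

# Residual variance estimates divide by the residual degrees of freedom.
mse_free = sse_free / (n - 2)    # two fitted parameters
mse_fixed = sse_fixed / (n - 1)  # one fitted parameter

print(f"SSE free = {sse_free:.4f}, SSE fixed = {sse_fixed:.4f}")
print(f"MSE free = {mse_free:.4f}, MSE fixed = {mse_fixed:.4f}")
```

On this toy data the free-intercept model has the smaller sum of squared residuals, as it must, while the fixed-intercept model has the smaller residual variance estimate because it spends one fewer degree of freedom.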

Dale · May 6, 2018

A lot of software packages change the meaning of the residuals when you fit to a model without an intercept. So you often cannot compare residuals with and without an intercept. It is important to actually read and thoroughly understand the documentation of the specific package if you choose this route.
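One common instance of this concerns ##R^2##: some packages switch from the usual mean-centered total sum of squares to an uncentered one when the model has no intercept. The sketch below, with hypothetical data and function names, shows how different the two numbers can be for the same through-origin fit, and that the residuals no longer average to zero.

```python
import numpy as np

def r2_centered(y, y_hat):
    """1 - SSE/SST with SST measured about the mean of y (the usual definition)."""
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

def r2_uncentered(y, y_hat):
    """1 - SSE/SST with SST measured about zero, as sometimes reported for no-intercept fits."""
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2))

# Illustrative data only; the true intercept is clearly not zero.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 11.5, 13.5, 15.0, 17.0])

a = x @ y / (x @ x)      # through-origin fit, no constant term
y_hat = a * x

print("centered R^2:  ", round(r2_centered(y, y_hat), 3))
print("uncentered R^2:", round(r2_uncentered(y, y_hat), 3))
print("mean residual: ", round(float(np.mean(y - y_hat)), 3))
```

Here the uncentered value looks excellent while the centered value is negative, so the two cannot be compared across models with and without an intercept.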

fahraynk · May 6, 2018

Dale said:

A lot of software packages change the meaning of the residuals when you fit to a model without an intercept. So you often cannot compare residuals with and without an intercept. It is important to actually read and thoroughly understand the documentation of the specific package if you choose this route.

I programmed the solver in this case

fahraynk · May 6, 2018

FactChecker said:

That sounds wrong. If the only thing you did was to allow a regression constant in one model and not allow it in another model, the one where the regression constant is allowed must have a smaller sum-squared residual error. It has more freedom to fit the data while ignoring the theoretical x=0,y=7 intercept value. That being said, its smaller sum-squared residual error might give the regression a worse statistical measure of fit because of its added degree of freedom.

We are talking about the sum of squared error between 12 points and the prediction vs the sum of squared error between 11 points and the prediction, but the difference is small; they are all pretty close.
If the solver would choose a value near the true intercept anyway, the difference may be small, and that small difference might not be overcome by the 1 additional data point, no?

Thank you both for all your help, by the way. You have been really helpful and awesome, Dale and FactChecker.

FactChecker · May 7, 2018

fahraynk said:

We are talking about the sum of squared error between 12 points and the prediction vs the sum of squared error between 11 points and the prediction, but the difference is small; they are all pretty close.
If the solver would choose a value near the true intercept anyway, the difference may be small, and that small difference might not be overcome by the 1 additional data point, no?

Ok. It looks to me like you are still talking about adding one point at the theoretical intercept and comparing with and without that point. Is that right? I would not recommend that at all. If you know the theoretical intercept, define a model for that (with an intercept at (0,7)) and do the statistics for that. Otherwise, allow the regression to determine a best-fit intercept as a normal part of its regression. In either case, start with the identical number of sample data -- just use different models.

Dale · May 7, 2018

FactChecker said:

If you know the theoretical intercept, define a model for that (with an intercept at (0,7)) and do the statistics for that.

I would not recommend that approach. It is statistically nonstandard and leads to many subtle statistical problems that would need to be addressed. It is also difficult to convince readers that it must be done, and if it is done and makes a substantial difference then it indicates a flaw in either your data or your model.

I would recommend actually acquiring the X=0 data points and then doing a standard fit with the intercept.

fahraynk · May 7, 2018

Dale said:

I would not recommend that approach. It is statistically nonstandard and leads to many subtle statistical problems that would need to be addressed. It is also difficult to convince readers that it must be done, and if it is done and makes a substantial difference then it indicates a flaw in either your data or your model.

I would recommend actually acquiring the X=0 data points and then doing a standard fit with the intercept.

So I did a test: I calculated the average residual over all models, and the average difference between the true intercept and the fitted intercept.

So the answer with the subtraction method is much closer to the true intercept and also has a lower residual, but I don't think I can trust the lower residual.

With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?

Thank you so much, the advice I am getting has been very VERY helpful

FactChecker · May 7, 2018

fahraynk said:

With this information in mind, do you still agree to include the intercept in the measured data points and fit normally with no restrictions?

This is just the wrong direction to go in. If that is not a measured experimental data point, then it is a theoretical point. If it's a theoretical true value, why only include it once? Why not two or three times? Or 20 times? Pretending that the theoretical intercept is an experimental sample data point is the wrong way to go.
You should either give up the theoretical intercept and apply a regression model allowing it to freely determine the intercept, or you should use a model that forces an intercept of (0,7).

I would not recommend ignoring the theoretical known value, no matter how it changes the statistical results. I see no reason to prefer accurate statistical measures of a known invalid model. I would prefer modified statistical measures of a known valid model.
