B Comparing Approaches: Linear Regression of Y on X vs X on Y

FactChecker
Dale said:
Yes, this is correct.

@FactChecker can confirm, but I don’t think that he is disagreeing with me. He is just showing you why the two choices are not equivalent.
I do disagree. How accurately a variable can be measured is not the significant issue. The head/tail result of a coin toss can be measured with great accuracy, but that does not make that result the independent variable. The decision of whether to model ##Y=aX+b+\epsilon## versus ##X=a'Y+b'+\epsilon'## is a matter of how you will use the data, what SSE you want to minimize, and whether you want the standard statistical theory and results to apply to your use. How the data will be used should determine which linear regression to do.
It's essential to be minimizing the correct errors. The regression of X as a linear function of Y is guaranteed to minimize ##\sum (x_i-\hat {x_i})^2##.
The two approaches are easy to compare. Just do both regressions and see which one has the smaller SSE for that sample using ##y_i## to estimate ##x_i##.
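For anyone who wants to try this concretely, here is a minimal sketch of that comparison in Python/NumPy. The data-generating model, noise level, and seed below are my own illustrative choices, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (x_i, y_i) sample; the particular model here is only for illustration.
x = np.linspace(0, 1, 101)
y = 2 * x + 5 + rng.normal(0, 0.5, x.size)

# Regression of Y on X: y ~ a*x + b
a, b = np.polyfit(x, y, 1)
# Regression of X on Y: x ~ a2*y + b2
a2, b2 = np.polyfit(y, x, 1)

# Use each fitted line to estimate x from y, then compare the SSE in x.
x_hat_inverted = (y - b) / a       # Y-on-X line solved for x
x_hat_direct = a2 * y + b2         # X-on-Y prediction

sse_inverted = np.sum((x - x_hat_inverted) ** 2)
sse_direct = np.sum((x - x_hat_direct) ** 2)
print(f"SSE in x, inverted Y-on-X fit: {sse_inverted:.4f}")
print(f"SSE in x, direct X-on-Y fit:   {sse_direct:.4f}")
```

By construction, the direct X-on-Y least-squares line minimizes ##\sum (x_i-\hat{x_i})^2## over all lines, so it can never lose this particular comparison.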
 
FactChecker said:
How accurately a variable can be measured is not the significant issue. ...
The two approaches are easy to compare. Just do both regressions and see which one has the smaller SSE for that sample using ##y_i## to estimate ##x_i##.
Indeed, so let's do a Monte Carlo simulation and see. I used ##x## going from 0 to 1 in steps of 0.01, with a true regression of ##y=2 x + 5##. I added zero-mean Gaussian white noise to both x and y, and ran two linear regressions, one for the model ##y= a x + b + \epsilon## and the other for the model ##x = a' y + b' + \epsilon'##, which I then inverted to get an estimate for ##y=a x + b##.

First, I set ##\sigma_x=0.01## and ##\sigma_y=0.5##. Then the first fit gave ##y=1.90 x + 5.03## and after the inversion the second fit gave ##y=3.54x+4.20##. In this case the first fit gave regression coefficients much closer to the true values.

Second, I set ##\sigma_x=0.5## and ##\sigma_y=0.01##. Then the first fit gave ##y=0.54 x + 5.78## and after the inversion the second fit gave ##y=1.61 x + 5.33##. In this case the second fit gave regression coefficients closer to the true values.

In both cases, the best regression was obtained when the "noisier" variable was the one modeled. So when ##\sigma_x > \sigma_y## the better model was ##x = a' y + b' + \epsilon##, even though the resulting fit had to be inverted to use as desired.
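A sketch of this kind of single-run simulation (the seed and the use of np.polyfit are my own choices; the coefficients will of course vary from run to run):

```python
import numpy as np

rng = np.random.default_rng(1)

x_true = np.linspace(0, 1, 101)      # x from 0 to 1 in steps of 0.01
y_true = 2 * x_true + 5              # true regression y = 2x + 5

def one_run(sigma_x, sigma_y):
    x = x_true + rng.normal(0, sigma_x, x_true.size)
    y = y_true + rng.normal(0, sigma_y, y_true.size)
    a, b = np.polyfit(x, y, 1)       # fit of the model y = a x + b + eps
    ap, bp = np.polyfit(y, x, 1)     # fit of the model x = a' y + b' + eps'
    # Invert the second fit to express it as y = (1/a') x - b'/a'
    return (a, b), (1 / ap, -bp / ap)

print("sigma_x=0.01, sigma_y=0.5 :", one_run(0.01, 0.5))
print("sigma_x=0.5,  sigma_y=0.01:", one_run(0.5, 0.01))
```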
 
There are ultimately formulas that allow us to obtain ##E(Y|X)## from ##E(X|Y)##. Maybe we can use them to estimate ##E(y|X=x_0)## from ##E(x|Y=y_0)##?
 
Dale said:
In this case the second fit gave regression coefficients closer to the true values.
Is the problem to minimize the squared error of the estimated regression coefficients, or is it to minimize the squared error of predictions of the given y data from the given x data?
 
Dale said:
Indeed, so let's do a Monte Carlo simulation and see. I used ##x## going from 0 to 1 in steps of 0.01, with a true regression of ##y=2 x + 5##. I added zero-mean Gaussian white noise to both x and y, and ran two linear regressions, one for the model ##y= a x + b + \epsilon## and the other for the model ##x = a' y + b' + \epsilon'##, which I then inverted to get an estimate for ##y=a x + b##.

First, I set ##\sigma_x=0.01## and ##\sigma_y=0.5##. Then the first fit gave ##y=1.90 x + 5.03## and after the inversion the second fit gave ##y=3.54x+4.20##. In this case the first fit gave regression coefficients much closer to the true values.

Second, I set ##\sigma_x=0.5## and ##\sigma_y=0.01##. Then the first fit gave ##y=0.54 x + 5.78## and after the inversion the second fit gave ##y=1.61 x + 5.33##. In this case the second fit gave regression coefficients closer to the true values.

In both cases, the best regression was obtained when the "noisier" variable was the one modeled. So when ##\sigma_x > \sigma_y## the better model was ##x = a' y + b' + \epsilon##, even though the resulting fit had to be inverted to use as desired.
For a given simulated data set, ##(x_i, y_i)##, see which regression line gives the better SSE, ##\sum (x_i-\hat{x_i})^2##. Unless the regression algorithm is flawed, it must be the line obtained from the X=a'Y+b' regression, because that is the minimization that the regression algorithm for X=a'Y+b' performs. The Y=aX+b regression line is minimizing the wrong thing. It is minimizing ##\sum (y_i-\hat{y_i})^2##.

PS. If you simulate a certain model and change the criteria for which approach is "best", then the test may be rigged so that the "better" result is of the simulated form.
 
WWGD said:
There are ultimately formulas that allow us to obtain E( Y|X) from E(X|Y). Maybe we can use them to estimate E( y|X=xo) from E(x| Y=yo)?
Just to be clear. For any given ##x_0## you may have little or no sample data at or near that value. So you must specify a model and a form of the model equation that allows you to use a large number of your sample data to get an estimate at ##x_0##. That is what you get from the linear regression line.
 
FactChecker said:
The Y=aX+b regression line is minimizing the wrong thing.
Why is that the wrong thing? If most of your errors are in Y then you get a better result minimizing that.
 
Dale said:
Why is that the wrong thing? If most of your errors are in Y then you get a better result minimizing that.
It's wrong because the goal is to estimate X. If you use a model that is worse (sometimes very much worse) on the sample data, then you can expect it to be worse for the intended use.
 
Dale said:
Why is that the wrong thing? If most of your errors are in Y then you get a better result minimizing that.

It's easy to agree with that on an intuitive level, but I think it's challenging to formulate that thought rigorously. How is the "result" quantified?

For example, if we assert ##y = Ax + B## and we mis-estimate ##A## by 0.5 and mis-estimate ##B## by 0.2 then is the result (0.5)(0.5) + (0.2)(0.2) ? - and is it the same result as mis-estimating ##A## by 0.2 and ##B## by 0.5 ?
 
  • #10
FactChecker said:
It's wrong because the goal is to estimate X. If you use a model that is worse (sometimes very much worse) on the sample data, then you can expect it to be worse for the intended use.
But it isn’t worse. See the Monte Carlo results above.
 
  • #11
Stephen Tashi said:
I think it's challenging to formulate that thought rigorously
I definitely agree with that. And this isn’t something that the usual diagnostics check.

The bigger point is that one of the assumptions of OLS regression is that the independent variables have zero error. In practice that is never true, but “close enough” is fine. Sometimes you can get “close enough” by flipping your variables, and sometimes you need completely different techniques. But simply ignoring a large violation of this assumption can cause problems, as shown above.
 
  • #12
Dale said:
But it isn’t worse. See the Monte Carlo results above.
I read that post but did not see anything about how well the alternatives did at estimating x values. By the definition of the regression algorithm, the linear regression for the model X = aY+b will minimize ##\sum (x_i-\hat {x_i})^2##. I consider anything else to be worse. The other regression might appear better in some respects because your simulation model was of a matching form, but that is not a valid test.
 
  • #13
FactChecker said:
I read that post but did not see anything about how well the alternatives did at estimating x values
Often the goal is to estimate the model coefficients. Particularly when those coefficients have some known meaning.
 
  • #14
Dale said:
Often the goal is to estimate the model coefficients. Particularly when those coefficients have some known meaning.
It is probably an advantage to estimate the parameters of the correct model. You used a ##Y=aX+b+\epsilon## model to generate data in a simulation, and then the Y=aX+b linear regression performed better at parameter estimation. I would have to think about that. But for the OP's data, is there any logical reason to pick that model without even knowing where the data came from?
In general, if you are trying to get the line, ##\hat{X}=a'Y+b'##, that best estimates X based on Y from a set of ##(x_i,y_i)## data, then it is better to minimize the correct thing, which is ##\sum (x_i-\hat{x_i})^2##, not ##\sum (y_i-\hat{y_i})^2##.
 
  • #15
Hard to see in practice how there would be confusion about which variable is dependent and which is independent.

It is not necessarily the more volatile one that goes on the left-hand side. For example, if you have an individual stock and the S&P 500, you regress the stock return against the index return, even though it is possible (though not likely) that the standard deviation of the stock is less than that of the index.
 
  • #16
FactChecker said:
In general, if you are trying to get the line, ##\hat{X}=a'Y+b'##, that best estimates X based on Y from a set of ##(x_i,y_i)## data, then it is better to minimize the correct thing, which is ##\sum (x_i-\hat{x_i})^2##, not ##\sum (y_i-\hat{y_i})^2##.
Again, you can just test that sort of claim by running a Monte Carlo simulation. So, similar to what I did before, consider the true values of ##y## going from 0 to 1 in steps of 0.01 and the true values of ##x=2y+5##. I then added zero-mean Gaussian white noise to ##x## and ##y## with ##\sigma_x=0.01## and ##\sigma_y=0.5##. Next I did two fits, a "forward" fit of ##x=a y + b + \epsilon## and an "inverse" fit of ##y = a' x + b'+ \epsilon'##, where the desired fit parameters were then determined by ##a=1/a'## and ##b=-b'/a'##. I repeated this process 10,000 times.

So, if we look at the sum of squared residuals on the data, we see that indeed, as you have stated, the forward fit has a substantially smaller sum of squared residuals to the data.
[Figure: distribution of the sum of squared residuals to the data for the forward and inverse fits]


However, if we look at the sum of squared residuals to the true regression line, we see a very different outcome:
[Figure: distribution of the sum of squared residuals to the true regression line for the forward and inverse fits]

So the forward fit is closer to the data, but the inverse fit is closer to the true relationship in a least-squares sense. In other words, the forward fit is fitting the noise rather than the actual relationship.

More importantly, if we look at the fit parameters we see that for both the slope and the intercept parameters, the forward fit is rather strongly biased whereas the inverse fit parameters appear unbiased.
[Figures: sampling distributions of the fitted slope and intercept for the forward and inverse fits]


Finally, we can compare the fit lines with the true regression. Notice how reliably wrong the forward fit is.
[Figure: fitted lines from the forward and inverse fits compared with the true regression line]

So the forward fit is the "best estimate" only in one very narrow sense. However, that does not mean that it is generally a better choice.

The issue is that the narrow sense in which it is better relies on an assumption which is strongly violated because ##\sigma_y## is so large. With the violation of this assumption, the usual fit is no longer an unbiased minimum-variance estimator. It is therefore better to switch to the inverse model, which does not violate the assumption. Even though the resulting fits are suboptimal in the narrow sense, they are better under a much broader set of criteria, and importantly the parameter estimates are unbiased.

Another alternative is to use an "errors in variables" model that does not assume the "independent" variable has no errors. But as we see, when one variable approximately satisfies the assumption, you can use it as the predictor in a standard least-squares fit and then invert the model.
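A sketch of a repeated simulation along these lines, tabulating only the mean fitted parameters rather than reproducing the histograms (the seed and implementation details are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

y_true = np.linspace(0, 1, 101)      # true y from 0 to 1 in steps of 0.01
x_true = 2 * y_true + 5              # true relationship x = 2y + 5
sigma_x, sigma_y = 0.01, 0.5
n_rep = 10_000

fwd, inv = [], []
for _ in range(n_rep):
    x = x_true + rng.normal(0, sigma_x, x_true.size)
    y = y_true + rng.normal(0, sigma_y, y_true.size)
    a, b = np.polyfit(y, x, 1)       # "forward" fit: x = a y + b (noisy y as predictor)
    ap, bp = np.polyfit(x, y, 1)     # "inverse" fit: y = a' x + b'
    fwd.append((a, b))
    inv.append((1 / ap, -bp / ap))   # re-expressed as x = a y + b

fwd, inv = np.array(fwd), np.array(inv)
print("true (a, b)        :", (2.0, 5.0))
print("forward mean (a, b):", fwd.mean(axis=0))  # expected to show the attenuation bias
print("inverse mean (a, b):", inv.mean(axis=0))  # expected to sit near (2, 5)
```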
 
  • #17
This gives me food for thought. If you do not know ahead of time that the data came from a simulation of Y = aX+b, how can a person distinguish which is the best regression to use either for parameter estimation or for estimations of X from Y?
 
  • #18
FactChecker said:
how can a person distinguish which is the best regression to use either for parameter estimation or for estimations of X from Y?
If one variable (X in the recent example) has a small standard deviation and the other does not, then the accurate one should serve as the predictor. Again, fundamentally this is about checking the validity of the model assumptions.
 
  • #19
Dale said:
If one variable (X in the recent example) has a small standard deviation and the other does not, then the accurate one should serve as the predictor. Again, fundamentally this is about checking the validity of the model assumptions.
It surprises me that there is a significant difference. Modeling ##Y = aX+b+\epsilon## is the same as modeling ##X=(1/a)Y-b/a-\epsilon/a##, both linear regression problems. And it seems like the difference in the standard deviation is just a matter of the range of values and the units of measurement for the two variables, neither of which should really matter. So I don't see where one can be better due to that.

But I do see the advantage of minimizing the correct SSE, ##\sum (x_i-\hat{x_i})^2##.
 
  • #20
FactChecker said:
But I do see the advantage of minimizing the correct SSE, ##\sum (x_i-\hat{x_i})^2##.
How can you rationalize that claim after the above demonstration? The evidence shows that it is clearly disadvantageous.

The fact that the resulting estimates are biased is a death-knell. If a technique is unbiased but not minimum variance, then you simply need more data to get a good estimate and improve the variance. But if a technique is biased then no amount of additional data will fix it.

It is not "correct" to use a technique whose assumptions are violated, even if doing so minimizes some variance.

FactChecker said:
And it seems like the difference in the standard deviation is just a matter of the range of values and the units of measurement for the two variables, neither of which should really matter. So I don't see where one can be better due to that.
And yet, a Monte Carlo simulation easily shows that it is better. Assumptions are important in statistics.
 
  • #21
Dale said:
How can you rationalize that claim after the above demonstration? The evidence shows that it is clearly disadvantageous.

The fact that the resulting estimates are biased is a death-knell. If a technique is unbiased but not minimum variance, then you simply need more data to get a good estimate and improve the variance. But if a technique is biased then no amount of additional data will fix it.

It is not "correct" to use a technique whose assumptions are violated, even if doing so minimizes some variance.And yet, a Monte Carlo simulation easily shows that it is better. Assumptions are important in statistics.
Minimizing the correct errors is minimizing the correct errors. If your simulation analysis shows otherwise, then it is seriously flawed.
Your test did not analyze the effect of the range and units of measurement of the two variables. The decision of which regression to use should be agnostic of scale and units of measure. But I can mathematically see that both directly influence, perhaps dominate, your recommended decision.
 
  • #22
Suppose the true model is ##X=100Y+\epsilon##. Then for each ##(x_i, y_i, \epsilon_i)## we have ##x_i=100 y_i + \epsilon_i## and ##y_i = x_i/100 - \epsilon_i/100##. So clearly, the SD of the sample ##y_i##s is orders of magnitude smaller than that of the ##x_i##s. Your recommendation is that the regression should be with Y as the dependent variable and X independent. That will be minimizing the wrong SSE. I don't think that can be justified.
 
  • #23
FactChecker said:
Minimizing the correct errors is minimizing the correct errors.
How can you justify calling a biased minimization “correct”? What is “correct” about bias?

FactChecker said:
That will be minimizing the wrong SSE. I don't think that can be justified.
I am not defending that. As I said in post 11, I agree that it is challenging to formulate this idea rigorously. I don't know of a standard test for making this decision. So I do not advocate an ignorant or blind decision. I am merely pointing out that the decision requires considering the validity of the "zero error" assumption. The result of that consideration may be that the "inverse" approach is actually the better choice.

FactChecker said:
The decision of which regression to use should be agnostic of scale and units of measure.
That is not generally true in statistics.
 
  • #24
Dale said:
How can you justify calling a biased minimization “correct”? What is “correct” about bias? I am not defending that. As I said in post 11, I agree that it is challenging to formulate this idea rigorously. I don't know of a standard test for making this decision. So I do not advocate an ignorant or blind decision. I am merely pointing out that the decision requires considering the validity of the "zero error" assumption. The result of that consideration may be that the "inverse" approach is actually the better choice. That is not generally true in statistics.
Here is one consequence of your recommendation.
Suppose we have an experiment of temperatures versus associated positions. Your recommendation would likely change depending on whether the temperatures were measured in Fahrenheit or Celsius and whether the positions were measured in inches, feet, or yards.
I do not like that. If you think that is right, I guess we will just have to agree to disagree.
 
  • #25
FactChecker said:
Here is one consequence of your recommendation.
Suppose we have an experiment of temperatures versus associated positions. Your recommendation would likely change depending on whether the temperatures were measured in Fahrenheit or Celsius and whether the positions were measured in inches, feet, or yards.
I do not like that. If you think that is right, I guess we will just have to agree to disagree.
This is a strawman. As I have stated 3 times now it is challenging to formulate this issue rigorously, and I know of no formal test for it. So I am not advocating a blind rule like your strawman.

FactChecker said:
I guess we will just have to agree to disagree.
Ok, but the evidence is pretty clear: strong enough violations of the assumption will introduce bias. That much is not a matter of opinion.

The matter of opinion is only whether or not it is acceptable to choose a biased estimator when an unbiased estimator is available.
 
  • #26
Dale said:
This is a strawman.
Is it? It seems to me like a direct and practical implication of your recommendation. I have no reason to think that the OP is not an example of this.
Your recommendation basically says that we should prefer linear regression models ##Y=aX+b## where ##a \gt 1##, whereas I have no problem with ##a \lt 1##, especially if it means that the correct SSE is being minimized.
 
  • #27
FactChecker said:
Is it? It seems to me like a direct and practical implication of your recommendation.
Yes, as I clarified 3 times.

FactChecker said:
Your recommendation basically says that we should prefer linear regression models Y=aX+b, where a>1.
Even the blind application of exactly what I said doesn’t lead to that.
 
  • #28
Dale said:
Even the blind application of exactly what I said doesn’t lead to that.
Simplest example.
Suppose ##X=Y/a##. Then ##Y=aX## and ##\mathrm{SD}_X = (1/a)\,\mathrm{SD}_Y##. Your recommendation is to make the variable with the smallest SD the independent variable. So ##a>1## would make X the independent variable, and the regression algorithm applied to ##Y = aX+b## would minimize ##\sum(y_i-\hat{y_i})^2##. That is, you prefer a regression model of the form ##Y=aX+b## with ##a>1##, even if it minimizes the wrong SSE for estimating X.
 
  • #29
FactChecker said:
Simplest example.
Obvious counterexample: the Monte Carlo simulation above. In that one, the “inverse” model that produced the unbiased fit had a slope of 0.5.

FactChecker said:
Your recommendation is to make the variable with the smallest SD the independent variable. So a>1 would make X the independent variable
That doesn’t follow at all.
 
  • #30
Dale said:
That doesn’t follow at all.
Maybe we are not talking about the same thing. For ##a\gt 1## in the example of Post #28, the basic properties of SD would give ##\sigma(Y)=\sigma(aX)=|a|\,\sigma(X) \gt \sigma(X)##. (I am assuming we are not talking about the degenerate case ##\sigma(X)=0##.)
 
  • #31
FactChecker said:
Maybe we are not talking about the same thing.
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
 
  • #32
Dale said:
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
 
  • #33
FactChecker said:
But the separation of those two causes of variation is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Maybe this is related to the discussion?

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides non-biased regression estimates, the lower variance solutions produced by regularization techniques provide superior MSE performance.

https://en.wikipedia.org/wiki/Bias–variance_tradeoff#Bias–variance_decomposition_of_mean_squared_error

https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55
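For reference, the decomposition that excerpt refers to is usually written as follows (standard form; the notation is mine rather than taken from the links), for data generated as ##y=f(x)+\varepsilon## with ##\operatorname{Var}(\varepsilon)=\sigma^2##:

$$\operatorname{E}\big[(y-\hat f(x))^2\big]=\underbrace{\big(\operatorname{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2}+\underbrace{\operatorname{E}\big[(\hat f(x)-\operatorname{E}[\hat f(x)])^2\big]}_{\text{variance}}+\underbrace{\sigma^2}_{\text{irreducible error}}$$

where the expectation is taken over both the noise and the training data used to produce ##\hat f##.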
 
  • #34
Jarvis323 said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.
 
  • #35
FactChecker said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.

It's a frustratingly confusing subject for me for some reason.

This image is helpful.

[Image: bias-variance tradeoff figure from the article linked below]


https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577

For linear regression, if the assumptions hold,

[Image: slide stating that OLS is the minimum-variance unbiased estimator when the assumptions hold, from the notes linked below]


https://people.eecs.berkeley.edu/~jegonzal/assets/slides/linear_regression.pdf
 
  • #36
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent.
The uncertainty in the measurement is usually explicitly stated in the documentation from the manufacturer of a measurement device. Typically it is one of their specifications, so it usually is not too difficult to find.

FactChecker said:
Anything else is worse.
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.

Amongst actual statisticians, usually the preferred estimator is a minimum-variance unbiased estimator. That means, out of all of the unbiased estimators, the one with the minimum variance. Typically any unbiased estimator is preferred over a biased estimator, even when the biased estimator has less variance. It is well-known that typically there are biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
 
  • #37
Jarvis323 said:
Maybe this is related to the discussion?
Jarvis323 said:
For linear regression, if the assumptions hold,
Certainly that second one is directly related. In particular, the statement that the OLS is a minimum-variance unbiased estimator depends on the assumptions holding. The issue that I am identifying is the result of one of those assumptions not holding. In such a situation you get a strongly biased estimate, as shown above.
 
  • #38
Apparently, introducing multiplicative noise to x before OLS is equivalent to doing ridge regression, which is a form of regularized linear regression where the model is biased to reduce the variance. The intuition is that adding noise to x stretches out the distribution of x values, which in turn reduces the slope of the model.

http://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html

It is interesting, I guess, because as the slides I linked show, OLS doesn't really care about the variance in y so long as it is constant and normally distributed. But variance in x will lead to increased model bias (and reduced model variance).

So, for univariate OLS, the choice of using the lower- or higher-variance variable as the independent variable is actually an act of making a bias-variance tradeoff? And in general, you can always add noise to the independent variables if you want to increase bias / reduce model complexity / reduce model variance?
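Here is a quick numerical sketch of the shrinkage effect being described, using additive noise on the predictor (all of the model choices, noise levels, and the penalty value are mine, and it is meant only to show the qualitative attenuation, not to reproduce the linked derivation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Clean predictor and noisy response; true slope 2, intercept 5 (illustrative choices).
x = np.linspace(0, 1, 201)
y = 2 * x + 5 + rng.normal(0, 0.1, x.size)

def ols_slope(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

print("OLS slope, clean x:          ", ols_slope(x, y))

# Adding noise to the predictor attenuates the fitted slope (shrinks it toward 0),
# qualitatively similar to the shrinkage that a ridge penalty applies.
x_noisy = x + rng.normal(0, 0.3, x.size)
print("OLS slope, noisy x:          ", ols_slope(x_noisy, y))

# Ridge on the clean, centered data also shrinks the slope.
lam = 5.0
xc, yc = x - x.mean(), y - y.mean()
print("Ridge slope (lam=5), clean x:", (xc @ yc) / (xc @ xc + lam))
```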
 
  • #39
Jarvis323 said:
So, for univariate OLS, the choice of using the lower or higher variance variable as the independent variable is actually an act of making a bias-variance tradeoff?
In certain circumstances deliberately allowing a small amount of bias can improve the overall error. But it has to be done in a controlled fashion because as you add data the variance decreases so you have to decrease the allowable bias to keep the minimum error condition. So this situation is an uncontrolled introduction of bias that can't be used to minimize total error.
 
  • #40
Dale said:
It is well-known that typically there are biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
 
  • #41
FactChecker said:
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
I agree with that
 
  • #42
When the wrong SSE is minimized on the sample, e.g. ##\sum (y_i-\hat{y_i})^2## rather than ##\sum (x_i-\hat{x_i})^2##, the difference can be very large; it depends on the relationship between the variables. On the other hand, I would assume that the known uncertainty in practical measurement instruments is within a reasonable limit. I do not think that it should normally be the driver in the decision of which linear regression to use.
I admit there might be exceptions, like the articles that concentrate on the problems of overtraining neural networks and of trimming regressions that have too many terms. Those are other subjects.
 
  • #43
FactChecker said:
I do not think that it should be the driver in the decision of which linear regression should be used.
Why not? Linear regression makes some assumptions. Why would you not use the violation of those assumptions as a driver in the decision? Particularly given the large biases that can be introduced by violating those assumptions. What other criteria could be a better driver?
 
  • #44
Dale said:
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important. To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say. I think it is basic and not rocket science. If you recommend something else, I am skeptical. I will leave any further discussion to others.
 
  • #45
FactChecker said:
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important
The way to determine if you are actually “getting the fundamentals right” is to check the facts. In this case you are ignoring a key part of the fundamentals: the assumptions. By ignoring the assumptions you are not in fact getting the fundamentals right, as the evidence above clearly shows.

FactChecker said:
To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say.
And the evidence says that your recommendation does not always “give a good regression estimate”. By ignoring the assumptions and thus not getting the fundamentals right, your approach can introduce bias.

FactChecker said:
If you recommend something else, I am skeptical.
Why are you skeptical? The data is clear. Just check the facts.
 
  • #46
Changing the variable of interest (to be estimated) just because some other variable has a larger measurement uncertainty is a bold thing to recommend. What if the two variables are not strongly correlated? What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable? You could be looking at the completely wrong variable. I would like to see any references that recommend changing the variable of interest.
If you want to go into advanced theories, please be careful that they apply and can reasonably be recommended.
 
  • #47
FactChecker said:
Changing the variable of interest just because some other variable has a larger measurement uncertainty is a bold thing to recommend.
It is easy to be bold when you have data to support it.

FactChecker said:
What if the two variables are not strongly correlated?
With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue.

FactChecker said:
What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable?
In that case you would need to use a full errors-in-variables model for your analysis.

FactChecker said:
I would like to see any references that recommend changing the variable of interest.
Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
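For anyone curious, a sketch of a basic Deming estimator is below. The slope formula is the standard closed form found in references on the method; the function name, the convention used for the error-variance ratio, and the test data are my own choices, so treat it as illustrative rather than definitive:

```python
import numpy as np

def deming_fit(x, y, delta):
    """Deming regression of y on x.

    delta is the assumed ratio var(error in y) / var(error in x).
    (Conventions for this ratio differ between references.)
    """
    xbar, ybar = x.mean(), y.mean()
    sxx = np.mean((x - xbar) ** 2)
    syy = np.mean((y - ybar) ** 2)
    sxy = np.mean((x - xbar) * (y - ybar))
    slope = (syy - delta * sxx
             + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = ybar - slope * xbar
    return slope, intercept

# Illustration on synthetic data with errors in both variables (my own choices).
rng = np.random.default_rng(4)
t = np.linspace(0, 1, 101)
x = t + rng.normal(0, 0.1, t.size)           # noisy "independent" variable
y = 2 * t + 5 + rng.normal(0, 0.1, t.size)   # noisy "dependent" variable

print(deming_fit(x, y, delta=1.0))  # equal error variances assumed
# With this convention, a very large delta approaches the ordinary y-on-x fit,
# while delta near 0 approaches the inverted x-on-y fit.
```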
 
  • #48
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise. Estimation theory is based on the idea that you know your data comes from a certain parametric model with some parameter. Dale's arguments make this assumption as well. In this case, the error in your model is, by assumption, strictly due to irreducible random noise and the number of data points you're working with. By assuming that you've got the right parametric model and that you trained with enough data points, you conclude that the slope of your line will match the ground-truth model that nature used to generate the data, and any bias in your model would be a deviation from that ground truth. The goal is to find the right ##\theta##, not the best ##\hat{y}##. It just happens that, with these assumptions, finding the best ##\theta## should also give you the best ##\hat{y}## and vice versa. So minimizing the SSE (which in practice can include error from model bias, model variance, and random noise in the data) is more like a means to an end, and it depends on the assumptions.

It's an interesting case to look at what happens when ##x## is noisy. The assumption that ##x## has no noise is hardly ever mentioned, yet it is hardly ever true in the real world. And noise in ##x## biases the OLS solution (meaning it throws off your estimate of ##\theta##, even as you minimize the SSE, which is bad from an estimation-theory perspective where finding ##\theta## is the goal).

But in the case you mention in the above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##, because you don't even know what parametric model nature used, if any. In this case, the goal is to minimize the error rather than to find the correct parameter. Since you don't know the true model, you need to be concerned about all of the terms in the bias-variance decomposition. This means that you really do have a reason to make a pragmatic tradeoff between bias and variance in the model, which means underfitting (more bias) vs. overfitting (less bias). And rather than relying on assumptions about the true model to gauge whether the model is good just from the training data, you have to estimate the error on separate test data. Usually you will tune the amount of bias inside a cross-validation loop and choose the amount of bias that proves to perform better in practice.
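A minimal sketch of that kind of cross-validation loop, using a single ridge-penalized slope as the tunable "amount of bias" (the data-generating model, the penalty grid, and the fold count are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: noisy response, true slope 2 and intercept 5 (illustrative only).
x = rng.uniform(0, 1, 200)
y = 2 * x + 5 + rng.normal(0, 0.5, x.size)

def ridge_fit(x, y, lam):
    """Ridge-penalized slope/intercept for a single centered predictor."""
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc + lam)
    return slope, y.mean() - slope * x.mean()

# Fixed 5-fold split so every penalty value is scored on the same folds.
idx = rng.permutation(x.size)
folds = np.array_split(idx, 5)

def cv_mse(lam):
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        slope, intercept = ridge_fit(x[train], y[train], lam)
        pred = slope * x[fold] + intercept
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(f"lambda = {lam:5.1f}   CV MSE = {cv_mse(lam):.4f}")
```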
 
  • #49
Dale said:
It is easy to be bold when you have data to support it. With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue. In that case you would need to use a full errors-in-variables model for your analysis. Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
I am not sure that I would consider one MC simulation based on one model "data to support it". Maybe I would with a thousand simulations and some analysis of the model parameters involved to indicate what conditions imply certain conclusions. I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
I appreciate the reference to Deming regression. I can see that specific values of ##\delta## give the two regressions and principal components.
 
  • #50
Jarvis323 said:
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise.
Jarvis323 said:
But in the case you mention in above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##.
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.
 