B Comparing Approaches: Linear Regression of Y on X vs X on Y

  • Thread starter: FactChecker
  • Tags: Axes, Regression
Summary
The discussion centers on the comparison between linear regression of Y on X versus X on Y, emphasizing that the choice of model should depend on the intended use of the data and on minimizing the correct errors. It highlights that treating the noisier variable as the dependent variable can yield better estimates, as demonstrated through Monte Carlo simulations, where the model with the larger-noise variable on the left-hand side produced more accurate regression coefficients. The conversation critiques the common assumption that independent variables have no error, noting that violating it can lead to biased estimates. Ultimately, it argues for selecting the regression model based on the characteristics of the data rather than solely on minimizing squared errors. The insights underscore the complexity of regression analysis and the need for careful consideration of model assumptions.
  • #31
FactChecker said:
Maybe we are not talking about the same thing.
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
 
  • #32
Dale said:
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate the two: how much variation is due strictly to a linear model, ##X=aY+b##, and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best-fitting line, the one that minimizes ##\sum(x_i-\hat{x_i})^2##, and to assume that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be attributed to ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
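This SSE-minimizing property is easy to check numerically. A minimal sketch (the data-generating model ##X = 2Y + 5 + \epsilon## and the noise level are my assumptions, not the thread's actual simulation): the direct X-on-Y fit always leaves a residual SSE in x no larger than the inverted Y-on-X fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: X = 2*Y + 5 + noise (model and noise level assumed)
y = np.linspace(0, 1, 101)
x = 2 * y + 5 + rng.normal(0, 0.5, size=y.size)

# Regression of X on Y: minimizes sum((x_i - xhat_i)^2) by construction
a_xy, b_xy = np.polyfit(y, x, 1)
sse_x = np.sum((x - (a_xy * y + b_xy)) ** 2)

# Regression of Y on X, then inverted to predict x from y
a_yx, b_yx = np.polyfit(x, y, 1)
sse_x_inverted = np.sum((x - (y - b_yx) / a_yx) ** 2)

# The direct fit leaves the smallest residual SSE in x among all lines
print(sse_x <= sse_x_inverted)  # True
```

The inverted line is itself an affine function of y, so it belongs to the family the direct fit optimizes over; the inequality is therefore guaranteed, which is exactly the point being made above.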
 
  • #33
FactChecker said:
But the separation of those two causes of variation is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Maybe this is related to the discussion?

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides non-biased regression estimates, the lower variance solutions produced by regularization techniques provide superior MSE performance.

https://en.wikipedia.org/wiki/Bias–variance_tradeoff#Bias–variance_decomposition_of_mean_squared_error

https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55
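The quoted point about regularization can be illustrated with the closed-form ridge solution; the toy data and the penalty ##\lambda = 10## here are arbitrary choices for the sketch, not from the linked articles.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design matrix and response (values are illustrative assumptions)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1.0, size=50)

# OLS: unbiased, solves (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: adds lambda*I to X'X, introducing bias but reducing variance
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The ridge estimate is shrunk toward zero relative to OLS
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```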
 
  • #34
Jarvis323 said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.
 
  • #35
FactChecker said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.

It's a frustratingly confusing subject for me for some reason.

This image is helpful.

[Image: illustration of the bias–variance tradeoff]


https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577

For linear regression, if the assumptions hold,

[Image: slide on OLS estimator properties from the linked Berkeley lecture notes]


https://people.eecs.berkeley.edu/~jegonzal/assets/slides/linear_regression.pdf
 
  • #36
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent.
The uncertainty in the measurement is usually explicitly stated in the documentation from the manufacturer of a measurement device. Typically it is one of their specifications, so it usually is not too difficult to find.

FactChecker said:
Anything else is worse.
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.

Amongst actual statisticians, the preferred estimator is usually a minimum-variance unbiased estimator, that is, out of all of the unbiased estimators, the one with the minimum variance. Typically any unbiased estimator is preferred over a biased estimator, even when the biased estimator has less variance. It is well known that there are typically biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
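A standard concrete example of this (my addition, not from the thread): the divisor-##n## variance estimator is biased, but for normal data it has lower variance, and even lower mean squared error, than the unbiased divisor-##(n-1)## estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimating the variance of a normal distribution from n=10 samples.
# Divisor n-1 (ddof=1) is unbiased; divisor n (ddof=0) is biased but,
# for normal data, has lower mean squared error.
true_var = 4.0
n = 10
samples = rng.normal(0, np.sqrt(true_var), size=(100_000, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divisor n-1
s2_biased = samples.var(axis=1, ddof=0)    # divisor n

mse_unbiased = np.mean((s2_unbiased - true_var) ** 2)
mse_biased = np.mean((s2_biased - true_var) ** 2)

print(mse_biased < mse_unbiased)  # True for normal data
```

Despite the lower MSE of the biased version, the ##n-1## estimator is the conventional default, which is the preference described above.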
 
  • #37
Jarvis323 said:
Maybe this is related to the discussion?
Jarvis323 said:
For linear regression, if the assumptions hold,
Certainly that second one is directly related. In particular, the statement that the OLS is a minimum-variance unbiased estimator depends on the assumptions holding. The issue that I am identifying is the result of one of those assumptions not holding. In such a situation you get a strongly biased estimate, as shown above.
 
  • #38
Apparently, introducing multiplicative noise to x before OLS is equivalent to doing ridge regression, which is a form of regularized linear regression where the model is biased to reduce the variance. The intuition is that adding noise to x stretches out the distribution of x values, which in turn reduces the slope of the model.

http://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html

It is interesting because, as the slides I linked show, OLS doesn't really care about the variance in y so long as it's constant and normally distributed. But variance in x will lead to increased model bias (and reduced model variance).

So, for univariate OLS, the choice of using the lower or higher variance variable as the independent variable is actually an act of making a bias-variance tradeoff? And in general, you can always add noise to the independent variables if you want to, to increase bias/reduce model complexity/reduce model variance?
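The attenuation effect described in the linked post can be sketched directly (the true model ##y = 2x + 5## and the noise levels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# True model assumed: y = 2x + 5, with small noise in y
x_true = np.linspace(0, 1, 101)
y = 2 * x_true + 5 + rng.normal(0, 0.1, size=x_true.size)

slope_clean, _ = np.polyfit(x_true, y, 1)

# Now observe x with substantial measurement noise before fitting
x_noisy = x_true + rng.normal(0, 0.5, size=x_true.size)
slope_noisy, _ = np.polyfit(x_noisy, y, 1)

# The slope shrinks roughly by var(x) / (var(x) + var(noise)),
# i.e. noise in x biases the OLS slope toward zero
print(slope_noisy < slope_clean)  # attenuated
```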
 
  • #39
Jarvis323 said:
So, for univariate OLS, the choice of using the lower or higher variance variable as the independent variable is actually an act of making a bias-variance tradeoff?
In certain circumstances deliberately allowing a small amount of bias can improve the overall error. But it has to be done in a controlled fashion because as you add data the variance decreases so you have to decrease the allowable bias to keep the minimum error condition. So this situation is an uncontrolled introduction of bias that can't be used to minimize total error.
 
  • #40
Dale said:
It is well-known that typically there are biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
 
  • #41
FactChecker said:
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
I agree with that
 
  • #42
Minimizing the wrong SSE in the sample, e.g. minimizing ##\sum (y_i-\hat{y_i})^2## rather than ##\sum (x_i-\hat{x_i})^2##, can make a very large difference; it depends on the relationship between the variables. On the other hand, I would assume that the known uncertainty in practical measurement instruments is within a reasonable limit. I do not think that it should normally be the driver in the decision of which linear regression to use.
I admit there might be exceptions, like the articles concentrating on the problems of overtraining neural networks and of reducing the number of terms in an over-parameterized regression. Those are other subjects.
 
  • #43
FactChecker said:
I do not think that it should be the driver in the decision of which linear regression should be used.
Why not? Linear regression makes some assumptions. Why would you not use the violation of those assumptions as a driver in the decision? Particularly given the large biases that can be introduced by violating those assumptions. What other criteria could be a better driver?
 
  • #44
Dale said:
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important. To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say. I think it is basic and not rocket science. If you recommend something else, I am skeptical. That is all I have to say on this subject and will leave any further discussion to others.
 
  • #45
FactChecker said:
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important
The way to determine if you are actually “getting the fundamentals right” is to check the facts. In this case you are ignoring a key part of the fundamentals: the assumptions. By ignoring the assumptions you are not in fact getting the fundamentals right, as the evidence above clearly shows.

FactChecker said:
To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say.
And the evidence says that your recommendation does not always “give a good regression estimate”. By ignoring the assumptions and thus not getting the fundamentals right, your approach can introduce bias.

FactChecker said:
If you recommend something else, I am skeptical.
Why are you skeptical? The data is clear. Just check the facts
 
  • #46
Changing the variable of interest (to be estimated) just because some other variable has a larger measurement uncertainty is a bold thing to recommend. What if the two variables are not strongly correlated? What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable? You could be looking at the completely wrong variable. I would like to see any references that recommend changing the variable of interest.
If you want to go into advanced theories, please be careful that they apply and can reasonably be recommended.
 
  • #47
FactChecker said:
Changing the variable of interest just because some other variable has a larger measurement uncertainty is a bold thing to recommend.
It is easy to be bold when you have data to support it.

FactChecker said:
What if the two variables are not strongly correlated?
With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue.

FactChecker said:
What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable?
In that case you would need to use a full errors-in-variables model for your analysis.

FactChecker said:
I would like to see any references that recommend changing the variable of interest.
Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
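For illustration, a sketch of the Deming slope formula (the convention that ##\delta## is the ratio of the ##y##-error variance to the ##x##-error variance is an assumption here; conventions vary between references). In the two limits it recovers the two ordinary regressions:

```python
import numpy as np

def deming_slope(x, y, delta):
    """Deming regression slope; delta = ratio of the y-error variance
    to the x-error variance (convention assumed, check your reference)."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4 * delta * sxy * sxy)) / (2 * sxy)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 101)
y = 2 * x + 5 + rng.normal(0, 0.1, size=x.size)

ols_yx = np.polyfit(x, y, 1)[0]        # ordinary regression of y on x
inv_xy = 1.0 / np.polyfit(y, x, 1)[0]  # x-on-y regression, inverted

# delta -> infinity recovers OLS of y on x;
# delta -> 0 recovers the switched-axes (x on y) regression
print(np.isclose(deming_slope(x, y, 1e9), ols_yx))   # True
print(np.isclose(deming_slope(x, y, 1e-9), inv_xy))  # True
```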
 
  • #48
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise. Estimation theory is based on the idea that you know your data comes from a certain parametric model, with some parameter. Dale's arguments make this assumption as well. In this case, the error in your model is by assumption strictly due to irreducible random noise and the number of data points you're working with. Assuming you've got the right parametric model, and that you trained with enough data points, you conclude that the slope of your line will match the ground-truth model that nature used to generate the data. Any bias in your model would be a deviation from that ground-truth model.

The goal is to find the right ##\theta##, not the best ##\hat{y}##. It just happens that, with these assumptions, finding the best ##\theta## should also give you the best ##\hat{y}## and vice versa. So minimizing the SSE (which in practice can include error from model bias, model variance, and random noise in the data) is more like a means to an end, and depends on the assumptions. It's an interesting case to look at what happens when ##x## is noisy. The assumption that ##x## has no noise is hardly ever mentioned, and hardly ever true in the real world. And noise in ##x## biases the OLS solution (meaning it throws off your estimate of ##\theta## even as you minimize the SSE, which is bad from an estimation-theory perspective where finding ##\theta## is the goal).

But in the case you mention in the above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##, because you don't even know what parametric model nature used, if any. In this case, the goal is to minimize the error rather than to find the correct parameter. Since you don't know the true model, you need to be concerned about all of the terms in the bias-variance decomposition. This means that you really do have a reason to make a pragmatic tradeoff between bias and variance in the model, which means underfitting (more bias) vs overfitting (less bias). And rather than relying on assumptions about the true model to gauge whether the model is good just from the training data, you have to estimate the error on separate test data. Usually you will tune the amount of bias inside of a cross-validation loop, and choose the amount of bias that proves to perform better in practice.
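A minimal sketch of that prediction-style workflow (the toy data, the candidate ##\lambda## grid, and the closed-form ridge fit are my choices): the amount of bias is chosen by held-out error rather than by appeal to a known true model.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with a couple of real predictors and several irrelevant ones
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0, 1.0, size=200)

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution; lam=0 reduces to OLS
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Pick the amount of bias (lam) by held-out test error, not training error
test_mse = {lam: np.mean((X_test @ ridge_fit(X_train, y_train, lam) - y_test) ** 2)
            for lam in [0.0, 1.0, 10.0, 100.0]}
best_lam = min(test_mse, key=test_mse.get)
print(best_lam, test_mse[best_lam])
```

In practice a cross-validation loop replaces the single train/test split, but the logic is the same.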
 
  • #49
Dale said:
It is easy to be bold when you have data to support it.

With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue.

In that case you would need to use a full errors-in-variables model for your analysis.

Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
I am not sure that I would consider one MC simulation based on one model "data to support it". Maybe I would with a thousand simulations and some analysis of the model parameters involved to indicate what conditions imply certain conclusions. I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
I appreciate the reference to Deming regression. I can see that specific values of ##\delta## give the two regressions and principal components.
 
  • #50
Jarvis323 said:
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise.
Jarvis323 said:
But in the case you mention in above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##.
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.
 
  • #51
FactChecker said:
I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
As can I, which is why I have repeatedly stated that I do not advocate blind application of this approach. It is clearly beneficial in some circumstances, but I do not have a test or a formulaic method.

The key point is to recognize the often-ignored assumption in OLS regression that there is no measurement uncertainty in the independent variable. When that assumption is severely violated, standard OLS can produce biased estimates. If the dependent variable has little measurement uncertainty then a potential easy remedy is to switch the axes. All of those steps to be taken judiciously.
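A small Monte Carlo sketch of that remedy (the model ##y = 2x + 5##, the noise levels, and the replication count are my choices): with large noise in ##x## and little in ##y##, the direct fit is attenuated while the switched-axes fit recovers the slope.

```python
import numpy as np

rng = np.random.default_rng(7)

# x measured with large noise, y with small noise; true model y = 2x + 5
slopes_direct, slopes_switched = [], []
for _ in range(500):
    x_true = np.linspace(0, 1, 101)
    x_obs = x_true + rng.normal(0, 0.3, size=x_true.size)
    y_obs = 2 * x_true + 5 + rng.normal(0, 0.01, size=x_true.size)

    slopes_direct.append(np.polyfit(x_obs, y_obs, 1)[0])          # y on noisy x
    slopes_switched.append(1.0 / np.polyfit(y_obs, x_obs, 1)[0])  # x on y, inverted

# Direct fit is biased toward zero; switched fit averages close to 2
print(np.mean(slopes_direct), np.mean(slopes_switched))
```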
 
  • #52
FactChecker said:
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.

Imagine two somewhat extreme cases,

(1) You are a scientist seeking explainable knowledge about the nature of an observable. Your goal is to find something very close to the true model that nature uses, and you don't care whether it will get you rich in the short term. Then you will use what you know and try to come up with a parametric family of models you think should be the right one. Maybe you use all kinds of scientific theory to guess what that should be; you will carefully consider the assumptions, and you will try to choose a simple model using Occam's razor. Having assumed you've got the right family, you need to fit the model's parameters ##\theta## to the data. This is a clear case where you are doing parameter estimation. And this is fine; the only drawback is that you've basically got to know the system you're predicting down to a science. Scientists tend to like this approach for obvious reasons.

(2) You don't care much about science or knowledge, and you either don't know or don't care what the true model is that nature uses. You don't know how, can't, or don't have time to figure out what assumptions you can rely on. You only care about getting results. Then you will probably just start with a complicated model, throwing every possible predictor at the problem you can think of as a candidate variable (e.g. day of the week, temperature, day the groundhog woke up, the alignments of the planets, the lengths and titles of the headline news articles, etc.). Then you try to simultaneously simplify the model and minimize the error you get on test data. Simplifying the model usually means reducing the number of variables, combining the variables using PCA, or using regularization to force the weights of redundant or poorly performing variables toward 0. The more you simplify the model, the more you bias it, and the more you reduce the variance of the model (which is how much the model changes if you train it on different subsets). So a more biased model is more stable/consistent. The reason that simplifying the model reduces the variance is that each additional variable introduces its own noise, and a complex/unbiased model has more freedom to fit to that noise. Because of the curse of dimensionality, the more variables you have, the greater the chance that the random noise alone is a perfect predictor of ##y## on the training data, and that chance increases rapidly.

In fact, it can be demonstrated that, at some level of model complexity relative to sample size (common in neuroscience, for example), you can make a perfect predictor of the training data using only variables which are purely random noise with absolutely no true correlation with ##y##. For example, you could take each pixel of a TV displaying static noise to be a separate variable, with each frame as one data point.
With a small enough number of frames, your unbiased model will perfectly fit the price of the stock market (in your training data), but on your testing data it will have no predictive power at all. And if you had one additional variable that was a true predictor, it would be drowned out by all of the by-chance correlations, and you would still get a terrible model. Since you have so many pixels, it will require a large number of frames before your model starts fitting to that one true predictor rather than to some combination of those random fluctuations.

One major problem with approach (2) is that you often end up with a model which you can't explain. It might also be disconnected from any established theory about the domain and too troublesome to try to integrate with theory. It also depends on having large amounts of data in order to avoid overfitting/underfitting, especially as you add more variables.

Scientists also end up using an approach closer to (2) more and more nowadays, because the first approach is hard, and some data scientists are interested in highly complex data that seems to have no simple model. And in some cases, people have decided that results matter even if we don't know why, as in clinical diagnosis and in the development of more efficient technology. Neuroscience is a good example. Now that people are throwing massive amounts of data at such problems, however, people are also trying to think about how to use this approach and still end up with an explainable model (have their cake and eat it too).

In reality, people tend to use an approach somewhere in between. For example, you might know that the phases of the moon are unlikely to predict the stock market, so you will exclude them as candidate variables from the start. Everything you know about which variables might be better predictors than others can help to reduce the number of variables up front, which in turn reduces the amount of data you need and reduces the risk of fitting to noise.
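The pure-noise-predictor point above can be demonstrated in a few lines (the dimensions are arbitrary choices): with more noise variables than data points, least squares fits the training targets exactly yet generalizes no better than guessing.

```python
import numpy as np

rng = np.random.default_rng(8)

# p >> n: more pure-noise predictors ("TV static") than data points
n, p = 20, 200
X_train = rng.normal(size=(n, p))  # noise predictors, unrelated to y
y_train = rng.normal(size=n)       # target, unrelated to X by construction

# Minimum-norm least squares solution; with n < p the fit is exact
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_err = np.max(np.abs(X_train @ beta - y_train))

# On fresh data the "perfect" model has no predictive power
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)
test_mse = np.mean((X_test @ beta - y_test) ** 2)

print(train_err)  # essentially zero: perfect fit to pure noise
print(test_mse)   # comparable to predicting blindly
```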
 
  • #53
Dale said:
I used ##x## going from 0 to 1 in steps of 0.01, with a true regression of ##y=2 x + 5##.

It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
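A sketch of that proposed simulation (reusing the thread's ##y = 2x + 5## model; the noise levels are assumptions). Note that with this design the measurement noise effectively ends up in the dependent variable, so the fitted slope is not attenuated; this setup is known as a Berkson error model.

```python
import numpy as np

rng = np.random.default_rng(6)

# The *observed* x is set on a systematic grid by the experimenter;
# the true x is the observed value minus gaussian noise, as proposed
x_obs = np.linspace(0, 1, 101)
x_true = x_obs - rng.normal(0, 0.5, size=x_obs.size)

# Nature generates y from the true x (y = 2x + 5 as in earlier posts)
y = 2 * x_true + 5 + rng.normal(0, 0.1, size=x_obs.size)

# Regressing y on the grid values: y = 2*x_obs + 5 + (noise terms),
# so the slope estimate is unbiased, unlike the classical noisy-x case
slope, intercept = np.polyfit(x_obs, y, 1)
print(slope, intercept)
```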
 
  • #54
Stephen Tashi said:
It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
IMHO, there are a lot of things to experiment with and analyze. The situation is much more complicated than just saying that the variable with the most measurement uncertainty should be the dependent variable. It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated.
 
  • #55
FactChecker said:
It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated
A violation of the assumptions of any method is a serious justification for using a different method.
 
  • #56
Dale said:
A violation of the assumptions of any method is a serious justification for using a different method.
Nope. The measurement uncertainty may very often be negligible compared to the physical random component of the process. In that case, switching the dependent variable in linear regression is the "tail wagging the dog".
 
  • #57
FactChecker said:
The measurement uncertainty may very often be negligible compared to the physical random component of the process.
Sure, then that is not a violation of the assumption.
 
  • #58
Dale said:
Sure, then that is not a violation of the assumption.
That was not your answer in the original post. You said:
"The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable."
That is just not true, and bad advice in general.
EDIT: It just occurred to me that there may be a significant cultural difference between fields where the main problem is in getting accurate measurements versus fields where the uncertainty of measurements is secondary to the effects of external random physical processes. In that first case, your advice may be the best thing to consider immediately. That was not true in my field. In my field, the external random influences dominated. (There were still significant problems with measurements in certain areas that were often dealt with using Kalman filters. I was not involved with that and do not know much about it.)
 
  • #59
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
 
  • #60
Dale said:
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
Yes. I think I am beginning to understand the different perspectives on this issue.
 
