B Comparing Approaches: Linear Regression of Y on X vs X on Y

  • Thread starter: FactChecker
  • Tags: Axes, Regression
Summary
The discussion centers on the comparison between linear regression of Y on X versus X on Y, emphasizing that the choice of model should depend on the intended use of the data and on minimizing the correct errors. It highlights that treating the noisier variable as the dependent variable can yield better estimates, as demonstrated through Monte Carlo simulations, where the model with the larger-noise variable on the left-hand side produced more accurate regression coefficients. The conversation critiques the common assumption that independent variables have no error, noting that violating it can lead to biased estimates. Ultimately, it argues for selecting the regression model based on the characteristics of the data rather than solely on minimizing squared errors. The insights underscore the complexity of regression analysis and the need for careful consideration of model assumptions.
  • #31
FactChecker said:
Maybe we are not talking about the same thing.
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
 
  • #32
Dale said:
We are not. I am talking about the uncertainty in the measurements themselves. I.e. the “noise” standard deviation. Not the standard deviation of the dataset which includes both “noise” standard deviation and “signal”.
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate the two: how much variation is due strictly to a linear model, ##X=aY+b##, and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best-fitting line, the one that minimizes ##\sum(x_i-\hat{x_i})^2##, and to assume that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be attributed to ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
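This SSE-minimizing property is easy to check numerically. A minimal sketch (the data-generating model ##X = 2Y + 5 + \epsilon## and the noise level are my assumptions, not the thread's actual simulation): the direct X-on-Y fit always leaves a residual SSE in x no larger than the inverted Y-on-X fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: X = 2*Y + 5 + noise (model and noise level assumed)
y = np.linspace(0, 1, 101)
x = 2 * y + 5 + rng.normal(0, 0.5, size=y.size)

# Regression of X on Y: minimizes sum((x_i - xhat_i)^2) by construction
a_xy, b_xy = np.polyfit(y, x, 1)
sse_x = np.sum((x - (a_xy * y + b_xy)) ** 2)

# Regression of Y on X, then inverted to predict x from y
a_yx, b_yx = np.polyfit(x, y, 1)
sse_x_inverted = np.sum((x - (y - b_yx) / a_yx) ** 2)

# The direct fit leaves the smallest residual SSE in x among all lines
print(sse_x <= sse_x_inverted)  # True
```

The inverted line is itself an affine function of y, so it belongs to the family the direct fit optimizes over; the inequality is therefore guaranteed, which is exactly the point being made above.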
 
  • #33
FactChecker said:
But the separation of those two causes of variation is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Maybe this is related to the discussion?

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides non-biased regression estimates, the lower variance solutions produced by regularization techniques provide superior MSE performance.

https://en.wikipedia.org/wiki/Bias–variance_tradeoff#Bias–variance_decomposition_of_mean_squared_error

https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55
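The quoted point about regularization can be illustrated with the closed-form ridge solution; the toy data and the penalty ##\lambda = 10## here are arbitrary choices for the sketch, not from the linked articles.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design matrix and response (values are illustrative assumptions)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1.0, size=50)

# OLS: unbiased, solves (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: adds lambda*I to X'X, introducing bias but reducing variance
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The ridge estimate is shrunk toward zero relative to OLS
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))  # True
```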
 
  • #34
Jarvis323 said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.
 
  • #35
FactChecker said:
I don't think that it applies. Those articles talk about the problems of "overfitting" the data and "overtraining" neural networks. I think they are about methods to limit the number of terms in a regression so that it does not overfit the data. That is not our problem here. But I must admit that I don't really know anything about the subjects in those articles.

It's a frustratingly confusing subject for me for some reason.

This image is helpful.

[Image: illustration of the bias–variance tradeoff]


https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577

For linear regression, if the assumptions hold,

[Image: slide on OLS estimator properties from the linked Berkeley lecture notes]


https://people.eecs.berkeley.edu/~jegonzal/assets/slides/linear_regression.pdf
 
  • #36
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent.
The uncertainty in the measurement is usually explicitly stated in the documentation from the manufacturer of a measurement device. Typically it is one of their specifications, so it usually is not too difficult to find.

FactChecker said:
Anything else is worse.
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.

Amongst actual statisticians, the preferred estimator is usually a minimum-variance unbiased estimator, that is, out of all of the unbiased estimators, the one with the minimum variance. Typically any unbiased estimator is preferred over a biased estimator, even when the biased estimator has less variance. It is well known that there are typically biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
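A standard concrete example of this (my addition, not from the thread): the divisor-##n## variance estimator is biased, but for normal data it has lower variance, and even lower mean squared error, than the unbiased divisor-##(n-1)## estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimating the variance of a normal distribution from n=10 samples.
# Divisor n-1 (ddof=1) is unbiased; divisor n (ddof=0) is biased but,
# for normal data, has lower mean squared error.
true_var = 4.0
n = 10
samples = rng.normal(0, np.sqrt(true_var), size=(100_000, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divisor n-1
s2_biased = samples.var(axis=1, ddof=0)    # divisor n

mse_unbiased = np.mean((s2_unbiased - true_var) ** 2)
mse_biased = np.mean((s2_biased - true_var) ** 2)

print(mse_biased < mse_unbiased)  # True for normal data
```

Despite the lower MSE of the biased version, the ##n-1## estimator is the conventional default, which is the preference described above.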
 
  • #37
Jarvis323 said:
Maybe this is related to the discussion?
Jarvis323 said:
For linear regression, if the assumptions hold,
Certainly that second one is directly related. In particular, the statement that the OLS is a minimum-variance unbiased estimator depends on the assumptions holding. The issue that I am identifying is the result of one of those assumptions not holding. In such a situation you get a strongly biased estimate, as shown above.
 
  • #38
Apparently, introducing multiplicative noise to x before OLS is equivalent to doing ridge regression, which is a form of regularized linear regression where the model is biased to reduce the variance. The intuition is that adding noise to x stretches out the distribution of x values, which in turn reduces the slope of the model.

http://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html

It is interesting because, as the slides I linked show, OLS doesn't really care about the variance in y so long as it's constant and normally distributed. But variance in x will lead to increased model bias (and reduced model variance).

So, for univariate OLS, the choice of using the lower or higher variance variable as the independent variable is actually an act of making a bias-variance tradeoff? And in general, you can always add noise to the independent variables if you want to, to increase bias/reduce model complexity/reduce model variance?
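The attenuation effect described in the linked post can be sketched directly (the true model ##y = 2x + 5## and the noise levels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# True model assumed: y = 2x + 5, with small noise in y
x_true = np.linspace(0, 1, 101)
y = 2 * x_true + 5 + rng.normal(0, 0.1, size=x_true.size)

slope_clean, _ = np.polyfit(x_true, y, 1)

# Now observe x with substantial measurement noise before fitting
x_noisy = x_true + rng.normal(0, 0.5, size=x_true.size)
slope_noisy, _ = np.polyfit(x_noisy, y, 1)

# The slope shrinks roughly by var(x) / (var(x) + var(noise)),
# i.e. noise in x biases the OLS slope toward zero
print(slope_noisy < slope_clean)  # attenuated
```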
 
  • #39
Jarvis323 said:
So, for univariate OLS, the choice of using the lower or higher variance variable as the independent variable is actually an act of making a bias-variance tradeoff?
In certain circumstances deliberately allowing a small amount of bias can improve the overall error. But it has to be done in a controlled fashion because as you add data the variance decreases so you have to decrease the allowable bias to keep the minimum error condition. So this situation is an uncontrolled introduction of bias that can't be used to minimize total error.
 
  • #40
Dale said:
It is well-known that typically there are biased estimators with lower variance than the minimum-variance unbiased estimator, but those are not generally preferred. So your opinion that "anything else is worse" is not shared by the statistical community.
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
 
  • #41
FactChecker said:
Maybe I should have said "anything else has a greater SSE and gives worse estimates to the sample values in the metric of SSE." I think that is undeniable.
I agree with that
 
  • #42
Minimizing the wrong SSE in the sample, e.g. minimizing ##\sum (y_i-\hat{y_i})^2## rather than ##\sum (x_i-\hat{x_i})^2##, can make a very large difference; it depends on the relationship between the variables. On the other hand, I would assume that the known uncertainty in practical measurement instruments is within a reasonable limit. I do not think that it should normally be the driver in the decision of which linear regression to use.
I admit there might be exceptions, like the articles concentrating on the problems of overtraining neural networks and of reducing the number of terms in an over-parameterized regression. Those are other subjects.
 
  • #43
FactChecker said:
I do not think that it should be the driver in the decision of which linear regression should be used.
Why not? Linear regression makes some assumptions. Why would you not use the violation of those assumptions as a driver in the decision? Particularly given the large biases that can be introduced by violating those assumptions. What other criteria could be a better driver?
 
  • #44
Dale said:
For a person that chose the handle "FactChecker" you sure seem reluctant to discuss or even acknowledge any facts that run contrary to your opinion.
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important. To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say. I think it is basic and not rocket science. If you recommend something else, I am skeptical. That is all I have to say on this subject and will leave any further discussion to others.
 
  • #45
FactChecker said:
I'm sorry if you think I am stepping on your toes, but getting the fundamentals right is important
The way to determine if you are actually “getting the fundamentals right” is to check the facts. In this case you are ignoring a key part of the fundamentals: the assumptions. By ignoring the assumptions you are not in fact getting the fundamentals right, as the evidence above clearly shows.

FactChecker said:
To get a good regression estimate for X, I recommend the regression that will minimize ##\sum (x_i-\hat{x_i})^2##. That is all I have to say.
And the evidence says that your recommendation does not always “give a good regression estimate”. By ignoring the assumptions and thus not getting the fundamentals right, your approach can introduce bias.

FactChecker said:
If you recommend something else, I am skeptical.
Why are you skeptical? The data is clear. Just check the facts
 
  • #46
Changing the variable of interest (to be estimated) just because some other variable has a larger measurement uncertainty is a bold thing to recommend. What if the two variables are not strongly correlated? What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable? You could be looking at the completely wrong variable. I would like to see any references that recommend changing the variable of interest.
If you want to go into advanced theories, please be careful that they apply and can reasonably be recommended.
 
  • #47
FactChecker said:
Changing the variable of interest just because some other variable has a larger measurement uncertainty is a bold thing to recommend.
It is easy to be bold when you have data to support it.

FactChecker said:
What if the two variables are not strongly correlated?
With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue.

FactChecker said:
What if there are several other variables in a multiple linear regression that have a larger measurement uncertainty? Which one would you pick as the dependent variable?
In that case you would need to use a full errors-in-variables model for your analysis.

FactChecker said:
I would like to see any references that recommend changing the variable of interest.
Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
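For illustration, a sketch of the Deming slope formula (the convention that ##\delta## is the ratio of the ##y##-error variance to the ##x##-error variance is an assumption here; conventions vary between references). In the two limits it recovers the two ordinary regressions:

```python
import numpy as np

def deming_slope(x, y, delta):
    """Deming regression slope; delta = ratio of the y-error variance
    to the x-error variance (convention assumed, check your reference)."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4 * delta * sxy * sxy)) / (2 * sxy)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 101)
y = 2 * x + 5 + rng.normal(0, 0.1, size=x.size)

ols_yx = np.polyfit(x, y, 1)[0]        # ordinary regression of y on x
inv_xy = 1.0 / np.polyfit(y, x, 1)[0]  # x-on-y regression, inverted

# delta -> infinity recovers OLS of y on x;
# delta -> 0 recovers the switched-axes (x on y) regression
print(np.isclose(deming_slope(x, y, 1e9), ols_yx))   # True
print(np.isclose(deming_slope(x, y, 1e-9), inv_xy))  # True
```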
 
  • #48
FactChecker said:
But the separation of the total variation between those two causes is not immediately apparent. The purpose of the regression is to try to separate out the two. How much variation is due strictly to a linear model, ##X=aY+b## and how much is due to the added random behavior, ##\epsilon##, giving the complete model, ##X=aY+b +\epsilon##. To separate them, it is necessary to find the best linear line that minimizes the ##\sum(x_i-\hat{x_i})^2## and assumes that the remaining variation is random. Because the linear regression of ##X=aY+b## minimizes that SSE, it is the best linear model, in the sense that it leaves the least variation to be caused by ##\epsilon##. Anything else is worse.

Once a linear model, ##X=aY+b +\epsilon##, is determined, it implies that the random term of X (which is ##\epsilon##) and the random term of Y in the associated model ##Y=X/a-b/a -\epsilon/a## (which is ##\epsilon/a##) are in the same proportions, (1:1/a), as the signal ranges of the two.
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise. Estimation theory is based on the idea that you know your data comes from a certain parametric model, with some parameter. Dale's arguments make this assumption as well. In this case, the error in your model is by assumption strictly due to irreducible random noise and the number of data points you're working with. Assuming you've got the right parametric model, and that you trained with enough data points, you conclude that the slope of your line will match the ground-truth model that nature used to generate the data. Any bias in your model would be a deviation from that ground-truth model.

The goal is to find the right ##\theta##, not the best ##\hat{y}##. It just happens that, with these assumptions, finding the best ##\theta## should also give you the best ##\hat{y}## and vice versa. So minimizing the SSE (which in practice can include error from model bias, model variance, and random noise in the data) is more like a means to an end, and depends on the assumptions. It's an interesting case to look at what happens when ##x## is noisy. The assumption that ##x## has no noise is hardly ever mentioned, and hardly ever true in the real world. And noise in ##x## biases the OLS solution (meaning it throws off your estimate of ##\theta## even as you minimize the SSE, which is bad from an estimation-theory perspective where finding ##\theta## is the goal).

But in the case you mention in the above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##, because you don't even know what parametric model nature used, if any. In this case, the goal is to minimize the error rather than to find the correct parameter. Since you don't know the true model, you need to be concerned about all of the terms in the bias-variance decomposition. This means that you really do have a reason to make a pragmatic tradeoff between bias and variance in the model, which means underfitting (more bias) vs overfitting (less bias). And rather than relying on assumptions about the true model to gauge whether the model is good just from the training data, you have to estimate the error on separate test data. Usually you will tune the amount of bias inside of a cross-validation loop, and choose the amount of bias that proves to perform better in practice.
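A minimal sketch of that prediction-style workflow (the toy data, the candidate ##\lambda## grid, and the closed-form ridge fit are my choices): the amount of bias is chosen by held-out error rather than by appeal to a known true model.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with a couple of real predictors and several irrelevant ones
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0, 1.0, size=200)

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution; lam=0 reduces to OLS
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Pick the amount of bias (lam) by held-out test error, not training error
test_mse = {lam: np.mean((X_test @ ridge_fit(X_train, y_train, lam) - y_test) ** 2)
            for lam in [0.0, 1.0, 10.0, 100.0]}
best_lam = min(test_mse, key=test_mse.get)
print(best_lam, test_mse[best_lam])
```

In practice a cross-validation loop replaces the single train/test split, but the logic is the same.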
 
  • #49
Dale said:
It is easy to be bold when you have data to support it.

With the Monte Carlo the correlation was only about ##R^2=0.25##. Not sure if you consider that strongly correlated or not. If you do, then you are welcome to do your own Monte Carlo simulation and show that this is an issue.

In that case you would need to use a full errors-in-variables model for your analysis.

Switching the variables is the same as Deming regression with ##\delta \approx 0##. That is described here

https://www.amazon.com/dp/0486646858/?tag=pfamazon01-20
I am not sure that I would consider one MC simulation based on one model "data to support it". Maybe I would with a thousand simulations and some analysis of the model parameters involved to indicate what conditions imply certain conclusions. I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
I appreciate the reference to Deming regression. I can see that specific values of ##\delta## give the two regressions and principal components.
 
  • #50
Jarvis323 said:
When you approach the problem as an estimation problem, you assume that there is no variation in the data that isn't part of the random noise.
Jarvis323 said:
But in the case you mention in above quote, it seems that it isn't strictly an estimation problem, so I think you should view it instead as a prediction problem, where your goal is to predict ##y## rather than find the "correct" ##\theta##.
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.
 
  • #51
FactChecker said:
I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
As can I, which is why I have repeatedly stated that I do not advocate blind application of this approach. It is clearly beneficial in some circumstances, but I do not have a test or a formulaic method.

The key point is to recognize the often-ignored assumption in OLS regression that there is no measurement uncertainty in the independent variable. When that assumption is severely violated, standard OLS can produce biased estimates. If the dependent variable has little measurement uncertainty then a potential easy remedy is to switch the axes. All of those steps to be taken judiciously.
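A small Monte Carlo sketch of that remedy (the model ##y = 2x + 5##, the noise levels, and the replication count are my choices): with large noise in ##x## and little in ##y##, the direct fit is attenuated while the switched-axes fit recovers the slope.

```python
import numpy as np

rng = np.random.default_rng(7)

# x measured with large noise, y with small noise; true model y = 2x + 5
slopes_direct, slopes_switched = [], []
for _ in range(500):
    x_true = np.linspace(0, 1, 101)
    x_obs = x_true + rng.normal(0, 0.3, size=x_true.size)
    y_obs = 2 * x_true + 5 + rng.normal(0, 0.01, size=x_true.size)

    slopes_direct.append(np.polyfit(x_obs, y_obs, 1)[0])          # y on noisy x
    slopes_switched.append(1.0 / np.polyfit(y_obs, x_obs, 1)[0])  # x on y, inverted

# Direct fit is biased toward zero; switched fit averages close to 2
print(np.mean(slopes_direct), np.mean(slopes_switched))
```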
 
  • #52
FactChecker said:
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.

Imagine two somewhat extreme cases,

(1) You are a scientist seeking explainable knowledge about the nature of an observable. Your goal is to find something very close to the true model that nature uses, and you don't care whether it will get you rich in the short term. Then you will use what you know and try to come up with a parametric family of models you think should be the right one. Maybe you use all kinds of scientific theory to guess what that should be; you will carefully consider the assumptions, and you will try to choose a simple model using Occam's razor. Having assumed you've got the right family, you need to fit the model's parameters ##\theta## to the data. This is a clear case where you are doing parameter estimation. And this is fine; the only drawback is that you've basically got to know the system you're predicting down to a science. Scientists tend to like this approach for obvious reasons.

(2) You don't care much about science or knowledge, and you either don't know or don't care what the true model is that nature uses. You don't know how, can't, or don't have time to figure out what assumptions you can rely on. You only care about getting results. Then you will probably just start with a complicated model, throwing every possible predictor at the problem you can think of as a candidate variable (e.g. day of the week, temperature, day the groundhog woke up, the alignments of the planets, the lengths and titles of the headline news articles, etc.). Then you try to simultaneously simplify the model and minimize the error you get on test data. Simplifying the model usually means reducing the number of variables, combining the variables using PCA, or using regularization to force the weights of redundant or poorly performing variables toward 0. The more you simplify the model, the more you bias it, and the more you reduce the variance of the model (which is how much the model changes if you train it on different subsets). So a more biased model is more stable/consistent. The reason that simplifying the model reduces the variance is that each additional variable introduces its own noise, and a complex/unbiased model has more freedom to fit to that noise. Because of the curse of dimensionality, the more variables you have, the greater the chance that the random noise alone is a perfect predictor of ##y## on the training data, and that chance increases rapidly.

In fact, it can be demonstrated that, at some level of model complexity relative to sample size (common in neuroscience, for example), you can make a perfect predictor of the training data using only variables which are purely random noise with absolutely no true correlation with ##y##. For example, you could take each pixel of a TV displaying static noise to be a separate variable, with each frame as one data point.
With a small enough number of frames, your unbiased model will perfectly fit the price of the stock market (in your training data), but on your testing data it will have no predictive power at all. And if you had one additional variable that was a true predictor, it would be drowned out by all of the by-chance correlations, and you would still get a terrible model. Since you have so many pixels, it will require a large number of frames before your model starts fitting to that one true predictor rather than to some combination of those random fluctuations.

One major problem with approach (2) is that you often end up with a model which you can't explain. It might also be disconnected from any established theory about the domain and too troublesome to try to integrate with theory. It also depends on having large amounts of data in order to avoid overfitting/underfitting, especially as you add more variables.

Scientists also end up using an approach closer to (2) more and more nowadays, because the first approach is hard, and some data scientists are interested in highly complex data that seems to have no simple model. And in some cases, people have decided that results matter even if we don't know why, as in clinical diagnosis and in the development of more efficient technology. Neuroscience is a good example. Now that people are throwing massive amounts of data at such problems, however, people are also trying to think about how to use this approach and still end up with an explainable model (have their cake and eat it too).

In reality, people tend to use an approach somewhere in between. For example, you might know that the phases of the moon are unlikely to predict the stock market, so you will exclude them as candidate variables from the start. Everything you know about which variables might be better predictors than others can help to reduce the number of variables up front, which in turn reduces the amount of data you need and reduces the risk of fitting to noise.
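The pure-noise-predictor point above can be demonstrated in a few lines (the dimensions are arbitrary choices): with more noise variables than data points, least squares fits the training targets exactly yet generalizes no better than guessing.

```python
import numpy as np

rng = np.random.default_rng(8)

# p >> n: more pure-noise predictors ("TV static") than data points
n, p = 20, 200
X_train = rng.normal(size=(n, p))  # noise predictors, unrelated to y
y_train = rng.normal(size=n)       # target, unrelated to X by construction

# Minimum-norm least squares solution; with n < p the fit is exact
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_err = np.max(np.abs(X_train @ beta - y_train))

# On fresh data the "perfect" model has no predictive power
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)
test_mse = np.mean((X_test @ beta - y_test) ** 2)

print(train_err)  # essentially zero: perfect fit to pure noise
print(test_mse)   # comparable to predicting blindly
```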
 
  • #53
Dale said:
I used ##x## going from 0 to 1 in steps of 0.01, with a true regression of ##y=2 x + 5##.

It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
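A sketch of that proposed simulation (reusing the thread's ##y = 2x + 5## model; the noise levels are assumptions). Note that with this design the measurement noise effectively ends up in the dependent variable, so the fitted slope is not attenuated; this setup is known as a Berkson error model.

```python
import numpy as np

rng = np.random.default_rng(6)

# The *observed* x is set on a systematic grid by the experimenter;
# the true x is the observed value minus gaussian noise, as proposed
x_obs = np.linspace(0, 1, 101)
x_true = x_obs - rng.normal(0, 0.5, size=x_obs.size)

# Nature generates y from the true x (y = 2x + 5 as in earlier posts)
y = 2 * x_true + 5 + rng.normal(0, 0.1, size=x_obs.size)

# Regressing y on the grid values: y = 2*x_obs + 5 + (noise terms),
# so the slope estimate is unbiased, unlike the classical noisy-x case
slope, intercept = np.polyfit(x_obs, y, 1)
print(slope, intercept)
```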
 
  • #54
Stephen Tashi said:
It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
IMHO, there are a lot of things to experiment with and analyze. The situation is much more complicated than just saying that the variable with the most measurement uncertainty should be the dependent variable. It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated.
 
  • #55
FactChecker said:
It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated
A violation of the assumptions of any method is a serious justification for using a different method.
 
  • #56
Dale said:
A violation of the assumptions of any method is a serious justification for using a different method.
Nope. The measurement uncertainty may very often be negligible compared to the physical random component of the process. In that case, switching the dependent variable in linear regression is the "tail wagging the dog".
 
  • #57
FactChecker said:
The measurement uncertainty may very often be negligible compared to the physical random component of the process.
Sure, then that is not a violation of the assumption.
 
  • #58
Dale said:
Sure, then that is not a violation of the assumption.
That was not your answer in the original post. You said:
"The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable."
That is just not true, and bad advice in general.
EDIT: It just occurred to me that there may be a significant cultural difference between fields where the main problem is in getting accurate measurements versus fields where the uncertainty of measurements is secondary to the effects of external random physical processes. In that first case, your advice may be the best thing to consider immediately. That was not true in my field. In my field, the external random influences dominated. (There were still significant problems with measurements in certain areas that were often dealt with using Kalman filters. I was not involved with that and do not know much about it.)
 
  • #59
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
 
  • #60
Dale said:
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
Yes. I think I am beginning to understand the different perspectives on this issue.
 
