B Comparing Approaches: Linear Regression of Y on X vs X on Y

  • Thread starter: FactChecker
  • Tags: Axes, Regression
  • #51
FactChecker said:
I can think of examples where switching the dependent variable to the one with the greater measurement uncertainty would not be recommended.
As can I, which is why I have repeatedly stated that I do not advocate blind application of this approach. It is clearly beneficial in some circumstances, but I do not have a test or a formulaic method.

The key point is to recognize the often-ignored assumption in OLS regression that there is no measurement uncertainty in the independent variable. When that assumption is severely violated, standard OLS can produce biased estimates. If the dependent variable has little measurement uncertainty, then a potentially easy remedy is to switch the axes. All of these steps should be taken judiciously.
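A minimal sketch of the bias in question (the numbers below are illustrative assumptions, not from the thread): ##y## is generated exactly from the true ##x##, but only a noisy version of ##x## is observed.

```python
# Sketch: attenuation bias in OLS when the independent variable carries
# measurement error, and the axis-switch remedy when y is clean.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_x = rng.uniform(0.0, 1.0, n)
y = 2.0 * true_x + 5.0                      # exact relationship, no noise in y
obs_x = true_x + rng.normal(0.0, 0.5, n)    # heavy measurement error in x (assumed sd)

# OLS of y on the noisy x: the slope is biased toward zero (attenuation).
slope_yx = np.polyfit(obs_x, y, 1)[0]

# Switching axes: regress the noisy variable (x) on the clean one (y),
# then invert the fitted slope to recover the y-on-x slope.
slope_xy = np.polyfit(y, obs_x, 1)[0]
recovered = 1.0 / slope_xy

print(slope_yx)    # noticeably below the true slope of 2
print(recovered)   # close to 2
```

Because all of the error now sits in the regression's dependent variable, the x-on-y fit satisfies the OLS assumption, and inverting its slope recovers the y-on-x relationship.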
 
Likes: FactChecker
  • #52
Basically
FactChecker said:
I am not sure that I fully understand. The original post that started all this was here. It said, "I want to estimate the value of x from a certain given value of y." It seemed simple, but it has gotten beyond my level of expertise.

Imagine two somewhat extreme cases,

(1) You are a scientist seeking explainable knowledge about the nature of an observable. Your goal is to find something very close to the true model that nature uses, and you don't care whether it will get you rich in the short term. Then you will use what you know to come up with a parametric family of models you think should be the right one. Maybe you use all kinds of scientific theory to guess what that should be, you will carefully consider the assumptions, and you will try to choose a simple one using Occam's razor. Now assume you've got the right one; you need to fit the model's parameters ##\theta## to the data. This is a clear case where you are doing parameter estimation. And this is fine; the only drawback is that you've basically got to know the system you're predicting down to a science. Scientists tend to like this approach for obvious reasons.

(2) You don't care much about science, or knowledge, and you either don't know, or don't care, what the true model is that nature uses. You don't know how, and you either can't, or don't have time to, figure out what assumptions you can rely on. You only care about getting results. Then you will probably just start with a complicated model, throwing every possible predictor at the problem you can think of as a candidate variable in the model (e.g. day of the week, temperature, day the groundhog woke up, the alignments of the planets, the lengths and titles of the headline news articles, etc., whatever you want/have).

Then you try to simultaneously simplify the model and minimize the error you get on test data. Simplifying the model usually means reducing the number of variables, combining the variables using PCA, or using regularization to force the weights of redundant or poorly performing variables toward 0. The more you simplify the model, the more you bias it, and the more you reduce the variance of the model (which is how much the model changes if you train it on different subsets). So a more biased model is more stable/consistent. The reason that simplifying the model reduces the variance is that each additional variable introduces its own noise, and a complex/unbiased model has more freedom to fit to that noise. And because of the curse of dimensionality, the more variables you have, the greater the chance that the random noise alone is a perfect predictor of ##y##, and that chance increases rapidly. In fact, it can be demonstrated that, at some level of model complexity relative to sample size (common in neuroscience, for example), you can make a perfect predictor of the training data using only variables which are purely random noise with absolutely no true correlation with ##y##. For example, you could take each pixel on a TV which is displaying static noise to be a separate variable, and each frame is one data point.

With a small enough number of frames, your unbiased model will perfectly fit the price of the stock market (in your training data), but on your testing data it will have no predictive power at all. And if you had one additional variable that was a true predictor, it would be drowned out by all of the by-chance correlations, and you would still get a terrible model. Since you have so many pixels, it will require a large number of frames before your model starts fitting to that one true predictor rather than some combination of those random fluctuations.
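The TV-static point can be sketched directly (the sizes below are illustrative assumptions: 200 pure-noise features, 50 training frames):

```python
# Sketch: with more pure-noise features than training samples, least squares
# fits the training targets exactly, yet has no skill on fresh data.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_features = 50, 200            # p >> n, like pixels of TV static
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)       # target unrelated to the features

# Minimum-norm least-squares solution (lstsq handles the underdetermined case).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_resid = X_train @ w - y_train
print(np.max(np.abs(train_resid)))       # essentially zero: perfect training fit

# On fresh data the "model" is useless: test error is at least the variance of y.
X_test = rng.normal(size=(1000, n_features))
y_test = rng.normal(size=1000)
test_err = np.mean((X_test @ w - y_test) ** 2)
print(test_err)
```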

One major problem with approach (2) is that you often end up with a model which you can't explain. It might also be disconnected from any established theory about the domain and too troublesome to try to integrate with theory. It also depends on having large amounts of data in order to avoid overfitting, especially as you have more variables.

Scientists also end up using an approach closer to (2) more and more nowadays, because the first approach is hard, and some data scientists are interested in highly complex data that seems to have no simple model. And in some cases, people have decided that results matter even if we don't know why, as in clinical diagnosis and in the development of more efficient technology. Neuroscience is a good example. Now that people are throwing massive amounts of data at such problems, however, they are also trying to think about how to use this approach and still end up with an explainable model (have their cake and eat it too).

In reality, people tend to use an approach somewhere in between. For example, you might know that the phases of the moon are unlikely to predict the stock market, so you will exclude them as candidate variables from the start. Everything you know about which variables might be better predictors than others can help reduce the number of variables up front, which in turn reduces the amount of data you need and the risk of fitting to noise.
 
Likes: FactChecker
  • #53
Dale said:
I used ##x## going from 0 to 1 in steps of 0.01, with a true regression of ##y=2 x + 5##.

It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
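A minimal sketch of the simulation described above (the noise level is an illustrative assumption). This setup, where the experimenter dials the observed ##x## and the true ##x## differs from it by noise, is a Berkson-type error model:

```python
# Sketch: the experimenter sets the *observed* x over a grid; the unknown true x
# is the dialed value minus gaussian noise, and y is generated from the true x.
import numpy as np

rng = np.random.default_rng(2)
obs_x = np.tile(np.arange(0.0, 1.01, 0.01), 100)   # repeat the grid for stability
true_x = obs_x - rng.normal(0.0, 0.1, obs_x.size)  # assumed noise sd of 0.1
y = 2.0 * true_x + 5.0                             # true regression y = 2x + 5

slope, intercept = np.polyfit(obs_x, y, 1)
print(slope, intercept)   # close to 2 and 5
```

Interestingly, in this Berkson-type setup the fitted slope is not attenuated, unlike the case where noise is stacked on top of a fixed true ##x##: here the noise is independent of the dialed values, so it only inflates the residual scatter.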
 
Likes: FactChecker
  • #54
Stephen Tashi said:
It's interesting to consider what would happen in a simulation that represented an experiment where the experimenter attempted to vary ##x## in the same manner. It seems to me that a simulation of that experiment would be to vary the observed value of ##x## systematically and then pick the true value of ##x## by subtracting gaussian noise from that value.
IMHO, there are a lot of things to experiment with and analyze. The situation is much more complicated than just saying that the variable with the most measurement uncertainty should be the dependent variable. It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated.
 
  • #55
FactChecker said:
It requires some serious justification before the regression dependent variable should be changed from the normal one to be estimated
A violation of the assumptions of any method is a serious justification for using a different method.
 
  • #56
Dale said:
A violation of the assumptions of any method is a serious justification for using a different method.
Nope. The measurement uncertainty may very often be negligible compared to the physical random component of the process. In that case, switching the dependent variable in linear regression is the "tail wagging the dog".
 
  • #57
FactChecker said:
The measurement uncertainty may very often be negligible compared to the physical random component of the process.
Sure, then that is not a violation of the assumption.
 
  • #58
Dale said:
Sure, then that is not a violation of the assumption.
That was not your answer in the original post. You said:
"The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable."
That is just not true, and bad advice in general.
EDIT: It just occurred to me that there may be a significant cultural difference between fields where the main problem is in getting accurate measurements versus fields where the uncertainty of measurements is secondary to the effects of external random physical processes. In that first case, your advice may be the best thing to consider immediately. That was not true in my field. In my field, the external random influences dominated. (There were still significant problems with measurements in certain areas that were often dealt with using Kalman filters. I was not involved with that and do not know much about it.)
 
  • #59
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
 
Likes: FactChecker
  • #60
Dale said:
A quantity that you cannot control well, but can measure precisely is still measured/known precisely in this context.
Yes. I think I am beginning to understand the different perspectives on this issue.
 
Likes: Dale
  • #61
Dale said:
The important thing is which is measured/known most precisely.
(Didn't actually find that quote but it is referenced in post #58.)
Alex has a set of experimental data of inputs X and outputs Y.
For the purposes of a further experiment, she wants to input a value x that would give the best chance of an output in the vicinity of y.
How should she choose the value of x?
 
  • #62
Personally, I would recommend a full Bayesian approach so that you can include all prior information on X and Y. Including any manufacturers information on the precision and accuracy of X and Y, any previous literature and expert opinion, and any physical relationship between them.

Then the posterior predictive distribution P(X|Y,y) would be what you want.
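A minimal sketch of one way to compute such a posterior predictive numerically (all data, grids, priors, and noise levels below are illustrative assumptions, with the noise sd taken as known and flat priors throughout):

```python
# Sketch: grid-based Bayesian calibration. Given (X, Y) data from
# y = a*x + b + noise, form a posterior over (a, b), then the posterior
# predictive distribution of x for a newly observed output y_new.
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 40)
Y = 2.0 * X + 5.0 + rng.normal(0, 0.1, 40)
sigma = 0.1                                   # assumed known noise level

# Posterior over (a, b) on a grid, flat prior.
a_grid = np.linspace(1.5, 2.5, 101)
b_grid = np.linspace(4.5, 5.5, 101)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
resid = Y[None, None, :] - (A[..., None] * X + B[..., None])
log_post = -0.5 * np.sum((resid / sigma) ** 2, axis=-1)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Posterior predictive for x given a new observation y_new, flat prior on x.
y_new = 6.0
x_grid = np.linspace(0, 1, 201)
like = np.exp(-0.5 * ((y_new - (A[..., None] * x_grid + B[..., None])) / sigma) ** 2)
pred = np.tensordot(post, like, axes=([0, 1], [0, 1]))
pred /= pred.sum()
x_est = np.sum(x_grid * pred)
print(x_est)   # near (6 - 5) / 2 = 0.5
```

The full distribution `pred` (not just its mean) is the point of the exercise: it carries the combined uncertainty from the parameter posterior and the new measurement.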
 
  • #63
haruspex said:
Alex has a set of experimental data of inputs X and outputs Y.
For the purposes of a further experiment, she wants to input a value x that would give the best chance of an output in the vicinity of y.
How should she choose the value of x?

That's almost a well defined mathematical problem. One thing that's missing is how Alex will pick the value of y. For example, would she pick it from a uniform distribution over an interval [y0, y1] where the experimental data exists? - or is she trying to predict a value of x that corresponds to a y-value that hasn't been observed yet?

Generally speaking, questions of the form "What is the probability of such-and-such given the observed data?" require a Bayesian approach. Non-Bayesian approaches answer questions of the form "What is the probability of the observed data given that such-and-such is assumed to be true?"
 
  • #64
Jarvis323 said:
Apparently, introducing multiplicative noise to x before OLS is equivalent to doing ridge regression, which is a form of regularized linear regression where the model is biased to reduce the variance. The intuition is that adding noise to x stretches out the distribution of x values, which in turn reduces the slope of the model.

http://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html

It is interesting, I guess, because as the slides I linked show, OLS doesn't really care about the variance in y so long as it's constant and normally distributed. But variance in x will lead to increased model bias (and reduced model variance).

So, for univariate OLS, the choice of using the lower- or higher-variance variable as the independent variable is actually an act of making a bias-variance tradeoff? And in general, you can always add noise to the independent variables if you want to, to increase bias/reduce model complexity/reduce model variance?
Are there "reasonable" ways to check that the Gauss–Markov assumptions are met: residuals that are i.i.d. with mean 0, pairwise independent, and independent of the independent variables?
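The noise-equals-ridge equivalence can be checked numerically, assuming additive input noise (the case the usual derivation covers); the data and noise levels below are illustrative:

```python
# Sketch: fitting OLS on many noise-perturbed copies of x approaches the ridge
# estimate with penalty n * noise_variance (for centered, no-intercept data).
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(0, 0.3, n)
x, y = x - x.mean(), y - y.mean()          # center so no intercept is needed

tau2 = 0.5                                 # assumed variance of the injected noise
copies = 2000
x_noisy = np.concatenate([x + rng.normal(0, np.sqrt(tau2), n) for _ in range(copies)])
y_rep = np.tile(y, copies)

# OLS slope on the noise-augmented data.
slope_noisy = (x_noisy @ y_rep) / (x_noisy @ x_noisy)

# Closed-form ridge slope with penalty n * tau2.
slope_ridge = (x @ y) / (x @ x + n * tau2)

print(slope_noisy, slope_ridge)   # the two agree closely
```

The denominator of the noisy fit inflates by roughly `n * copies * tau2` while the numerator is unchanged in expectation, which is exactly the ridge shrinkage.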
 
  • #65
Just an observation; not sure if it has been addressed: (not surprisingly), unlike the purely geometric case of a given line ##y = mx## and its inverse ##x = \frac{1}{m} y##, for which the product of the slopes is ##m \cdot \frac{1}{m} = 1##, the two regression lines have respective slopes ##r \frac{s_y}{s_x}## and ##r \frac{s_x}{s_y}##, whose product is ##r^2 \neq 1## when ##r \neq \pm 1##. But maybe this is obvious.
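A quick numeric check of that observation (data below is illustrative):

```python
# Sketch: the y-on-x and x-on-y regression slopes multiply to r**2, not 1.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(0, 1.0, 500)

r = np.corrcoef(x, y)[0, 1]
slope_yx = np.polyfit(x, y, 1)[0]   # r * s_y / s_x
slope_xy = np.polyfit(y, x, 1)[0]   # r * s_x / s_y

print(slope_yx * slope_xy, r**2)    # equal up to floating point
```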
 