Comparing Approaches: Linear Regression of Y on X vs X on Y

  • Context: High School
  • Thread starter: FactChecker
  • Tags: Axes, Regression
SUMMARY

The discussion centers on the comparison of linear regression models, specifically Y on X versus X on Y. Participants emphasize that the choice of regression model should depend on the intended use of the data and on minimizing the sum of squared errors (SSE) of the correct model. Monte Carlo simulations demonstrate that the model with the noisier variable as the dependent variable yields better estimates of the regression coefficients (a minimal simulation sketch is included after this summary block). The consensus is that minimizing the SSE of the correct model is crucial for accurate estimation, particularly when the independent variable is subject to measurement error.

PREREQUISITES
  • Understanding of linear regression models, specifically Y = aX + b + ε and X = a'Y + b' + ε'
  • Familiarity with Monte Carlo simulation techniques for statistical analysis
  • Knowledge of sum of squared errors (SSE) and its implications in regression analysis
  • Concept of measurement error in independent and dependent variables
NEXT STEPS
  • Explore advanced linear regression techniques, including "errors in variables" models
  • Learn about Monte Carlo simulation methods for validating regression models
  • Investigate the implications of measurement error on regression coefficients
  • Study the differences between ordinary least squares (OLS) and generalized least squares (GLS) regression
USEFUL FOR

Data scientists, statisticians, and researchers involved in regression analysis and model validation, particularly those dealing with measurement errors in their datasets.
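To make the simulation claim in the summary concrete, here is a minimal Monte Carlo sketch in Python. It is not code from the thread: the true relation y = 2x + 1, the noise level, and all names are illustrative assumptions. With measurement noise on y only, regressing y on x recovers the true slope, while inverting a regression of x on y systematically overshoots it.

```python
# Minimal Monte Carlo sketch (illustrative, not from the thread): compare
# regressing the noisy variable (y) on the clean one (x) against inverting
# the reverse regression, when the true relation is y = 2x + 1 + noise.
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(u, v):
    """OLS slope of the regression of v on u."""
    uc = u - u.mean()
    return (uc * (v - v.mean())).sum() / (uc ** 2).sum()

trials, n = 1000, 100
slopes_y_on_x, slopes_x_on_y_inverted = [], []
for _ in range(trials):
    x = rng.uniform(0.0, 10.0, n)                  # inputs, known exactly
    y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, n)    # outputs with noise
    slopes_y_on_x.append(ols_slope(x, y))          # correct model: y on x
    slopes_x_on_y_inverted.append(1.0 / ols_slope(y, x))  # inverted x-on-y fit

print("y on x, mean slope:         ", np.mean(slopes_y_on_x))           # ~2.0
print("x on y inverted, mean slope:", np.mean(slopes_x_on_y_inverted))  # ~2.2, biased high
```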

  • #61
Dale said:
The important thing is which is measured/known most precisely.
(I didn't actually find that exact quote, but it is referenced in post #58.)
Alex has a set of experimental data of inputs X and outputs Y.
For the purposes of a further experiment, she wants to input a value x that would give the best chance of an output in the vicinity of y.
How should she choose the value of x?
 
  • #62
Personally, I would recommend a full Bayesian approach, so that you can include all prior information on X and Y: any manufacturer's information on the precision and accuracy of X and Y, any previous literature and expert opinion, and any known physical relationship between them.

Then the posterior predictive distribution P(X|Y,y) would be what you want.
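A minimal sketch of how this could be set up numerically, assuming a linear model y = ax + b with known Gaussian noise, flat priors on a and b, and a flat prior on the candidate input x. The data, grids, and parameter values are all illustrative assumptions, not anything from the thread; a real analysis would use informative priors as described above.

```python
# Grid-approximation sketch of the Bayesian idea (illustrative assumptions:
# y = a*x + b + Gaussian noise with known sigma, flat priors on a, b, and x).
import numpy as np

rng = np.random.default_rng(0)

# Fake experimental data: x known exactly, y measured with noise.
sigma = 0.5
x_data = np.linspace(0.0, 10.0, 25)
y_data = 2.0 * x_data + 1.0 + rng.normal(0.0, sigma, x_data.size)

# Posterior over (a, b) on a grid, proportional to the Gaussian likelihood.
a_grid = np.linspace(1.5, 2.5, 101)
b_grid = np.linspace(0.0, 2.0, 101)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
resid = y_data - (A[..., None] * x_data + B[..., None])   # shape (a, b, data)
log_post = -0.5 * np.sum((resid / sigma) ** 2, axis=-1)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Predictive distribution over the input x that would yield a target output y*:
# P(x | data, y*) is proportional to the average of P(y* | x, a, b) over the
# (a, b) posterior, with a flat prior on x.
y_star = 15.0
x_grid = np.linspace(0.0, 10.0, 201)
mu = A[..., None] * x_grid + B[..., None]                  # shape (a, b, x)
like_y = np.exp(-0.5 * ((y_star - mu) / sigma) ** 2)
p_x = np.tensordot(post, like_y, axes=([0, 1], [0, 1]))
p_x /= p_x.sum()

print("most probable input for y* = 15:", x_grid[np.argmax(p_x)])  # ~7 for y = 2x + 1
```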
 
  • #63
haruspex said:
Alex has a set of experimental data of inputs X and outputs Y.
For the purposes of a further experiment, she wants to input a value x that would give the best chance of an output in the vicinity of y.
How should she choose the value of x?

That's almost a well-defined mathematical problem. One thing that's missing is how Alex will pick the value of y. For example, would she pick it from a uniform distribution over an interval ##[y_0, y_1]## where experimental data exist, or is she trying to predict a value of x that corresponds to a y-value that hasn't been observed yet?

Generally speaking, questions of the form "What is the probability of such-and-such given the observed data?" require a Bayesian approach. Non-Bayesian approaches answer questions of the form "What is the probability of the observed data given that such-and-such is assumed to be true?".
 
  • #64
Jarvis323 said:
Apparently, introducing multiplicative noise to x before OLS is equivalent to doing ridge regression, which is a form of regularized linear regression where the model is biased to reduce the variance. The intuition is that adding noise to x stretches out the distribution of x values, which in turn reduces the slope of the model.

http://madrury.github.io/jekyll/update/statistics/2017/08/12/noisy-regression.html

It is interesting, I guess, because as the slides I linked show, OLS doesn't really care about the variance in y so long as it's constant and normally distributed. But variance in x will lead to increased model bias (and reduced model variance).

So, for univariate OLS, choosing which of the lower- and higher-variance variables to use as the independent variable is actually an act of making a bias-variance tradeoff? And in general, you can always add noise to the independent variables if you want to, in order to increase bias, reduce model complexity, and reduce model variance?
Are there reasonable ways to check that the Gauss–Markov assumptions are met: residuals that are i.i.d. with mean 0, pairwise independent, and independent of the independent variables?
 
  • #65
Just an observation; I'm not sure if it has been addressed. Not surprisingly, unlike the purely geometric case of a given line ##y = mx## and its 'reciprocal' ##x = \frac{1}{m} y##, for which the product of the slopes is ##m \cdot \frac{1}{m} = 1##, the two regression lines have respective slopes ##r \frac{s_y}{s_x}## and ##r \frac{s_x}{s_y}##, whose product is ##r^2 \neq 1## when ##r \neq \pm 1##. But maybe this is obvious.
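For anyone who wants to see it numerically, a tiny sketch on made-up correlated data, confirming that the product of the two regression slopes is ##r^2##:

```python
# Tiny check (made-up data): the two regression slopes multiply to r^2, not 1.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 2.0, 5000)
y = 0.7 * x + rng.normal(0.0, 1.0, 5000)

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(), y.std()

slope_y_on_x = r * sy / sx     # slope of the regression line of y on x
slope_x_on_y = r * sx / sy     # slope of the regression line of x on y

print(slope_y_on_x * slope_x_on_y)   # equals r^2 ...
print(r ** 2)                        # ... which is < 1 unless r = ±1
```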
 
