Can We Use Regression Line x on y if y is the Dependent Variable?

SUMMARY

The discussion centers on the appropriate use of regression lines when estimating one variable from another, specifically when y is the dependent variable. It is established that if the goal is to estimate x based on y, the regression line x on y should be used, provided that y is the more precisely measured variable. Conversely, if x is known more precisely, the regression line y on x is recommended, even when the goal is to estimate x. The assumption of Ordinary Least Squares (OLS) regression is that all error resides in the dependent variable, which influences the choice of regression line.

PREREQUISITES
  • Understanding of Ordinary Least Squares (OLS) regression
  • Familiarity with correlation coefficients, specifically the product moment correlation coefficient (r)
  • Knowledge of linear regression concepts and terminology
  • Ability to assess measurement precision of variables
NEXT STEPS
  • Study the implications of measurement error in regression analysis
  • Learn about the differences between regression lines x on y and y on x
  • Explore the concept of sum-squared-errors in linear regression
  • Investigate advanced regression techniques that account for measurement error
USEFUL FOR

Data analysts, statisticians, and researchers involved in regression analysis and model fitting will benefit from this discussion, particularly those interested in understanding the nuances of dependent and independent variables in regression contexts.

songoku
TL;DR
Say I have 10 bivariate data points (x and y), where x is the independent variable and y is the dependent variable.

I want to estimate the value of x from a certain given value of y. Which regression line should I use, regression line y on x or regression line x on y?
I have a note that states the regression line x on y is used when we want to calculate x for a given y, but in this case y is the dependent variable. I am pretty sure I can use either line if the product moment correlation coefficient (r) is close to 1, but for a case where, say, r = 0.6, can we use the regression line x on y even though y is the dependent variable? Or should we use the regression line y on x to calculate the value of x?

Thanks
 
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.
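To illustrate this point, here is a small NumPy sketch (the data, slope, and noise levels are invented for illustration). When all of the error is in y, the y-on-x fit recovers the true slope, while inverting the x-on-y fit does not:

```python
# Sketch (simulated data): x is known exactly, all noise is in y,
# matching the OLS assumption for a y-on-x regression.
import numpy as np

rng = np.random.default_rng(0)
true_slope, true_intercept = 2.0, 1.0
x = np.linspace(0, 10, 200)                              # x measured exactly
y = true_intercept + true_slope * x + rng.normal(0, 2.0, x.size)  # error in y

slope_y_on_x, _ = np.polyfit(x, y, 1)   # y on x: noise on the dependent side
slope_x_on_y, _ = np.polyfit(y, x, 1)   # x on y: noise on the independent side

print(slope_y_on_x)       # close to the true slope 2.0
print(1 / slope_x_on_y)   # systematically larger than the true slope
```

The inverted x-on-y slope is biased because the noisy variable sits on the independent side, which is exactly the OLS assumption being violated.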
 
Dale said:
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.
I see, so I should use regression line y on x even if I want to estimate x.

But I am sorry, I have another question. Can I argue that, since y is given by the question (y is still the dependent variable), y is known more precisely, so the appropriate regression line is x on y?

Thanks
 
songoku said:
I see, so I should use regression line y on x even if I want to estimate x.
Yes, where y is the thing which has a large error and x is measured almost exactly.

songoku said:
But I am sorry, I have another question. Can I argue that, since y is given by the question (y is still the dependent variable), y is known more precisely, so the appropriate regression line is x on y?
This isn’t a matter of argument. How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
 
If you want to estimate x based on the values of y, you should do a regression of x on y (x dependent and y independent). Linear regression would minimize the sum-squared-errors of the sampled ##x_i## versus the estimated ##\hat{x_i}(y_i)##. Doing a regression the other way would minimize the wrong sum-squared-errors and all the related statistics would be wrong.
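A short numerical sketch of this point (simulated data, chosen only for illustration): the x-on-y fit minimizes the squared errors in the x direction, so for predicting x it beats the inverted y-on-x line on that criterion.

```python
# Sketch: compare the sum-squared-errors in x for two ways of
# predicting x from y — a direct x-on-y fit versus inverting y on x.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 300)
y = 1.5 * x + rng.normal(0, 1.0, 300)

a_xy, b_xy = np.polyfit(y, x, 1)       # x on y (slope, intercept)
a_yx, b_yx = np.polyfit(x, y, 1)       # y on x (slope, intercept)

xhat_direct = a_xy * y + b_xy          # prediction from the x-on-y line
xhat_inverted = (y - b_yx) / a_yx      # prediction from the inverted y-on-x line

sse_direct = np.sum((x - xhat_direct) ** 2)
sse_inverted = np.sum((x - xhat_inverted) ** 2)
print(sse_direct < sse_inverted)       # True: x on y minimizes this SSE
```

The direct fit always wins on this criterion (unless |r| = 1, when the two lines coincide), because OLS of x on y minimizes that sum over all linear functions of y.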
 
Dale said:
Yes, where y is the thing which has a large error and x is measured almost exactly.

This isn’t a matter of argument. How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
I understand

FactChecker said:
If you want to estimate x based on the values of y, you should do a regression of x on y (x dependent and y independent). Linear regression would minimize the sum-squared-errors of the sampled ##x_i## versus the estimated ##\hat{x_i}(y_i)##. Doing a regression the other way would minimize the wrong sum-squared-errors and all the related statistics would be wrong.
What if, in the data I have, x is the independent variable and y is the dependent variable, and I need to estimate x for a given y?

Thanks
 
songoku said:
What if, in the data I have, x is the independent variable and y is the dependent variable, and I need to estimate x for a given y?
That is what I am talking about. They are both linear regression problems. However, the coefficients you get from the two linear regressions are not the same or even easily related. The errors in the sum-squared-error that are minimized in the linear regressions are projections onto different axes. (That is, minimizing ##\sum (y_i-\hat{y_i})^2## is not the same as minimizing ##\sum (x_i-\hat{x_i})^2##.)
So you should do a linear regression with X as a linear function of Y.
The issue is not how well the X and Y values can be measured, it is how well the values fit the selected model. That is the sum-squared-error that is being minimized.
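One way to see how different the two fitted lines are (a sketch with simulated data, tuned so that r comes out near the 0.6 mentioned in the question): the product of the two slopes equals ##r^2##, so the lines coincide only when ##|r| = 1##.

```python
# Sketch: the slope of y on x times the slope of x on y equals r**2,
# so the two regression lines agree only when |r| = 1.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)
y = 0.6 * x + rng.normal(0, 0.8, 500)   # moderate correlation, r near 0.6

b_yx = np.polyfit(x, y, 1)[0]           # slope of y on x
b_xy = np.polyfit(y, x, 1)[0]           # slope of x on y
r = np.corrcoef(x, y)[0, 1]

print(b_yx * b_xy)                      # equals r**2
print(r**2)
```

This matches the observation in the original question: with r close to 1 either line gives nearly the same answer, but at r = 0.6 the two lines differ substantially.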
 
FactChecker said:
That is what I am talking about. They are both linear regression problems. However, the coefficients you get from the two linear regressions are not the same or even easily related. The errors in the sum-squared-error that are minimized in the linear regressions are projections onto different axes. (That is, minimizing ##\sum (y_i-\hat{y_i})^2## is not the same as minimizing ##\sum (x_i-\hat{x_i})^2##.)
So you should do a linear regression with X as a linear function of Y.
The issue is not how well the X and Y values can be measured, it is how well the values fit the selected model. That is the sum-squared-error that is being minimized.
I understand your explanation, but why does it seem to me that your suggestion differs from @Dale 's? Or am I misinterpreting something?

Dale said:
How well can you measure the values of y? How well can you measure the values of x? The answer to those questions determines the method you should use.
Dale said:
The important thing is which is measured/known most precisely. That should be the independent variable. The assumption of OLS regression is that all of the error is in the dependent variable.

From those replies, the independent variable should be the one that can be measured more precisely, which is ##x## in my case, so the regression line that should be used is ##y## on ##x##, even though I want to estimate ##x## from ##y##.

But from your reply (@FactChecker ), I should use the regression line ##x## on ##y## because I want to estimate ##x## for a given ##y##, so that the estimate suits the model (the error in ##x## is minimized), even though my independent variable is ##x## (I can't change ##y## to be the independent variable).

Am I correct to think that there are two different suggestions for my hypothetical case?

Thanks
 
songoku said:
From those replies, the independent variable should be the one that can be measured more precisely, which is x in my case, so the regression line that should be used is y on x, even though I want to estimate x from y.
Yes, this is correct.

@FactChecker can confirm, but I don’t think that he is disagreeing with me. He is just showing you why the two choices are not equivalent.
 
