# Linear Regression: reversing the roles of X and Y

by kingwinner
 P: 1,270
Simple linear regression: Y = β0 + β1 *X + ε, where ε is random error.
The fitted (predicted) value of Y for each X is: Y hat = b0 + b1 *X (e.g. Y hat = 7.2 + 2.6 X).
Consider X hat = b0' + b1' *Y [b0, b1, b0', and b1' are the least-squares estimates of the β's].
Prove whether or not we can get the values of b0, b1 from b0', b1'. If not, why not?

Completely clueless... Any help is greatly appreciated!
 Math Emeritus Sci Advisor Thanks PF Gold P: 39,339 Start with $y= b_0+ b_1x$ and solve for x.
HW Helper
P: 1,361
 Quote by kingwinner Simple linear regression: Y = β0 + β1 *X + ε, where ε is random error. The fitted (predicted) value of Y for each X is: Y hat = b0 + b1 *X (e.g. Y hat = 7.2 + 2.6 X). Consider X hat = b0' + b1' *Y [b0, b1, b0', and b1' are the least-squares estimates of the β's]. Prove whether or not we can get the values of b0, b1 from b0', b1'. If not, why not? Completely clueless... Any help is greatly appreciated!
I'm a little confused about your question. Are you asking whether regressing X on Y will always give the same coefficients, or whether it is ever possible to get the same ones?

P: 1,270
 Quote by statdad I'm a little confused about your question? Are you asking whether regressing X on Y will always give the same coefficients, or whether it is ever possible to get the same ones?
(X_i, Y_i), i=1,2,...n

Y hat is a fitted (predicted) value of Y based on fixed values of X.
Y hat = b0 + b1 *X, with b0 and b1 being the least-squares estimates.

For X hat, we are predicting the value of X from values of Y, which would produce a different set of parameters, b0' and b1'. Is there any general mathematical relationship linking b0', b1' and b0, b1?

 P: 1,270 Any help? I think this is called "inverse regression"...
 HW Helper P: 1,361
"Is there any general mathematical relationship linking b0', b1' and b0, b1?"

No. If you put some severe restrictions on the Ys and Xs you could come up with a situation in which the two sets are equal, but in general, no.

Also, note that in the situation where x is fixed (non-random), regressing x on Y makes no sense: the dependent variable in regression must be random.

This may be off-topic for you, but Graybill ("Theory and Application of the Linear Model": my copy is from 1976, a horrid-green cover) discusses a similar problem on pages 275-283. The problem in the book deals with this: if we observe a value of a random variable Y (say y0) in a regression model, how can we estimate the corresponding value of x?
 P: 330 Kingwinner: the easiest first step is to try an example. Start with a random set of (X,Y) pairs and regress Y on X and see what the coefficients b0,b1 are. Then regress X on Y and see what the coefficients b0',b1' are. Do you see any simple relationship between b0,b1 and b0',b1'? (i.e. can you get b0',b1' by solving the equation y=b0+b1x for x?)
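The experiment suggested above can be sketched in a few lines of standard C++. This is only an illustration under assumptions: the data points are made up, and the names `Moments`, `Line`, `centeredMoments`, `regressYonX`, `regressXonY`, `solveForX`, and `rSquared` are hypothetical helpers introduced here, not anything from the thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Means and centered sums of squares/products for paired data (hypothetical helper type).
struct Moments {
    double meanX, meanY, sXX, sYY, sXY;
};

// A line written as: response = intercept + slope * predictor.
struct Line {
    double intercept, slope;
};

Moments centeredMoments(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    Moments m = {0.0, 0.0, 0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < x.size(); ++i) {
        m.meanX += x[i];
        m.meanY += y[i];
    }
    m.meanX /= n;
    m.meanY /= n;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - m.meanX;
        const double dy = y[i] - m.meanY;
        m.sXX += dx * dx;
        m.sYY += dy * dy;
        m.sXY += dx * dy;
    }
    return m;
}

// Least-squares fit of Y on X (minimizes squared vertical errors): y = b0 + b1 * x.
Line regressYonX(const Moments& m) {
    const double slope = m.sXY / m.sXX;
    return {m.meanY - slope * m.meanX, slope};
}

// Least-squares fit of X on Y (minimizes squared horizontal errors): x = b0' + b1' * y.
Line regressXonY(const Moments& m) {
    const double slope = m.sXY / m.sYY;
    return {m.meanX - slope * m.meanY, slope};
}

// Algebraically solve y = b0 + b1 * x for x, giving x = (-b0 / b1) + (1 / b1) * y.
Line solveForX(const Line& yOnX) {
    return {-yOnX.intercept / yOnX.slope, 1.0 / yOnX.slope};
}

// Squared sample correlation coefficient.
double rSquared(const Moments& m) {
    return (m.sXY * m.sXY) / (m.sXX * m.sYY);
}
```

With X = {1, 2, 3, 4, 5} and Y = {2, 1, 4, 3, 6}, `regressYonX` gives intercept 0.2 and slope 1.0, while `regressXonY` gives intercept of about 0.838 and slope of about 0.676. Solving the first line for x gives intercept -0.2 and slope 1.0 instead, so the two do not match. The product of the two regression slopes equals `rSquared` for the same data.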
 Math Emeritus Sci Advisor Thanks PF Gold P: 39,339 It can be shown that the line such that the sum of the vertical distances from points to the line, the line such that the sum of the horizontal distances from points to the line, and the line such that the sum of distances perpendicular to the line are all the same line. That says that reversing x and y will give the same line.
P: 462
 Quote by HallsofIvy It can be shown that the line such that the sum of the vertical distances from points to the line, the line such that the sum of the horizontal distances from points to the line, and the line such that the sum of distances perpendicular to the line are all the same line. That says that reversing x and y will give the same line.
I would be interested to see that proof...

When linear regression is used to find the line in slope-intercept form, this is not the case...as a glaring example, consider that a vertical line cannot be represented, whereas a horizontal one can. If your data set is more vertical than horizontal, you will get a much better fit by reversing the order of X and Y series.

I quickly wrote a program to randomly generate some data points and visually compare the line that minimizes Y error (yellow), the line that minimizes X error (purple), and the line that minimizes point-to-line distance (red). As you can see from this example, they are not always the same line.

To rule out the possibility that these differences are simply due to rounding errors, I repeated the experiment using single precision, double precision, and 320 bits of floating-point precision using GMP bignum. The results are the same in all cases, indicating that precision is not a factor here.

Here's my source code:
#include <cstdio>   // printf
#include <cstdlib>  // system
#include <vector>

// Non-standard headers from the poster's own code base
#include "linalg\linear_least_squares.h"
#include "vision\drawing.h"
#include "stat\rng.h"
#include "linalg\null_space.h"
#include "bignum\bigfloat.h"

// Ordinary least squares for y = m*x + b, minimizing squared error in Y.
template<typename Real>
void linear_regression( const std::vector<Real> &X, const std::vector<Real> &Y,
                        Real &m, Real &b)
{
    Real sX=0, sY=0, sXY=0, sXX=0;
    for(unsigned i=0; i<X.size(); ++i)
    {
        Real x = X[i], y = Y[i];
        sX += x;
        sY += y;
        sXY += x*y;
        sXX += x*x;
    }
    Real n = X.size();

    m = (sY*sX - n*sXY)/( sX*sX - n*sXX );
    b = (sX*sXY - sY*sXX)/( sX*sX - n*sXX );
}

int main()
{
    using namespace heinlib;
    using namespace cimgl;

    bigfloat::set_default_precision(300);

    typedef bigfloat Real;

    bigfloat rr;
    printf("precision = %d\n", rr.precision() );

    CImg<float> image(250, 250, 1, 3, 0);

    std::vector<Real> X, Y;

    // generate N random points in the 250x250 image and draw them
    int N = 10;
    for(unsigned i=0; i<N; ++i)
    {
        Real x = random<Real>::uniform(0, 250);
        Real y = random<Real>::uniform(0, 250);

        image.draw_circle( x, y, 3, color<float>::white() );

        X.push_back(x);
        Y.push_back(y);
    }

    // line 1 minimizes Y error; line 2 minimizes X error (roles reversed)
    Real m1, b1,  m2, b2;
    linear_regression(X, Y, m1, b1 );
    linear_regression(Y, X, m2, b2 );

    //flip second line back into y = m*x + b form
    b2 = -b2/m2;
    m2 = 1/m2;

    cimg_draw_line( image, m1, b1, color<float>::yellow() );
    cimg_draw_line( image, m2, b2, color<float>::purple() );

    //find the means of X and Y
    Real mX = 0, mY = 0;
    for(unsigned i=0; i<N; ++i)
    {
        Real x = X[i], y = Y[i];
        mX += x;  mY += y;
    }
    mX /= N;
    mY /= N;

    //find least squares line by distance to line..
    Real sXX=0, sYY=0, sXY=0;
    for(unsigned i=0; i<N; ++i)
    {
        Real x = X[i] - mX,
             y = Y[i] - mY;
        sXX += x*x;
        sYY += y*y;
        sXY += x*y;
    }

    // scatter matrix of the centered points; the line's normal comes from its SVD
    static_matrix<2,2,Real> A = { sXX, sXY,
                                  sXY, sYY };
    static_matrix<2,1,Real> Norm;

    null_space_SVD(A, Norm);

    //general form
    static_matrix<3,1,Real> line = { Norm[0], Norm[1], -( mX*Norm[0] + mY*Norm[1] ) };
    cimg_draw_line( image, line, color<float>::red() );

    CImgDisplay disp(image);
    system("pause");
}
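Since the listing above relies on headers that are not part of standard C++, here is a minimal self-contained sketch of just the perpendicular-distance fit (the red line). It is an assumption-laden stand-in, not the poster's method: instead of an SVD it uses the closed-form orientation of the principal axis of the 2x2 scatter matrix, and the names `GeneralLine` and `perpendicularFit` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Line in general form a*x + b*y + c = 0, where (a, b) is a unit normal.
struct GeneralLine {
    double a, b, c;
};

// Fit the line minimizing the sum of squared perpendicular distances.
// The line passes through the centroid and runs along the principal axis
// of the scatter matrix [[sXX, sXY], [sXY, sYY]].
GeneralLine perpendicularFit(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());

    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double sXX = 0.0, sYY = 0.0, sXY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sXX += dx * dx;
        sYY += dy * dy;
        sXY += dx * dy;
    }

    // Angle of the direction of largest spread (major eigenvector of the scatter matrix).
    const double theta = 0.5 * std::atan2(2.0 * sXY, sXX - sYY);

    // The unit normal is perpendicular to the direction (cos(theta), sin(theta)).
    const double a = -std::sin(theta);
    const double b = std::cos(theta);
    return {a, b, -(a * meanX + b * meanY)};
}
```

Because the result is in general form, a vertical line poses no problem here, unlike the slope-intercept form used by the two regressions. When `b` is not zero, the slope is `-a / b`.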
 Sci Advisor HW Helper P: 2,482 Assuming no singularity (vertical or horizontal) exists in the data, the standardized slope coefficient b/s.e.(b) as well as the goodness of fit statistic (R squared) will be identical between a vertical regression (Y = b0 + b1 X + u) and the corresponding horizontal regression (X = a0 + a1 Y + v) .
P: 462
 Quote by EnumaElish Assuming no singularity (vertical or horizontal) exists in the data, the standardized slope coefficient b/s.e.(b) as well as the goodness of fit statistic (R squared) will be identical between a vertical regression (Y = b0 + b1 X + u) and the corresponding horizontal regression (X = a0 + a1 Y + v) .
Well, you can say that... but you haven't given any formal proof or evidence of the claim, and it is contrary to the example I just showed, for which I have made the source available for you to see.

You can observe the same effect using Excel's built in linear regression. The graphs are rotated and stretched, but notice that the lines go through different points in relation to each other.

The singularity is not present in either of the examples.
 Sci Advisor HW Helper P: 2,482 Can you provide either the standard errors (of the coefficients) or the t statistics?
P: 462
 Quote by EnumaElish Can you provide either the standard errors (of the coefficients) or the t statistics?
Your question does not even make sense, as the coefficients are not random variables. The coefficients are mathematical solutions to the line equation for a fixed data set of points.

Showing that numerical precision was not responsible for their differences proves that the parameters of the recovered lines are indeed different (i.e., different equations). At least, I cannot think of any other possible way of interpreting those results. Let me know if you can...
HW Helper
P: 1,361
 Quote by junglebeast Your question does not even make sense, as the coefficients are not random variables. The coefficients are mathematical solutions to the line equation for a fixed data set of points. By showing that numerical precision was not responsible for their differences, this proves that the parameters of the recovered lines are indeed different (ie, different equations). At least, I cannot think of any other possible way of interpreting those results. Let me know if you can...
The coefficients in a regression are statistics, so it certainly does make sense to talk about their standard errors.

Since $$R^2$$ is simply the square of the correlation coefficient, that quantity will be the same whether you regress Y on x or X on y.

Sorry - hitting post too soon is the result of posting before morning coffee.

The slopes of Y on x and X on y won't be equal (unless you have an incredible stroke of luck), but the t-statistics in each case, used for testing $$H\colon \beta = 0$$, will be, since the test statistic for the slope can be written as a function of $$r^2$$.
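
One way to see why the t-statistics agree, sketched in standard simple-regression notation that is assumed here rather than taken from the thread: write $$S_{xx} = \sum (x_i - \bar{x})^2$$, with $$S_{yy}$$ and $$S_{xy}$$ defined analogously, $$b_1$$ for the slope of Y on x, and $$b_1'$$ for the slope of X on y. Using $$\mathrm{s.e.}(b_1)^2 = \mathrm{SSE} / ((n-2) S_{xx})$$ and $$\mathrm{SSE} = S_{yy}(1 - r^2)$$,

\begin{align} b_1 &= \frac{S_{xy}}{S_{xx}}, \qquad b_1' = \frac{S_{xy}}{S_{yy}}, \qquad r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = b_1 b_1' \\ t^2 &= \frac{b_1^2}{\mathrm{s.e.}(b_1)^2} = \frac{b_1^2 (n-2) S_{xx}}{S_{yy} (1 - r^2)} = \frac{(n-2)\, r^2}{1 - r^2} \end{align}

The last expression is unchanged when x and y are swapped, and the sign of t is the sign of $$S_{xy}$$ in both regressions, so the two t-statistics coincide even though the slopes do not.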
 Sci Advisor HW Helper P: 2,482 (i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.
HW Helper
P: 1,361
 Quote by EnumaElish (i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.
Er, I was agreeing with you earlier (if this post is aimed at me)
 Sci Advisor HW Helper P: 2,482 No, I posted too soon. I was responding to junglebeast's comment "the coefficients are not random variables."
P: 462
 Quote by EnumaElish (i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.
Initially, I generated X and Y randomly to make a fixed data set. Then I performed 9 tests on that fixed data set to get measurements of m and b. All of these m and b are comparable because they relate to the same data set.

If I were to generate X and Y and repeat the experiment multiple times, then yes, I could make m and b into random variables -- but this would be meaningless, because the "distribution" of m would have no mean and infinite variance, and that is not a distribution to which Student's t-test can be applied in any meaningful way.

You claimed that all three equations were equivalent. I showed that applying all three gives very different results. The only thing that distinguishes an analytical solution from an empirical one is the precision of the arithmetic. By demonstrating that increased precision does not change the results, I have shown that the mathematical expressions in my program are not equivalent. This is why I made my source visible. If the source does compute linear regression properly, then this proves that flipping the order in regression is not mathematically equivalent.

Further, I think I can show algebraically that reversing the roles of X and Y does not give the same line. Let (m1, b1) be the line found by minimizing Y-error, and let (m2, b2) be the line found by minimizing X-error (after reversing the roles of X and Y), both written in the same slope-intercept form,

\begin{align} y &= m1 x + b1\\ y &= m2 x + b2 \end{align}

By applying http://en.wikipedia.org/wiki/Linear_least_squares, we have

\begin{align} m1 &= \frac{\sum y \sum x - n \sum x y}{ (\sum x)^2 - n (\sum x^2)}\\ b1 &= \frac{ \sum x \sum x y - \sum y (\sum x^2)}{ (\sum x)^2 - n (\sum x^2)} \end{align}

We can also directly calculate the equation after reversing the roles of X and Y, although this also flips the line, so let's refer to that line as (m2b, b2b):

\begin{align} m2b &= \frac{\sum y \sum x - n \sum x y}{ (\sum y)^2 - n (\sum y^2)}\\ b2b &= \frac{ \sum y \sum x y - \sum x (\sum y^2)}{ (\sum y)^2 - n (\sum y^2)} \end{align}

Now we need to flip (m2b, b2b) into the same form as (m1,b1) for comparison. This rearrangement can be done by reversing x and y and putting back into slope-intercept form,

\begin{align} y &= \left(\frac{1}{m2b}\right)x + \left(-\frac{b2b}{m2b}\right) \\ &= m2 x + b2 \end{align}

Thus, looking just at the slope,

$$m2 = \frac{ (\sum y)^2 - n (\sum y^2)}{\sum y \sum x - n \sum x y}$$

We can see that m1 is not equal to m2 -- so we do not obtain the same equation after reversing the roles of X and Y.
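
A compact way to quantify the gap, offered as a sketch using the same sums: dividing the expression for m1 by the one for m2 gives

\begin{align} \frac{m1}{m2} &= \frac{\left(\sum y \sum x - n \sum x y\right)^2}{\left((\sum x)^2 - n \sum x^2\right)\left((\sum y)^2 - n \sum y^2\right)} = r^2 \end{align}

where r is the sample correlation coefficient of the data (a symbol introduced here, not used in the derivation above). So m1 = r^2 m2, and since both lines pass through the point of means, they coincide only when r^2 = 1, that is, when every data point lies exactly on one line.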
