Linear Regression: reversing the roles of X and Y

kingwinner · May 25, 2009

Simple linear regression:
Y = β0 + β1 *X + ε , where ε is random error

Fitted (predicted) value of Y for each X is:
Ŷ = b0 + b1 *X (e.g. Ŷ = 7.2 + 2.6 X)

Consider
X̂ = b0' + b1' *Y

[b0, b1, b0', and b1' are least-squares estimates of the β's]

Prove whether or not we can get the values of b0, b1 from b0', b1'. If not, why not?

Completely clueless...Any help is greatly appreciated!

HallsofIvy · May 25, 2009

Start with [itex]y= b_0+ b_1x[/itex] and solve for x.

statdad · May 25, 2009

kingwinner said:

Simple linear regression:
Y = β0 + β1 *X + ε , where ε is random error

Fitted (predicted) value of Y for each X is:
Ŷ = b0 + b1 *X (e.g. Ŷ = 7.2 + 2.6 X)

Consider
X̂ = b0' + b1' *Y

[b0, b1, b0', and b1' are least-squares estimates of the β's]

Prove whether or not we can get the values of b0, b1 from b0', b1'. If not, why not?

Completely clueless...Any help is greatly appreciated!

I'm a little confused about your question? Are you asking whether regressing X on Y will always give the same coefficients, or whether it is ever possible to get the same ones?

kingwinner · May 25, 2009

statdad said:

I'm a little confused about your question? Are you asking whether regressing X on Y will always give the same coefficients, or whether it is ever possible to get the same ones?

(X_i, Y_i), i=1,2,...n

Y hat is a fitted (predicted) value of Y based on fixed values of X.
Y hat = b0 + b1 *X with b0 and b1 being the least-square estimates.

For X hat, we are predicting the value of X from values of Y which would produce a different set of parameters, b0' and b1'. Is there any general mathematical relationship linking b0', b1' and b0, b1?

Thanks for answering!

kingwinner · May 27, 2009

Any help?

I think this is called "inverse regression"...

statdad · May 27, 2009

" Is there any general mathematical relationship linking b0', b1' and b0, b1?"
No. If you put some severe restrictions on the Ys and Xs you could come up with a situation in which the two sets are equal, but in general - no.

Also, note that in the situation where x is fixed (non-random), regressing x on Y makes no sense - the dependent variable in regression must be random.

This may be off-topic for you, but Graybill ("Theory and Application of the Linear Model": my copy is from 1976, a horrid-green cover) discusses a similar problem on pages 275-283: the problem in the book deals with this: If we observe a value of a random variable Y (say y0) in a regression model, how can we estimate the corresponding value of x?

mXSCNT · May 27, 2009

Kingwinner: the easiest first step is to try an example. Start with a random set of (X,Y) pairs and regress Y on X and see what the coefficients b0,b1 are. Then regress X on Y and see what the coefficients b0',b1' are. Do you see any simple relationship between b0,b1 and b0',b1'? (i.e. can you get b0',b1' by solving the equation y=b0+b1x for x?)
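As a minimal sketch of that experiment (the names `Line`, `fit_line` and `solve_for_x` are made up for illustration and do not come from any post in this thread), one can fit both regressions and compare the reversed fit with the algebraic inversion of the first:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A line written as v = intercept + slope * u.
struct Line { double intercept, slope; };

// Ordinary least squares of v on u (hypothetical helper).
Line fit_line(const std::vector<double>& u, const std::vector<double>& v) {
    assert(u.size() == v.size() && u.size() >= 2);
    const double n = static_cast<double>(u.size());
    double su = 0, sv = 0, suv = 0, suu = 0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        su += u[i];
        sv += v[i];
        suv += u[i] * v[i];
        suu += u[i] * u[i];
    }
    const double slope = (n * suv - su * sv) / (n * suu - su * su);
    return { (sv - slope * su) / n, slope };
}

// Solving y = b0 + b1*x for x gives x = -b0/b1 + (1/b1)*y.
Line solve_for_x(const Line& y_on_x) {
    assert(y_on_x.slope != 0.0);
    return { -y_on_x.intercept / y_on_x.slope, 1.0 / y_on_x.slope };
}
```

For the made-up points (1,2), (2,4), (3,5), (4,4), (5,5), `fit_line(xs, ys)` gives b0 = 2.2, b1 = 0.6 and `fit_line(ys, xs)` gives b0' = -1, b1' = 1, whereas `solve_for_x` applied to the first fit gives roughly x = -3.67 + 1.67y. For this data, then, the reversed regression is not the algebraic inverse of the original one.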

HallsofIvy · May 28, 2009

It can be shown that the line such that the sum of the vertical distances from points to the line, the line such that the sum of the horizontal distances from points to the line, and the line such that the sum of distances perpendicular to the line are all the same line. That says that reversing x and y will give the same line.

junglebeast · Jun 8, 2009

HallsofIvy said:

It can be shown that the line such that the sum of the vertical distances from points to the line, the line such that the sum of the horizontal distances from points to the line, and the line such that the sum of distances perpendicular to the line are all the same line. That says that reversing x and y will give the same line.

I would be interested to see that proof...

When linear regression is used to find the line in slope-intercept form, this is not the case...as a glaring example, consider that a vertical line cannot be represented, whereas a horizontal one can. If your data set is more vertical than horizontal, you will get a much better fit by reversing the order of X and Y series.

I quickly wrote a program to randomly generate some data points and visually compare the line that minimizes Y error (yellow), the line that minimizes X error (purple), and the line that minimizes point-to-line distance (red). As you can see from this example, they are not always the same line.

In order to eliminate the possibility that these differences are simply due to rounding errors, I repeated the experiment using floating point precision, double precision, and 320-bits of floating point precision using GMP bignum. The results are the same in all cases, indicating that precision does not play a factor here.

http://img13.imageshack.us/img13/8336/regt.gif

Here's my source code:

Code:

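#include <cstdio>   // printf
#include <cstdlib>  // system
#include <vector>   // std::vector

// Non-standard headers used by the original program.
#include "linalg\linear_least_squares.h"
#include "vision\drawing.h"
#include "stat\rng.h"
#include "linalg\null_space.h"
#include "bignum\bigfloat.h"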

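// Least-squares fit of y = m*x + b, minimizing the squared error in Y.
template<typename Real>
void linear_regression( const std::vector<Real> &X, const std::vector<Real> &Y,
					   Real &m, Real &b)
{
	// Accumulate the sums that appear in the normal equations.
	Real sX=0, sY=0, sXY=0, sXX=0;
	for(unsigned i=0; i<X.size(); ++i)
	{
		Real x = X[i], y = Y[i];
		sX += x;
		sY += y;
		sXY += x*y;
		sXX += x*x;
	}
	Real n = X.size();

	// Closed-form solution for the slope and intercept.
	m = (sY*sX - n*sXY)/( sX*sX - n*sXX );	
	b = (sX*sXY - sY*sXX)/( sX*sX - n*sXX );
}

int main()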
{
	using namespace heinlib;
	using namespace cimgl;

	bigfloat::set_default_precision(300);

	typedef bigfloat Real;

	bigfloat rr;
	printf("precision = %d\n", rr.precision() );

	CImg<float> image(250, 250, 1, 3, 0);

	std::vector<Real> X, Y;

	int N = 10;
	for(unsigned i=0; i<N; ++i)
	{
		Real x = random<Real>::uniform(0, 250);
		Real y = random<Real>::uniform(0, 250);

		image.draw_circle( x, y, 3, color<float>::white() );

		X.push_back(x);
		Y.push_back(y);
	}

	Real m1, b1,  m2, b2;
	linear_regression(X, Y, m1, b1 );
	linear_regression(Y, X, m2, b2 );

	//flip second line: rewrite x = m2*y + b2 as y = (1/m2)*x - b2/m2
	b2 = -b2/m2;
	m2 = 1/m2;
	
	cimg_draw_line( image, m1, b1, color<float>::yellow() );
	cimg_draw_line( image, m2, b2, color<float>::purple() );

	//find the means of X and Y
	Real mX = 0, mY = 0;
	for(unsigned i=0; i<N; ++i)
	{
		Real x = X[i], y = Y[i];
		mX += x;  mY += y;
	}
	mX /= N;
	mY /= N;

	//find least squares line by distance to line..
	Real sXX=0, sYY=0, sXY=0;
	for(unsigned i=0; i<N; ++i)
	{
		Real x = X[i] - mX, 
			y = Y[i] - mY;
		sXX += x*x;
		sYY += y*y;
		sXY += x*y;
	}

	static_matrix<2,2,Real> A = { sXX, sXY,
		sXY, sYY };
	static_matrix<2,1,Real> Norm;

	null_space_SVD(A, Norm);
	
	//general form
	static_matrix<3,1,Real> line = { Norm[0], Norm[1], -( mX*Norm[0] + mY*Norm[1] ) };
	cimg_draw_line( image, line, color<float>::red() );

	CImgDisplay disp(image);
	system("pause");
}

EnumaElish · Jun 8, 2009

Assuming no singularity (vertical or horizontal) exists in the data, the standardized slope coefficient b/s.e.(b) as well as the goodness of fit statistic (R squared) will be identical between a vertical regression (Y = b0 + b1 X + u) and the corresponding horizontal regression (X = a0 + a1 Y + v) .
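A rough way to check this claim numerically is sketched below. The names `FitStats` and `regress` are hypothetical, and the standard error uses the usual textbook formula s.e.(b) = sqrt(SSE / ((n - 2) * Suu)), which assumes a simple regression with an intercept:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Slope, its t statistic (slope / s.e.), and R squared for a regression of v on u.
struct FitStats { double slope, t_slope, r_squared; };

FitStats regress(const std::vector<double>& u, const std::vector<double>& v) {
    assert(u.size() == v.size() && u.size() > 2);
    const double n = static_cast<double>(u.size());

    // Sample means.
    double mu = 0, mv = 0;
    for (std::size_t i = 0; i < u.size(); ++i) { mu += u[i]; mv += v[i]; }
    mu /= n;
    mv /= n;

    // Centered sums of squares and cross products.
    double suu = 0, svv = 0, suv = 0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        const double du = u[i] - mu, dv = v[i] - mv;
        suu += du * du;
        svv += dv * dv;
        suv += du * dv;
    }

    const double slope = suv / suu;
    const double sse = svv - slope * suv;             // residual sum of squares
    const double se = std::sqrt(sse / (n - 2) / suu); // standard error of the slope
    return { slope, slope / se, suv * suv / (suu * svv) };
}
```

With the made-up points (1,2), (2,4), (3,5), (4,4), (5,5), `regress(xs, ys)` has slope 0.6 and `regress(ys, xs)` has slope 1, yet both report R squared = 0.6 and a t statistic of about 2.12.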

junglebeast · Jun 9, 2009

EnumaElish said:

Assuming no singularity (vertical or horizontal) exists in the data, the standardized slope coefficient b/s.e.(b) as well as the goodness of fit statistic (R squared) will be identical between a vertical regression (Y = b0 + b1 X + u) and the corresponding horizontal regression (X = a0 + a1 Y + v) .

Well, you can say that...but you haven't given any formal proof or evidence of the claim, and it is contrary to the example I just showed, which I have made the source available for you to see.

You can observe the same effect using Excel's built in linear regression. The graphs are rotated and stretched, but notice that the lines go through different points in relation to each other.

http://img25.imageshack.us/img25/5005/reg2t.gif

The singularity is not present in either of the examples

EnumaElish · Jun 9, 2009

Can you provide either the standard errors (of the coefficients) or the t statistics?

junglebeast · Jun 9, 2009

EnumaElish said:

Can you provide either the standard errors (of the coefficients) or the t statistics?

Your question does not even make sense, as the coefficients are not random variables. The coefficients are mathematical solutions to the line equation for a fixed data set of points.

By showing that numerical precision was not responsible for their differences, this proves that the parameters of the recovered lines are indeed different (ie, different equations). At least, I cannot think of any other possible way of interpreting those results. Let me know if you can...

statdad · Jun 9, 2009

junglebeast said:

Your question does not even make sense, as the coefficients are not random variables. The coefficients are mathematical solutions to the line equation for a fixed data set of points.

By showing that numerical precision was not responsible for their differences, this proves that the parameters of the recovered lines are indeed different (ie, different equations). At least, I cannot think of any other possible way of interpreting those results. Let me know if you can...

The coefficients in a regression are statistics, so it certainly does make sense to talk about their standard errors.

Since [tex] R^2 [/tex] is simply the square of the correlation coefficient, that quantity will be the same whether you regress Y on x or X on y.

Sorry - hitting post too soon is the result of posting before morning coffee.

The slopes of Y on x and X on y won't be equal (unless you have an incredible stroke of luck), but the t-statistics in each case, used for testing
[tex] H\colon \beta = 0 [/tex], will be, since the test statistic for the slope can be written as a function of [tex] r^2 [/tex].
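
A sketch of why that holds, using the shorthand [tex]S_{xx}[/tex], [tex]S_{yy}[/tex], [tex]S_{xy}[/tex] (introduced here for convenience) and assuming simple regression with an intercept:

[tex]
\begin{align}
S_{xx} &= \sum (x_i - \bar x)^2, \quad S_{yy} = \sum (y_i - \bar y)^2, \quad S_{xy} = \sum (x_i - \bar x)(y_i - \bar y), \quad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \\
b_1 &= \frac{S_{xy}}{S_{xx}}, \qquad \mathrm{s.e.}(b_1) = \sqrt{\frac{SSE}{(n-2) S_{xx}}}, \qquad SSE = S_{yy}(1 - r^2) \\
t &= \frac{b_1}{\mathrm{s.e.}(b_1)} = \frac{S_{xy}}{S_{xx}} \sqrt{\frac{(n-2) S_{xx}}{S_{yy}(1 - r^2)}} = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}
\end{align}
[/tex]

The final expression does not change when x and y are swapped, so both regressions give the same t value even though the slopes differ.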

EnumaElish · Jun 9, 2009

(i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.

statdad · Jun 9, 2009

EnumaElish said:

(i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.

Er, I was agreeing with you earlier (if this post is aimed at me)

EnumaElish · Jun 9, 2009

No, I posted too soon. I was responding to junglebeast's comment "the coefficients are not random variables."

junglebeast · Jun 9, 2009

EnumaElish said:

(i) Y is random, (ii) b estimates are a function of Y, (iii) therefore estimated b's are random.

Initially, I generated X and Y randomly to make a fixed data set. Then I performed 9 tests on that fixed data set to get measurements of m and b. All of these m and b are comparable because they relate to the same data set.

If I were to generate X and Y and repeat the experiment multiple times, then yes, I could make m and b into random variables -- but this would be meaningless, because the "distribution" of m would have no mean and infinite variance, and that is not a distribution which the student t-test can be applied to in any meaningful way.

You claimed that all three equations were equivalent. I showed that applying all three equations gives very different results. The only thing that distinguishes an analytical solution from an empirical one is the precision of arithmetic. Demonstrating that increased precision does not change the results proves that the mathematical expressions in my program are not equivalent. This is why I made my source visible. If the source does compute linear regression properly, then this proves that flipping the order in regression is not mathematically equivalent.

Further, I think I can show algebraically that reversing the roles of X and Y does not give the same line. Let (m1, b1) be the line found by minimizing Y-error, and let (m2, b2) be the line found by minimizing X-error (after reversing the roles of X and Y),

[tex]
\begin{align}
y &= m1 x + b1\\
y &= m2 x + b2
\end{align}
[/tex]

By applying http://en.wikipedia.org/wiki/Linear_least_squares, we have

[tex]
\begin{align}
m1 &= \frac{\sum y \sum x - n \sum x y}{ (\sum x)^2 - n (\sum x^2)}\\
b1 &= \frac{ \sum x \sum x y - \sum y (\sum x^2)}{ (\sum x)^2 - n (\sum x^2)}
\end{align}
[/tex]

We can also directly calculate the equation after reversing the roles of X and Y, although this also flips the line, so let's refer to that line as (m2b, b2b):

[tex]
\begin{align}
m2b &= \frac{\sum y \sum x - n \sum x y}{ (\sum y)^2 - n (\sum y^2)}\\
b2b &= \frac{ \sum y \sum x y - \sum x (\sum y^2)}{ (\sum y)^2 - n (\sum y^2)}
\end{align}
[/tex]

Now we need to flip (m2b, b2b) into the same form as (m1,b1) for comparison. This rearrangement can be done by reversing x and y and putting back into slope-intercept form,

[tex]
\begin{align}
y &= \left(\frac{1}{m2b}\right)x + \left(-\frac{b2b}{m2b}\right) \\
&= m2 x + b2
\end{align}
[/tex]

Thus, looking just at the slope,

[tex]
m2 = \frac{ (\sum y)^2 - n (\sum y^2)}{\sum y \sum x - n \sum x y}
[/tex]

We can see that m1 is not equal to m2 -- so we do not obtain the same equation after reversing the roles of X and Y.
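
To make the comparison explicit, here is a short sketch. The abbreviations [tex]A[/tex], [tex]B_x[/tex], [tex]B_y[/tex] are introduced only for this step, and [tex]r[/tex] is the sample correlation coefficient:

[tex]
\begin{align}
A &= n \sum x y - \sum x \sum y, \qquad B_x = n \sum x^2 - (\sum x)^2, \qquad B_y = n \sum y^2 - (\sum y)^2 \\
m1 &= \frac{A}{B_x}, \qquad m2 = \frac{B_y}{A} \\
\frac{m1}{m2} &= \frac{A^2}{B_x B_y} = r^2
\end{align}
[/tex]

So m1 = m2 only when [tex]r^2 = 1[/tex], that is, when all the points lie exactly on one line. Otherwise [tex]|m1| < |m2|[/tex].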

sylas · Jun 9, 2009

Here's another way to see it. Consider three points, in an L shape, as follows:

To get a line that minimizes vertical distances, it will pass midway between the two points at the same x-coordinate, and through the other. To get a line that minimizes horizontal distances, it will pass midway between the two points with the same y-coordinate, and through the other. These lines are shown above.

Therefore the regression line is not in general the same as the inverse regression line.
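
A small sketch to check this numerically. The figure is not reproduced here, so the coordinates (0,0), (0,1), (1,0) are an assumed L shape, and `Line`, `fit_line` and `as_y_of_x` are hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A line written as v = intercept + slope * u.
struct Line { double intercept, slope; };

// Ordinary least squares of v on u (errors measured in the v direction).
Line fit_line(const std::vector<double>& u, const std::vector<double>& v) {
    assert(u.size() == v.size() && u.size() >= 2);
    const double n = static_cast<double>(u.size());
    double su = 0, sv = 0, suv = 0, suu = 0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        su += u[i];
        sv += v[i];
        suv += u[i] * v[i];
        suu += u[i] * u[i];
    }
    const double slope = (n * suv - su * sv) / (n * suu - su * su);
    return { (sv - slope * su) / n, slope };
}

// Rewrite x = c + d*y as y = -c/d + (1/d)*x so both lines share the same axes.
Line as_y_of_x(const Line& x_on_y) {
    assert(x_on_y.slope != 0.0);
    return { -x_on_y.intercept / x_on_y.slope, 1.0 / x_on_y.slope };
}
```

For the assumed points, `fit_line(xs, ys)` gives y = 0.5 - 0.5x, which passes through (0, 0.5) and (1, 0). Applying `as_y_of_x` to `fit_line(ys, xs)` gives y = 1 - 2x, which passes through (0.5, 0) and (0, 1).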

EnumaElish · Jun 9, 2009

"We can see that m1 is not equal to m2 -- so we do not obtain the same equation after reversing the roles of X and Y."

I see your point. I can't speak for HallsOfIvy, but the phrase "same equation" can be interpreted differently:

1. One might think from a statistical point of view what matters is not the slope estimate, but the standardized estimate of the slope (i.e. the t statistic of the slope parameter). This statistic is direction-free (vertical vs. horizontal).

2. If one can derive the parameters of the horizontal equation from the parameters of the vertical equation, then in an informational sense the two sets of estimates can be thought identical.

sylas · Jun 9, 2009

EnumaElish said:

2. If one can derive the parameters of the horizontal equation from the parameters of the vertical equation, then in an informational sense the two sets of estimates can be thought identical.

You can't. Consider the example I've given above. Now take also three points lying on the regression line, but not along the inverse regression line. You now have two sets of points, which give the same regression line, but a different inverse regression line.
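
A worked instance of that construction, with made-up coordinates (the L-shaped points are assumed to be (0,0), (0,1), (1,0), and the second set is three points chosen on the first set's regression line):

[tex]
\begin{align}
\text{Set A: } (0,0),\ (0,1),\ (1,0): \quad & \hat y = \tfrac{1}{2} - \tfrac{1}{2} x, \qquad \hat x = \tfrac{1}{2} - \tfrac{1}{2} y \\
\text{Set B: } (0,\tfrac{1}{2}),\ (1,0),\ (2,-\tfrac{1}{2}): \quad & \hat y = \tfrac{1}{2} - \tfrac{1}{2} x, \qquad \hat x = 1 - 2 y
\end{align}
[/tex]

Both sets give the same coefficients for the regression of y on x, but different coefficients for the regression of x on y, so the second pair cannot be a function of the first pair alone.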

It really is the case that the regression line and the inverse regression line are different entirely.

But try this exercise. Prove that the slope of the regression line and the inverse regression line cannot have opposite signs.

junglebeast · Jun 9, 2009

EnumaElish said:

1. One might think from a statistical point of view what matters is not the slope estimate, but the standardized estimate of the slope (i.e. the t statistic of the slope parameter). This statistic is direction-free (vertical vs. horizontal).

As I keep saying, it's not valid to look at the t-statistic of the slope parameter across different data sets. It has no meaning whatsoever.

EnumaElish · Jun 9, 2009

"You can't."

You answered the OP's question!

EnumaElish · Jun 9, 2009

junglebeast said:

As I keep saying, it's not valid to look at the t-statistic of the slope parameter across different data sets. It has no meaning whatsoever.

Thank you for sharing your point of view.

statdad · Jun 9, 2009

junglebeast said:

As I keep saying, it's not valid to look at the t-statistic of the slope parameter across different data sets. It has no meaning whatsoever.

I'm not sure what you mean here - we look at t-statistics from one regression problem to another all the time.

"Prove that the slope of the regression line and the inverse regression line cannot have opposite signs"

Since the sign of the slope is the same as the sign of the correlation, this isn't surprising.
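
In symbols, as a brief sketch with [tex]S_{xx} = \sum (x_i - \bar x)^2[/tex], [tex]S_{yy} = \sum (y_i - \bar y)^2[/tex] and [tex]S_{xy} = \sum (x_i - \bar x)(y_i - \bar y)[/tex] (shorthand introduced here):

[tex]
b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_1' = \frac{S_{xy}}{S_{yy}}, \qquad b_1 b_1' = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2 \ge 0
[/tex]

Because [tex]S_{xx}[/tex] and [tex]S_{yy}[/tex] are positive whenever the data do not all share one x value or one y value, both slopes take the sign of [tex]S_{xy}[/tex].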

junglebeast · Jun 9, 2009

statdad said:

The coefficients in a regression are statistics, so it certainly does make sense to talk about their standard errors.

Since [tex] R^2 [/tex] is simply the square of the correlation coefficient, that quantity will be the same whether you regress Y on x or X on y.

You say that as if R^2 can be computed from the regression -- it can't be. Computing R^2 and computing the regression are two fundamentally different things. R^2 will be the same if you reverse X and Y because they are treated symmetrically in the equation for R^2.

The slopes of Y on x and X on y won't be equal (unless you have an incredible stroke of luck), but the t-statistics in each case, used for testing
[tex] H\colon \beta = 0 [/tex], will be, since the test statistic for the slope can be written as a function of [tex] r^2 [/tex].

I think I finally figured out what you guys are talking about -- you must be referring to the "Slope of a regression line" from this article: http://en.wikipedia.org/wiki/Student's_t-test

Well, this test is only designed to tell you if the original data set can be described as

[tex]
Y_i = m X_i + b + e_i
[/tex]

where [tex]e_i[/tex] has expected value 0 and finite variance. Well, if you have a data set that follows this model, then you reverse the role of X and Y it can no longer follow that model, because the expected value of the residuals will no longer be zero.

statdad · Jun 9, 2009

"You say that as if R^2 can be computed from the regression -- it can't be"
Patently false. [tex] R^2 [/tex] can be viewed as the square of the correlation coefficient OR computed in terms of sums of squares from the regression ANOVA. I did say that the value of [tex] R^2 [/tex] would be the same whether you regress Y on x or X on y.

"Well, if you have a data set that follows this model, then you reverse the role of X and Y it can no longer follow that model, because the expected value of the residuals will no longer be zero."

I'm not sure what you mean here. The classical regression model is

[tex]
Y = \alpha + \beta x + \varepsilon
[/tex]

Strictly the errors must be normal in distribution, but this can be relaxed somewhat.

If you decide to fit

[tex]
X = \alpha_2 + \beta_2 y + \varepsilon_2
[/tex]

then, to make any inferences, you are implicitly assuming that the errors have the required distributional properties.

If you are speaking of the sample residuals, then in both cases you will have [tex] \sum \hat e = 0 [/tex], since in linear regression the sum of the residuals is always zero when the intercept term is used.
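
The reason, sketched for the first model (the same argument applies to the second with the roles of x and y exchanged): least squares minimizes [tex] Q(a, b) = \sum (y_i - a - b x_i)^2 [/tex], and setting the derivative with respect to the intercept to zero gives

[tex]
\frac{\partial Q}{\partial a} = -2 \sum (y_i - a - b x_i) = -2 \sum \hat e_i = 0
[/tex]

This equation exists only because the intercept [tex] a [/tex] is a free parameter. Without an intercept term there is no such condition, and the residuals need not sum to zero.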

If your post is to imply that in correlation there is no need to distinguish between dependent and independent variables, because [tex] r [/tex] and [tex] R^2 [/tex] are the same, while in regression you must have a firm idea on which is which before doing an analysis, you are correct.

junglebeast · Jun 9, 2009

statdad said:

"You say that as if R^2 can be computed from the regression -- it can't be"
Patently false. [tex] R^2 [/tex] can be viewed as the square of the correlation coefficient OR computed in terms of sums of squares from the regression ANOVA. I did say that the value of [tex] R^2 [/tex] would be the same whether you regress Y on x or X on y.

It's true that you can compute R^2 in that way also, but that's not computing it purely from the regression, because you need an additional measurement of Var(Y), which you cannot obtain from the regression parameters (m, b, and SSE). However, this alternate computation does use the regression error, so I see what you meant by your original statement now...so let's chalk that up to a difference of semantics.

The point I was attempting to prove to EnumaElish was that you get a fundamentally different line by exchanging the roles of X and Y in the regression. He didn't trust my results, and asked to see a t-test.

If you have 2 RV's, you can use the t-test to test the hypothesis that the two RV's are part of the same distribution. It was my understanding that EnumaElish wanted me to do a test like this to see if the parameters of the two lines were both of the same distribution; and hence, essentially the same line.

This, I know, cannot possibly be tested in such a way, because the line parameters are not RV's that have a unique distribution. The points that I use to generate the data set do have a specific distribution, but the resulting parameters m and b do not... and it would not be meaningful to generate random distributions, then generate a distribution of "m" and "b"-values using both methods, and then use the t-test to check whether or not the "m"-distributions were equal and the "b"-distributions were equal. This would not be valid because none of those 4 distributions have a finite mean or variance. They are not Gaussian and the t-test would be useless at comparing them.

This is greatly different from the use of the t-test which I think you are referring to, which tests a very different hypothesis: "are the m and b values correct, assuming that the data follows a linear + gaussian noise model?" Perhaps this is what EnumaElish was asking. If it was, then that test would have failed anyway, because the data was specifically generated to be as far from that model as possible -- I just picked X and Y to be random points in the unit cube!

I'm not sure what you mean here. The classical regression model is...

I may not have phrased it properly before. Let me try again. Say you have data of the form:

[tex]
Y = m X + b + \varepsilon
[/tex]

Where [tex]\varepsilon[/tex] is drawn from [tex]N(\mu, \sigma)[/tex].

Well, if you reverse the role of X and Y, it will not be the case that

[tex]
X = m_2 Y + b_2 + \varepsilon_2
[/tex]

Where [tex]\varepsilon_2[/tex] is drawn from [tex]N(\mu_2, \sigma_2)[/tex]. In other words, the errors which were normally distributed with regard to the (X,Y) line are not normally distributed with regard to the (Y,X) line.

statdad · Jun 9, 2009

junglebeast said:

It's true that you can compute R^2 in that way also, but that's not computing it purely from the regression, because you need an additional measurement of Var(Y), which you cannot obtain from the regression parameters (m, b, and SSE). However, this alternate computation does use the regression error, so I see what you meant by your original statement now...so let's chalk that up to a difference of semantics.

junglebeast said:

The point I was attempting to prove to EnumaElish was that you get a fundamentally different line by exchanging the roles of X and Y in the regression. He didn't trust my results, and asked to see a t-test.

I always agreed with this.

junglebeast said:

If you have 2 RV's, you can use the t-test to test the hypothesis that the two RV's are part of the same distribution. It was my understanding that EnumaElish wanted me to do a test like this to see if the parameters of the two lines were both of the same distribution; and hence, essentially the same line.

In a limited sense: the classical t-test is used to assess whether two sets of data come from normal populations having the same mean. Failure to reject H0 leaves us with a null hypothesis of equality of means: if we are willing to assume equality of variances, then the two distributions could be seen as equal.

junglebeast said:

This, I know, cannot possibly be tested in such a way, because the line parameters are not RV's that have a unique distribution. The points that I use to generate the data set do have a specific distribution, but the resulting parameters m and b do not... and it would not be meaningful to generate random distributions, then generate a distribution of "m" and "b"-values using both methods, and then use the t-test to check whether or not the "m"-distributions were equal and the "b"-distributions were equal. This would not be valid because none of those 4 distributions have a finite mean or variance. They are not Gaussian and the t-test would be useless at comparing them.

junglebeast said:

I agree that you can't use a t-test to compare slopes from problems where you switch the roles of X and Y. I'm not sure why you say the "parameters m and b" - they aren't parameters, they are statistics: even though you generated the data. You seem to acknowledge this when you mention their distributions: parameters don't have distributions, statistics do

junglebeast said:

This is greatly different from the use of the t-test which I think you are referring to, which tests a very different hypothesis: "are the m and b values correct, assuming that the data follows a linear + gaussian noise model?" Perhaps this is what EnumaElish was asking. If so, then I mis-interpreted him.

My comment about the two t-statistics was simply this:
* Whether you regress Y on x or X on y, the correlation coefficient is the same
* Since the correlation is the same, [tex] R^2 [/tex] is the same
* The t-statistic in any linear regression problem can be expressed in terms of [tex] R^2 [/tex]: since the two different scenarios have the same [tex] R^2 [/tex], the values of the test statistic will always be equal - and I meant to imply nothing more than the equality of the values, no specific utility. If I failed at that I apologize

junglebeast said:

I may not have phrased it properly before. Let me try again. Say you have data of the form:

[tex]
Y = m X + b + \varepsilon
[/tex]

Where [tex]\varepsilon[/tex] is drawn from [tex]N(\mu, \sigma)[/tex].

Well, if you reverse the role of X and Y, it will not be the case that

[tex]
X = m_2 Y + b_2 + \varepsilon_2
[/tex]

Where [tex]\varepsilon_2[/tex] is drawn from [tex]N(\mu_2, \sigma_2)[/tex]. In other words, the errors which were normally distributed with regard to the (X,Y) line are not normally distributed with regard to the (Y,X) line.

This is, I think, another case of semantics on our parts. My point was simply that if someone tries the model [tex] X = \alpha + \beta y + \varepsilon [/tex], then he or she is implicitly assuming that the errors have the required distributional form. It won't be the same [tex] N(0, \sigma^2) [/tex] distribution as for the traditional model, but that is the implicit assumption, rightly or wrongly.

junglebeast · Jun 9, 2009

statdad said:

I'm not sure why you say the "parameters m and b" - they aren't parameters, they are statistics: even though you generated the data. You seem to acknowledge this when you mention their distributions: parameters don't have distributions, statistics do.

You accidentally put your response as a quote from me. Confused me for a bit, but now let me reply. I'm in agreement with you on everything else so I'm only replying to this little bit:

m and b are both:

1) Input parameters of a line (constants)
2) Solution results of a linear regression (dependent variables)

If we generate multiple data sets where the data sets come from particular distributions, then m and b can also be viewed as statistics which have distributions.

We now have three different contexts in which m and b have meaning. The first two contexts are functional, and the last is statistical.

I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized.

Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way because:

a) The discussion we are having pertains to the application of linear regression in a generalized sense, NOT to the application of linear regression under the restricted case where the data follows the model of [tex]Y = mX + b + \varepsilon[/tex]. Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where [tex]Y = mX + b + \varepsilon[/tex], then you still have an infinite range of parameters which cannot be sampled!

b) Even if you choose a highly restricted domain such as [tex]Y = mX + b + \varepsilon[/tex] where X,Y,m,b,eps are all sampled in the range [-1,1] ...then even in that case, you will find that the distributions of m and b are not Gaussian, and therefore, not applicable to use the t-test on.

statdad · Jun 9, 2009

"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

"Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way ..."
The distributions of the slope and intercept are conceptualized just as the distributions of sample means, standard deviations, etc. In textbooks all distributions in these situations are normal or t, in real life not so much, but the idea is the same.

"Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where [tex] Y = mx + b + \epsilon [/tex] , then you still have an infinite range of parameters which cannot be sampled!"

I'm not even sure what you mean here - it makes no sense.

junglebeast · Jun 9, 2009

statdad said:

"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

I'm sorry to continue arguing this, but what you are saying is simply not true. The definition of a statistic is relative to the experiment. If the sampling from an initial distribution is part of the experiment, then the results are statistics (even for a sample size of 1). But if you are making a measurement on a fixed set of data, the measurement is not a statistic -- it's simply a measurement.

If a colleague sends you a series of numbers with no explanation and asks for the linear regression, you are not going to tell him: "Sir, the sample mean of the slope based on 1 sample is 5". No, you're going to just tell him, "The data set has a slope of 5."

Now, it may be the case that this data was randomly generated by your colleague; in which case, he will record your measurement and conclude that the sample mean of the slope based on 1 sample is 5. However, from your perspective, this was an isolated experiment and the slope was not a statistic. Or, it could be the case that the data was not randomly generated, but is instead a permutation of the digits of the constant Pi.

Now, by your logic, perspective is irrelevant, and we should view every number in the world as a statistic. But that doesn't make sense. If you have 1 daughter, you would not say "The sample mean of the number of daughters I have is 1," because from your perspective, this number was not drawn randomly.

In my case, it makes no difference where the original data set came from. It so happens that I did generate the data randomly, but since I was not interested in measuring statistics, I chose the perspective that it was a single fixed data set and hence I do not refer to the results as a statistic.

Just like when sylas drew an L-configuration of 3 points: he referred to the line passing through them as having fixed parameters; he didn't say "The sample mean of the slope of the line is...X".

In short, we can choose to make statistical measurements whenever we like, and when we do, we refer to those statistical measurements as statistics only relative to the specific statistical experiment.

I'm not even sure what you mean here - it makes no sense.

What part confuses you?

statdad · Jun 9, 2009

""Sir, the sample mean of the slope based on 1 sample is 5""

Since we never speak of the sample mean of a slope, this statement is meaningless. When you calculate a regression, if the estimates are the only item of interest, you are using regression in a very narrow, reckless, rather foolish and dangerous manner. You should always be interested in the accuracy of your (estimated) slope - whether you do a formal test, or confidence interval, whatever.

"However, from your perspective, this was an isolated experiment and the slope was not a statistic"
No, the slope here is a statistic - generated from data.

"...we should view every number in the world as a statistic. "

I never said, nor implied, any such thing.

EnumaElish · Jun 9, 2009

junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.

junglebeast · Jun 9, 2009

EnumaElish said:

junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.

Thanks for clarifying your intent.

