Linear Regression: reversing the roles of X and Y

Discussion Overview

The discussion revolves around the implications of reversing the roles of X and Y in simple linear regression. Participants explore whether the coefficients obtained from regressing Y on X can be derived from those obtained by regressing X on Y, and the mathematical relationships between these coefficients.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants express confusion about whether regressing X on Y will yield the same coefficients as regressing Y on X, or if it is ever possible to obtain the same coefficients.
  • One participant suggests that there is no general mathematical relationship linking the coefficients from the two regressions, although specific conditions might allow for equality.
  • Another participant notes that regressing X on Y does not make sense if X is treated as a fixed variable, as the dependent variable in regression should be random.
  • Some participants mention the concept of "inverse regression" and propose experimenting with data sets to observe the relationships between the coefficients.
  • It is noted that the lines obtained from minimizing vertical distances, horizontal distances, and perpendicular distances are not necessarily the same when reversing X and Y.
  • One participant provides a programmatic approach to generate data points and visually compare the regression lines, indicating that the lines differ based on the regression method used.
  • Another participant claims that the standardized slope coefficients and goodness of fit statistics will be identical between vertical and horizontal regressions, although this claim is challenged by others who reference differing results from their own examples.
  • Questions arise about the standard errors of the coefficients and the interpretation of these statistics in the context of the discussion.
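The relationships raised in these points can be checked numerically. The following is a minimal sketch, assuming ordinary least squares with an intercept and synthetic data; the helper name `ols_slope_intercept`, the seed, and the true line y = 2x + 1 are illustrative choices, not taken from the thread. Under these assumptions the two slopes are not reciprocals of each other, their product equals the squared correlation, and the standardized slope equals r in both directions.

```python
import numpy as np

def ols_slope_intercept(u, v):
    """Least-squares fit of v = slope * u + intercept (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du = u - u.mean()
    dv = v - v.mean()
    slope = np.dot(du, dv) / np.dot(du, du)
    intercept = v.mean() - slope * u.mean()
    return slope, intercept

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=200)

b_yx, a_yx = ols_slope_intercept(x, y)  # regress Y on X (vertical distances)
b_xy, a_xy = ols_slope_intercept(y, x)  # regress X on Y (horizontal distances)
r = np.corrcoef(x, y)[0, 1]

# The X-on-Y line, redrawn in the (x, y) plane, is y = x / b_xy - a_xy / b_xy
slope_reverse_in_xy_plane = 1.0 / b_xy

# Standardized slopes: rescale each slope by the ratio of standard deviations
std_slope_forward = b_yx * x.std() / y.std()
std_slope_reverse = b_xy * y.std() / x.std()

print("slope of Y on X:", b_yx)
print("slope of X on Y, drawn in the (x, y) plane:", slope_reverse_in_xy_plane)
print("product of slopes:", b_yx * b_xy, "r squared:", r**2)
```

The two lines coincide only when |r| = 1; otherwise the X-on-Y line is the steeper one when both are drawn in the (x, y) plane.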

Areas of Agreement / Disagreement

Participants do not reach a consensus on whether the coefficients from the two regression approaches can be equated. Multiple competing views remain regarding the mathematical relationships and implications of reversing X and Y.

Contextual Notes

Some limitations include the dependence on the nature of the data (e.g., fixed vs. random variables) and the specific conditions under which the coefficients might be equal. The discussion also highlights the complexity of regression analysis and the potential for differing interpretations of results.

  • #31
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample, we think of these just as we do every statistic: as specific realizations of a random quantity.

"Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way ..."
The distributions of the slope and intercept are conceptualized just as the distributions of sample means, standard deviations, etc. In textbooks all distributions in these situations are normal or t; in real life not so much, but the idea is the same.
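The idea of a sampling distribution for the slope can be illustrated with a small simulation. This is only a sketch under the textbook assumptions (fixed x values, independent normal errors with constant variance); the parameter values and variable names are made up for the example. Each simulated data set yields one slope, and the spread of those slopes is compared with the textbook standard deviation sigma / sqrt(Sxx).

```python
import numpy as np

# Assumed model for illustration: y = m * x + b + eps, eps ~ N(0, sigma^2)
m_true, b_true, sigma = 1.5, -2.0, 2.0
x = np.linspace(0.0, 10.0, 30)  # fixed design points
dx = x - x.mean()
s_xx = np.dot(dx, dx)

rng = np.random.default_rng(42)
n_datasets = 5000

# One row per simulated data set, all sharing the same x values
y = m_true * x + b_true + rng.normal(scale=sigma, size=(n_datasets, x.size))

# Least-squares slope of each data set: sum(dx * y) / Sxx (dx sums to zero)
slopes = y @ dx / s_xx

empirical_sd = slopes.std(ddof=1)
theoretical_sd = sigma / np.sqrt(s_xx)

print("mean of simulated slopes:", slopes.mean())
print("empirical SD of slopes:", empirical_sd)
print("theoretical SD of slopes:", theoretical_sd)
```

A slope computed from a single data set corresponds to one draw from this distribution, which is the sense in which it is called a realization of a random quantity.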

"Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where Y = mx + b + \epsilon , then you still have an infinite range of parameters which cannot be sampled!"

I'm not even sure what you mean here - it makes no sense.
 
  • #32
statdad said:
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample, we think of these just as we do every statistic: as specific realizations of a random quantity.

I'm sorry to continue arguing this, but what you are saying is simply not true. The definition of a statistic is relative to the experiment. If the sampling from an initial distribution is part of the experiment, then the results are statistics (even for a sample size of 1). But if you are making a measurement on a fixed set of data, the measurement is not a statistic -- it's simply a measurement.

If a colleague sends you a series of numbers with no explanation and asks for the linear regression, you are not going to tell him: "Sir, the sample mean of the slope based on 1 sample is 5". No, you're going to just tell him, "The data set has a slope of 5."

Now, it may be the case that this data was randomly generated by your colleague; in which case, he will record your measurement and conclude that the sample mean of the slope based on 1 sample is 5. However, from your perspective, this was an isolated experiment and the slope was not a statistic. Or, it could be the case that the data was not randomly generated, but is instead a permutation of the digits of the constant Pi.

Now, by your logic, perspective is irrelevant, and we should view every number in the world as a statistic. But that doesn't make sense. If you have 1 daughter, you would not say "The sample mean of the number of daughters I have is 1," because from your perspective, this number was not drawn randomly.

In my case, it makes no difference where the original data set came from. It so happens that I did generate the data randomly, but since I was not interested in measuring statistics, I chose the perspective that it was a single fixed data set and hence I do not refer to the results as a statistic.

Just as when sylas drew an L-configuration of 3 points, he referred to the line passing through them as having fixed parameters; he didn't say "The sample mean of the slope of the line is...X".

In short, we can choose to make statistical measurements whenever we like, and when we do, we refer to those statistical measurements as statistics only relative to the specific statistical experiment.

statdad said:
I'm not even sure what you mean here - it makes no sense.

What part confuses you?
 
  • #33
"Sir, the sample mean of the slope based on 1 sample is 5"

Since we never speak of the sample mean of a slope, this statement is meaningless. When you calculate a regression, if the estimates are the only item of interest, you are using regression in a very narrow, reckless, rather foolish and dangerous manner. You should always be interested in the accuracy of your (estimated) slope - whether you do a formal test, or confidence interval, whatever.

"However, from your perspective, this was an isolated experiment and the slope was not a statistic"
No, the slope here is a statistic - generated from data.

"...we should view every number in the world as a statistic."

I never said, nor implied, any such thing.
 
  • #34
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.
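The check described here can be sketched as follows, assuming simple least squares with an intercept and the usual slope standard error sqrt(SSE / ((n - 2) Sxx)). The function name `slope_t_statistic` and the synthetic data are illustrative assumptions, not junglebeast's actual code or data. Under these assumptions both regressions give the same slope t statistic, which also equals r * sqrt(n - 2) / sqrt(1 - r^2).

```python
import numpy as np

def slope_t_statistic(u, v):
    """t statistic of the slope in the least-squares fit v = slope * u + intercept."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    n = u.size
    du = u - u.mean()
    s_uu = np.dot(du, du)
    slope = np.dot(du, v - v.mean()) / s_uu
    intercept = v.mean() - slope * u.mean()
    residuals = v - (slope * u + intercept)
    residual_variance = np.dot(residuals, residuals) / (n - 2)
    standard_error = np.sqrt(residual_variance / s_uu)
    return slope / standard_error

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 0.8 * x + 3.0 + rng.normal(scale=2.0, size=50)

t_forward = slope_t_statistic(x, y)  # Y regressed on X
t_reverse = slope_t_statistic(y, x)  # X regressed on Y

# Closed form in terms of the sample correlation
r = np.corrcoef(x, y)[0, 1]
t_from_r = r * np.sqrt(x.size - 2) / np.sqrt(1.0 - r**2)

print("t (Y on X):", t_forward)
print("t (X on Y):", t_reverse)
print("t from r:  ", t_from_r)
```

The slopes and their standard errors differ between the two fits; only their ratio is shared, because it depends on the data through r and n alone.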
 
  • #35
EnumaElish said:
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.

Thanks for clarifying your intent.
 
