Linear Regression: reversing the roles of X and Y

Reversing the roles of X and Y in linear regression does not yield the same fitted line, because the two regressions minimize different criteria. Regressing Y on X minimizes the vertical distances from the points to the line, while regressing X on Y minimizes the horizontal distances, so the parameter estimates differ. The discussion highlights that, in general, the coefficients of the reverse regression cannot be obtained by simply inverting the forward-regression equation. The concept of "inverse regression" is also mentioned, with the clarification that the two sets of coefficients are typically not equal. The conversation emphasizes the importance of understanding the context and assumptions of regression analysis when interpreting results.
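
To make the vertical-versus-horizontal point concrete, here is a minimal sketch using only the Python standard library. The data, the seed, and the helper name `ols` are made up for illustration and are not taken from the thread. It fits Y on X, fits X on Y, and compares the reverse fit with what you would get by algebraically inverting the forward line.

```python
import random

def ols(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical noisy data around the line y = 2x + 1 (assumed, illustrative only).
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 * a + 1.0 + rng.gauss(0, 3.0) for a in x]

m_yx, b_yx = ols(x, y)   # Y on X: minimizes vertical residuals
m_xy, b_xy = ols(y, x)   # X on Y: minimizes horizontal residuals

# Algebraically inverting the forward line y = m*x + b gives x = y/m - b/m.
inv_slope = 1.0 / m_yx
inv_intercept = -b_yx / m_yx

print("forward (Y on X):  ", m_yx, b_yx)
print("reverse (X on Y):  ", m_xy, b_xy)
print("inverted forward:  ", inv_slope, inv_intercept)
```

With noisy data the reverse slope comes out smaller than the reciprocal of the forward slope; the two would only coincide if the points lay exactly on a line.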
  • #31
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

"Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way ..."
The distributions of the slope and intercept are conceptualized just as the distributions of sample means, standard deviations, etc. In textbooks, all distributions in these situations are normal or t; in real life, not so much, but the idea is the same.
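
A rough simulation of that idea, as a sketch under stated assumptions: the "population" values `TRUE_M`, `TRUE_B` and `SIGMA`, the fixed design points, and the normal errors are all hypothetical choices, not anything from the thread. It generates many data sets from the same model, computes the least-squares slope for each, and compares the spread of those slopes with the textbook value sigma / sqrt(Sxx).

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope for predicting y from x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
TRUE_M, TRUE_B, SIGMA = 2.0, 1.0, 3.0   # hypothetical population values
x = [float(i) for i in range(20)]        # fixed design points (assumed)

def one_dataset():
    """One realization of y = m*x + b + normal noise at the fixed x values."""
    return [TRUE_M * a + TRUE_B + rng.gauss(0, SIGMA) for a in x]

# Each simulated data set yields one realization of the slope statistic.
slopes = [ols_slope(x, one_dataset()) for _ in range(5000)]

mx = sum(x) / len(x)
sxx = sum((a - mx) ** 2 for a in x)
theory_sd = SIGMA / sxx ** 0.5           # textbook SD of the slope estimator

print("mean of slopes:", statistics.mean(slopes))
print("empirical SD:  ", statistics.stdev(slopes))
print("theoretical SD:", theory_sd)
```

Any single fitted slope is one draw from the distribution that `slopes` approximates, which is the sense in which it is treated as a realization of a random quantity.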

"Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where Y = mx + b + \epsilon , then you still have an infinite range of parameters which cannot be sampled!"

I'm not even sure what you mean here - it makes no sense.
 
  • #32
statdad said:
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

I'm sorry to continue arguing this, but what you are saying is simply not true. The definition of a statistic is relative to the experiment. If the sampling from an initial distribution is part of the experiment, then the results are statistics (even for a sample size of 1). But if you are making a measurement on a fixed set of data, the measurement is not a statistic -- it's simply a measurement.

If a colleague sends you a series of numbers with no explanation and asks for the linear regression, you are not going to tell him: "Sir, the sample mean of the slope based on 1 sample is 5". No, you're going to just tell him, "The data set has a slope of 5."

Now, it may be the case that this data was randomly generated by your colleague; in which case, he will record your measurement and conclude that the sample mean of the slope based on 1 sample is 5. However, from your perspective, this was an isolated experiment and the slope was not a statistic. Or, it could be the case that the data was not randomly generated, but is instead a permutation of the digits of the constant Pi.

Now, by your logic, perspective is irrelevant, and we should view every number in the world as a statistic. But that doesn't make sense. If you have 1 daughter, you would not say "The sample mean of the number of daughters I have is 1," because from your perspective, this number was not drawn randomly.

In my case, it makes no difference where the original data set came from. It so happens that I did generate the data randomly, but since I was not interested in measuring statistics, I chose the perspective that it was a single fixed data set and hence I do not refer to the results as a statistic.

Just like when sylas drew an L-configuration of 3 points: he referred to the line passing through them as having fixed parameters. He didn't say "The sample mean of the slope of the line is...X"

In short, we can choose to make statistical measurements whenever we like, and when we do, we refer to those statistical measurements as statistics only relative to the specific statistical experiment.

statdad said:
I'm not even sure what you mean here - it makes no sense.

What part confuses you?
 
  • #33
""Sir, the sample mean of the slope based on 1 sample is 5""

Since we never speak of the sample mean of a slope, this statement is meaningless. When you calculate a regression, if the estimates are the only item of interest, you are using regression in a very narrow, reckless, rather foolish and dangerous manner. You should always be interested in the accuracy of your (estimated) slope - whether you do a formal test, or confidence interval, whatever.
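
As a sketch of what reporting the accuracy of a slope can look like (illustrative only): the data below are made up with a true slope of 5, the helper name `slope_with_se` is hypothetical, and the interval uses a normal quantile as a large-sample stand-in for the t quantile, since the standard library has no t distribution.

```python
import math
import random
from statistics import NormalDist

def slope_with_se(x, y):
    """Least-squares slope of y on x and its estimated standard error."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

# Hypothetical data: true slope 5, no intercept, normal noise (assumed).
rng = random.Random(2)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [5.0 * a + rng.gauss(0, 4.0) for a in x]

slope, se = slope_with_se(x, y)
z = NormalDist().inv_cdf(0.975)   # about 1.96; approximates the t quantile for large n
low, high = slope - z * se, slope + z * se
print(f"slope = {slope:.3f}, SE = {se:.3f}, approx 95% CI = ({low:.3f}, {high:.3f})")
```

For small samples the t quantile with n - 2 degrees of freedom would be the appropriate multiplier instead of `z`.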

"However, from your perspective, this was an isolated experiment and the slope was not a statistic"
No, the slope here is a statistic - generated from data.

"...we should view every number in the world as a statistic. "

I never said, nor implied, any such thing.
 
  • #34
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.
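
The check being described can be sketched as follows, using made-up data and a hypothetical helper `slope_t_stat`. In simple linear regression with an intercept, the slope's t statistic works out to r * sqrt(n - 2) / sqrt(1 - r^2), which is symmetric in X and Y, so the forward and reverse regressions should agree on it even though their slopes differ.

```python
import math
import random

def slope_t_stat(x, y):
    """t statistic (slope / standard error) for the regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

# Hypothetical data, illustrative only.
rng = random.Random(3)
x = [rng.uniform(0, 10) for _ in range(50)]
y = [2.0 * a + 1.0 + rng.gauss(0, 3.0) for a in x]

t_forward = slope_t_stat(x, y)   # Y regressed on X
t_reverse = slope_t_stat(y, x)   # X regressed on Y
print("t (Y on X):", t_forward)
print("t (X on Y):", t_reverse)
```

The two printed values should match up to floating-point rounding.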
 
  • #35
EnumaElish said:
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient between the two regressions.

Thanks for clarifying your intent.
 
