Linear Regression: reversing the roles of X and Y

SUMMARY

This discussion centers on the mathematical implications of reversing the roles of X and Y in simple linear regression. The simple linear regression model is Y = β0 + β1 * X + ε, where the intercept β0 and slope β1 are estimated by least squares. Participants explore whether the coefficients obtained from regressing X on Y can be derived directly from those obtained by regressing Y on X. The consensus is that, while the two fitted lines can look similar when the correlation is strong, the coefficients of one regression are not in general the algebraic inverse of the other's: regressing Y on X minimizes error in Y, whereas regressing X on Y minimizes error in X. Part of the thread also debates whether the fitted slope and intercept should be treated as statistics, that is, as realizations of random quantities.
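As a rough illustration of the point in the summary, the sketch below fits both regressions to one simulated data set and compares the slopes. The data-generating values (intercept 1, slope 2, unit-variance noise, seed 0) and the helper names `ols` and `r_squared` are assumptions made for this example, not anything taken from the thread.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(x, y):
    """Squared sample correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: Y = 1 + 2*X + noise (values chosen only for illustration).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

a_yx, b_yx = ols(x, y)  # forward regression: Y on X
a_xy, b_xy = ols(y, x)  # reverse regression: X on Y
r2 = r_squared(x, y)

# The product of the two slopes equals r^2 ...
print("b_yx * b_xy =", b_yx * b_xy, " r^2 =", r2)
# ... so inverting the forward slope recovers the reverse slope only if r^2 == 1.
print("1 / b_yx =", 1 / b_yx, " b_xy =", b_xy)
```

The two slopes are linked through the identity b_yx * b_xy = r², so the reverse slope equals 1 / b_yx only when the points lie exactly on a line.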

PREREQUISITES
  • Understanding of simple linear regression and its mathematical formulation
  • Familiarity with least-squares estimation techniques
  • Knowledge of statistical concepts such as R-squared and correlation coefficients
  • Basic programming skills for implementing regression algorithms
NEXT STEPS
  • Study the mathematical foundations of inverse regression techniques
  • Learn about the implications of variable randomness in regression analysis
  • Explore statistical software tools for performing linear regression, such as R or Python's scikit-learn
  • Investigate the differences between minimizing Y-error and minimizing X-error in regression contexts
USEFUL FOR

Data scientists, statisticians, and researchers interested in regression analysis and its mathematical properties will benefit from this discussion.

  • #31
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample, we think of them just as we do every statistic: as specific realizations of a random quantity.

"Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way ..."
The distributions of the slope and intercept are conceptualized just as the distributions of sample means, standard deviations, etc. In textbooks all distributions in these situations are normal or t; in real life, not so much, but the idea is the same.
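To make the idea of a slope distribution concrete, here is a minimal simulation sketch. It assumes a fixed design x = 0, 1, ..., 29, a true line with m = 2 and b = 1, and normal noise with standard deviation 1; all of these values, and the helper name `ols_slope`, are illustrative assumptions rather than the data discussed in the thread.

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

rng = random.Random(1)
true_b, true_m, sigma = 1.0, 2.0, 1.0   # assumed intercept, slope, noise sd
x = [float(i) for i in range(30)]        # fixed design points

# Regenerate the noise many times and refit the slope each time.
slopes = []
for _ in range(2000):
    y = [true_b + true_m * xi + rng.gauss(0, sigma) for xi in x]
    slopes.append(ols_slope(x, y))

# Textbook result under this model: sd(slope) = sigma / sqrt(Sxx).
mean_x = sum(x) / len(x)
sxx = sum((xi - mean_x) ** 2 for xi in x)
theoretical_sd = sigma / sxx ** 0.5

print("mean of fitted slopes:", statistics.mean(slopes))
print("sd of fitted slopes:  ", statistics.stdev(slopes))
print("theoretical sd:       ", theoretical_sd)
```

Any single fitted slope is then one draw from this distribution, which is the sense in which it is a realization of a random quantity.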

"Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where Y = mx + b + \epsilon , then you still have an infinite range of parameters which cannot be sampled!"

I'm not even sure what you mean here - it makes no sense.
 
  • #32
statdad said:
"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample, we think of them just as we do every statistic: as specific realizations of a random quantity.

I'm sorry to continue arguing this, but what you are saying is simply not true. The definition of a statistic is relative to the experiment. If the sampling from an initial distribution is part of the experiment, then the results are statistics (even for a sample size of 1). But if you are making a measurement on a fixed set of data, the measurement is not a statistic -- it's simply a measurement.

If a colleague sends you a series of numbers with no explanation and asks for the linear regression, you are not going to tell him: "Sir, the sample mean of the slope based on 1 sample is 5". No, you're going to just tell him, "The data set has a slope of 5."

Now, it may be the case that this data was randomly generated by your colleague; in which case, he will record your measurement and conclude that the sample mean of the slope based on 1 sample is 5. However, from your perspective, this was an isolated experiment and the slope was not a statistic. Or, it could be the case that the data was not randomly generated, but is instead a permutation of the digits of the constant Pi.

Now, by your logic, perspective is irrelevant, and we should view every number in the world as a statistic. But that doesn't make sense. If you have 1 daughter, you would not say "The sample mean of the number of daughters I have is 1," because from your perspective, this number was not drawn randomly.

In my case, it makes no difference where the original data set came from. It so happens that I did generate the data randomly, but since I was not interested in measuring statistics, I chose the perspective that it was a single fixed data set and hence I do not refer to the results as a statistic.

It's just like when sylas drew an L-configuration of 3 points: he referred to the line passing through them as having fixed parameters. He didn't say, "The sample mean of the slope of the line is...X".

In short, we can choose to make statistical measurements whenever we like, and when we do, we refer to those statistical measurements as statistics only relative to the specific statistical experiment.

I'm not even sure what you mean here - it makes no sense.

What part confuses you?
 
  • #33
""Sir, the sample mean of the slope based on 1 sample is 5""

Since we never speak of the sample mean of a slope, this statement is meaningless. When you calculate a regression, if the estimates are the only item of interest, you are using regression in a very narrow, reckless, rather foolish and dangerous manner. You should always be interested in the accuracy of your (estimated) slope - whether you do a formal test, or confidence interval, whatever.

"However, from your perspective, this was an isolated experiment and the slope was not a statistic"
No, the slope here is a statistic - generated from data.

"...we should view every number in the world as a statistic. "

I never said, nor implied, any such thing.
 
  • #34
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients of the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient in the two regressions.
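The check described here can be sketched in a few lines. In simple linear regression the slope's t statistic reduces to r * sqrt(n - 2) / sqrt(1 - r²), which is symmetric in X and Y, so the forward and reverse regressions should agree. The simulated data and the helper name `slope_t_stat` below are assumptions for illustration only, not junglebeast's data.

```python
import math
import random

def slope_t_stat(x, y):
    """t statistic (slope / standard error) for the regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    sse = syy - slope * sxy                  # residual sum of squares
    se = math.sqrt(sse / (n - 2) / sxx)      # standard error of the slope
    return slope / se

# Hypothetical data, generated only to exercise the function.
rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

t_forward = slope_t_stat(x, y)   # Y regressed on X
t_reverse = slope_t_stat(y, x)   # X regressed on Y

print("t (Y on X):", t_forward)
print("t (X on Y):", t_reverse)
```

The slopes themselves differ between the two fits, but the t statistics match up to floating-point rounding.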
 
  • #35
EnumaElish said:
junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients of the reverse regression. I just wanted to verify that you were getting the same t statistic for the slope coefficient in the two regressions.

Thanks for clarifying your intent.
 
