Linear Regression: reversing the roles of X and Y

statdad · Jun 9, 2009

"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

"Although one could choose to generate multiple data sets, and look at the distribution of the m and b statistics across those data sets, this would not be useful in any way ..."
The distributions of the slope and intercept are conceptualized just as the distributions of sample means, standard deviations, etc. In textbooks all distributions in these situations are normal or t, in real life not so much, but the idea is the same.

"Therefore, the full class of available data sets has infinite range, and therefore cannot be randomly sampled and has no distribution. Even if you restricted the analysis to a fixed class of problems, say, where Y = mx + b + \epsilon , then you still have an infinite range of parameters which cannot be sampled!"

I'm not even sure what you mean here - it makes no sense.

junglebeast · Jun 9, 2009

statdad said:

"I did not generate multiple data sets and measure m and b across them, so it is not valid to say that m and b are statistics in my analysis; they are simply dependent variables which were chosen such that, when treated as constants in a linear equation, a certain criterion is minimized."

Not true at all. Whenever you have any set of random data, collected or generated, the slope and intercept calculated from least squares are statistics. It may be awkward, or extremely difficult, to define a population, but they are statistics nevertheless. If you are saying they don't have a distribution because these values are based on one sample: we think of these just as we do every statistic: specific realizations of a random quantity.

I'm sorry to continue arguing this, but what you are saying is simply not true. The definition of a statistic is relative to the experiment. If the sampling from an initial distribution is part of the experiment, then the results are statistics (even for a sample size of 1). But if you are making a measurement on a fixed set of data, the measurement is not a statistic -- it's simply a measurement.

If a colleague sends you a series of numbers with no explanation and asks for the linear regression, you are not going to tell him: "Sir, the sample mean of the slope based on 1 sample is 5". No, you're going to just tell him, "The data set has a slope of 5."

Now, it may be the case that this data was randomly generated by your colleague; in which case, he will record your measurement and conclude that the sample mean of the slope based on 1 sample is 5. However, from your perspective, this was an isolated experiment and the slope was not a statistic. Or, it could be the case that the data was not randomly generated, but is instead a permutation of the digits of the constant Pi.

Now, by your logic, perspective is irrelevant, and we should view every number in the world as a statistic. But that doesn't make sense. If you have 1 daughter, you would not say "The sample mean of the number of daughters I have is 1," because from your perspective, this number was not drawn randomly.

In my case, it makes no difference where the original data set came from. It so happens that I did generate the data randomly, but since I was not interested in measuring statistics, I chose the perspective that it was a single fixed data set and hence I do not refer to the results as a statistic.

Just like when sylas draw an L-configuration of 3 points, he referred to the line passing through them as having fixed parameters, he didn't say "The sample mean of the slope of the line is...X"

In short, we can choose to make statistical measurements whenever we like, and when we do, we refer to those statistical measurements as statistics only relative to the specific statistical experiment.

I'm not even sure what you mean here - it makes no sense.

What part confuses you?

statdad · Jun 9, 2009

""Sir, the sample mean of the slope based on 1 sample is 5""

Since we never speak of the sample mean of a slope, this statement is meaningless. When you calculate a regression, if the estimates are the only item of interest, you are using regression in a very narrow, reckless, rather foolish and dangerous manner. You should always be interested in the accuracy of your (estimated) slope - whether you do a formal test, or confidence interval, whatever.

"However, from your perspective, this was an isolated experiment and the slope was not a statistic"
No, the slope here is a statistic - generated from data.

"...we should view every number in the world as a statistic. "

I never said, nor implied, any such thing.

EnumaElish · Jun 9, 2009

junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same the t statistic for the slope coefficient between the two regressions.

junglebeast · Jun 9, 2009

EnumaElish said:

junglebeast, I did not ask for the t statistics because I did not trust your estimates, nor did I want them so I could see whether the coefficients of the regular (direct, forward, etc.) regression are statistically different from the coefficients on the reverse regression. I just wanted to verify that you were getting the same the t statistic for the slope coefficient between the two regressions.

Thanks for clarifying your intent.

Linear Regression: reversing the roles of X and Y

Similar threads

B A Little Probability Puzzle

I A variant of the Monty Hall problem

I What Are the Axioms of Fuzzy Logic and How Do They Extend Boolean Algebra?

I Please Explain (actually explain) The Monty Hall Problem

B How Rare Is Low Smartphone Usage Among Metro Travelers in Japan?

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers