Coefficient of Determination in case of repeat points, in linear regression

Discussion Overview

The discussion centers on the coefficient of determination (R²) in the context of linear regression, specifically addressing the scenario where there are repeat points in the dataset. Participants explore the implications of having multiple response values for a single regressor value on the value of R², seeking to understand why R² is strictly less than 1 in such cases.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant asks how to prove that R² is strictly less than 1 when there are repeat points in the dataset.
  • Another participant suggests that a general proof would suffice for understanding the behavior of R².
  • A participant acknowledges that while R² is generally less than or equal to 1, they specifically want to demonstrate that R² is less than 1 in the case of repeat points.
  • A heuristic proof is outlined, indicating that for R² to equal 1, the regression line must coincide with all data points, which is not possible when there are multiple y-values for a single x-value.
  • The participant elaborates on the relationship between the regression line and the sample averages, explaining that the regression line cannot pass through multiple y-values at the same x-value unless certain conditions are met, which are not satisfied in the case of repeat points.

Areas of Agreement / Disagreement

Participants do not reach a consensus on a definitive proof but engage in a discussion that highlights different aspects of the problem. There is an acknowledgment of the complexity involved in proving the claim regarding R² and repeat points.

Contextual Notes

The discussion involves assumptions about the behavior of regression lines and the implications of having repeated data points, but these assumptions are not fully resolved or universally accepted among participants.

maverick280857
Hello,

In simple linear regression (or even in multiple linear regression) how does one prove that the coefficient of determination, given by

[tex]R^2 = \frac{SS_{Reg}}{SS_{Total}} = 1-\frac{SS_{Res}}{SS_{Total}}= 1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\overline{y})^2}[/tex]

is strictly less than 1, if there are repeat points? That is, if there are multiple values of the response [itex]y_i[/itex] at one value of the regressor [itex]x_i[/itex]?

Thanks in advance.
 
Wouldn't a general proof be sufficient?
 
Well, it is easy to see that [itex]R^2 \leq 1[/itex]. For the repeat-point case, I want to show that [itex]R^2 < 1[/itex].
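One way to make the strict inequality precise, as a sketch only: assume two observations i and j share the same regressor value(s) but have [itex]y_i \neq y_j[/itex], so any regression function assigns them a common fitted value c. The symbols i, j, c and t are introduced here for illustration and are not from the original question.

[tex]SS_{Res} \geq (y_i - c)^2 + (y_j - c)^2 \geq \min_{t}\left[(y_i - t)^2 + (y_j - t)^2\right] = \frac{(y_i - y_j)^2}{2} > 0[/tex]

The minimum is attained at [itex]t = (y_i + y_j)/2[/itex]. Since [itex]y_i \neq y_j[/itex] also means the responses are not all equal, [itex]SS_{Total} > 0[/itex], and therefore [itex]R^2 = 1 - SS_{Res}/SS_{Total} < 1[/itex]. The argument uses only the fact that the two fitted values coincide, so it should carry over to multiple regression as well.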
 
Ah; thanks for pointing that out.

I have the outline of a heuristic proof. For R^2 = 1 the regression line has to pass through every data point. Also, as a general matter, the least-squares line y = a + b x has to go through the sample averages of (X, Y); that is, mean(Y) = a + b mean(X), so the fitted value at x1 is Y(x1) = a + b x1 = mean(Y) - b(mean(X) - x1). Suppose your data are {(x1, y1), (x1, y2), (x2, y3)}, y1 is not equal to y2, and your slope coefficient is finite, i.e. -infty < b < +infty.

If b(mean(X) - x1) equals mean(Y) - y1 then Y(x1) = y1, and the regression line does not go through y2.

If b(mean(X) - x1) equals mean(Y) - y2 then Y(x1) = y2, and the regression line does not go through y1.

If b(mean(X) - x1) equals neither mean(Y) - y1 nor mean(Y) - y2 then the regression line goes through neither y1 nor y2.

In every case at least one residual at x1 is nonzero, so SS_Res > 0 and R^2 < 1.
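The three-point example above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the helper name `r_squared` and the data values are made up for illustration, and the fit is ordinary least squares for a single regressor written in plain Python.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x fitted to (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept from the usual normal equations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar  # forces the line through (mean(X), mean(Y))
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_total


# Hypothetical data of the form {(x1, y1), (x1, y2), (x2, y3)} with y1 != y2.
xs_repeat = [1.0, 1.0, 2.0]
ys_repeat = [1.0, 3.0, 5.0]
r2_repeat = r_squared(xs_repeat, ys_repeat)

# Same responses but no repeated x: the points are collinear.
r2_distinct = r_squared([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])

print(r2_repeat, r2_distinct)
```

For these values the fitted line is y = -1 + 3x, the residuals at x1 = 1 are -1 and +1, and R^2 comes out at about 0.75, while the collinear data without a repeat point give R^2 equal to 1 up to rounding.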
 