Coefficient of Determination in case of repeat points, in linear regression

In summary, the coefficient of determination, R^2, is strictly less than 1 if the data contain repeat points, that is, different response values observed at the same regressor value. A heuristic proof shows that the regression line cannot pass through all data points in that case, so the residual sum of squares is positive and R^2 < 1.
  • #1
maverick280857
Hello,

In simple linear regression (or even in multiple linear regression) how does one prove that the coefficient of determination, given by

[tex]R^2 = \frac{SS_{Reg}}{SS_{Total}} = 1-\frac{SS_{Res}}{SS_{Total}}= 1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\overline{y})^2}[/tex]

is strictly less than 1, if there are repeat points? That is, if there are multiple values of the response [itex]y_i[/itex] at one value of the regressor [itex]x_i[/itex]?

Thanks in advance.
 
  • #2
Wouldn't a general proof be sufficient?
 
  • #3
Well, it is easy to see that [itex]R^2 \leq 1[/itex]. For the repeat-point case, I want to show that [itex]R^2 < 1[/itex].
 
  • #4
Ah; thanks for pointing that out.

I have the outline of a heuristic proof. For R^2 = 1 the regression line has to pass through every data point. Also, as a general matter, the regression line y = a + b x has to go through the sample averages of (X, Y), that is, mean(Y) = a + b mean(X). Suppose your data are {(x1, y1), (x1, y2), (x2, y3)} with y1 ≠ y2 and x1 ≠ x2, so that the slope coefficient b is finite. Write Y(x1) = a + b x1 for the fitted value at x1.

If b(mean(X) - x1) equals mean(Y) - y1 then Y(x1) = y1, and the regression line does not go through y2.

If b(mean(X) - x1) equals mean(Y) - y2 then Y(x1) = y2, and the regression line does not go through y1.

If b(mean(X) - x1) equals neither mean(Y) - y1 nor mean(Y) - y2 then the regression line does not go through y1 or y2.
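The three-point case above can be checked numerically. What follows is only a sketch: the data values and the helper name `r_squared` are made up for illustration, and the fit uses NumPy's `polyfit` as one convenient way to get the least-squares line.

```python
import numpy as np

def r_squared(x, y):
    """R^2 = 1 - SS_Res / SS_Total for a least-squares line with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, 1)            # slope b, intercept a
    y_hat = a + b * x                     # one fitted value per x value
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical data of the form {(x1, y1), (x1, y2), (x2, y3)} with y1 != y2
r2_rep = r_squared([1.0, 1.0, 2.0], [1.0, 2.0, 3.0])

# Same responses, but at three distinct x values lying on one line
r2_line = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

print(r2_rep, r2_line)
```

With the repeat at x = 1 the fitted value there is the average of the two responses, so both residuals are nonzero and R^2 stays below 1; without the repeat the same responses can be fitted exactly (up to floating-point rounding).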
 
  • #5


Thank you for your question. The coefficient of determination, R^2, measures how well the regression line fits the data. For a least-squares fit with an intercept it ranges from 0 to 1, with 1 indicating a perfect fit and 0 indicating that the fitted line explains none of the variation in the response.

With repeat points, where two or more different response values are observed at one value of the regressor, R^2 is strictly less than 1. The formula for R^2 involves the sum of squared residuals, that is, the sum of squared differences between the observed and the predicted values. The model gives exactly one predicted value at each regressor value, so it cannot match two different observations there; at least one of them has a nonzero residual, and the sum of squared residuals is strictly positive.

Two details complete the argument. First, the total sum of squares is also strictly positive, because the responses are not all equal, so R^2 is well defined. (In the degenerate case where all observed responses are identical, both sums of squares are 0 and R^2 is undefined, not equal to 1.) Second, the repeats must actually differ: exact duplicates of a point that lies on the fitted line add nothing to the residual sum, so R^2 = 1 is still possible then.

In summary, when the repeated regressor value carries different responses, the sum of squared residuals is positive while the total sum of squares is positive and finite, so R^2 = 1 - SS_Res/SS_Total < 1.
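The strict inequality can also be written as an explicit bound. This is a sketch under stated assumptions: ordinary least squares with an intercept, and two observations [itex](x_1, y_1)[/itex] and [itex](x_1, y_2)[/itex] with [itex]y_1 \neq y_2[/itex] (the labels are illustrative). Both observations receive the same fitted value [itex]\hat{y}[/itex], and for any number c the sum [itex](y_1-c)^2+(y_2-c)^2[/itex] is smallest at [itex]c=(y_1+y_2)/2[/itex], so:

```latex
SS_{Res} \;\ge\; (y_1-\hat{y})^2 + (y_2-\hat{y})^2 \;\ge\; \frac{(y_1-y_2)^2}{2} \;>\; 0,
\qquad
SS_{Total} \;\ge\; (y_1-\overline{y})^2 + (y_2-\overline{y})^2 \;\ge\; \frac{(y_1-y_2)^2}{2} \;>\; 0,
\qquad\text{hence}\qquad
R^2 = 1-\frac{SS_{Res}}{SS_{Total}} \;<\; 1 .
```

The second inequality is what rules out the 0/0 case: the repeated point with two different responses already forces the total sum of squares to be positive.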
 

1. What is the coefficient of determination in linear regression?

The coefficient of determination, also known as R-squared, is a statistical measure that indicates the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a linear regression model.

2. How is the coefficient of determination calculated?

The coefficient of determination is calculated by dividing the explained sum of squares (SSreg) by the total sum of squares (SStot). It is typically expressed as a percentage and ranges from 0 to 100%, with higher values indicating a stronger relationship between the variables.
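As a small illustration of the calculation, the sketch below computes R^2 both as SSreg/SStot and as 1 - SSres/SStot and checks that the two agree. The sample data are hypothetical, and the agreement relies on the fit being ordinary least squares with an intercept.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Least-squares fit of y = a + b*x using the design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = a + b * x

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares

r2_explained = ss_reg / ss_tot
r2_residual = 1.0 - ss_res / ss_tot
print(r2_explained, r2_residual)
```

The two expressions coincide because SStot = SSreg + SSres holds for a least-squares fit that includes an intercept; without the intercept that decomposition can fail.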

3. What does a coefficient of determination of 1 represent?

A coefficient of determination of 1 (or 100%) indicates that all of the variation in the dependent variable can be explained by the independent variable(s) in the linear regression model. This means that the model fits the data perfectly and every residual is zero.

4. Can the coefficient of determination be negative?

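For an ordinary least-squares fit that includes an intercept, no: R^2 always lies between 0 and 1 (or 0% and 100%). When R^2 is computed as 1 - SSres/SStot for a model without an intercept, or for predictions on data the model was not fitted to, it can be negative. A negative value indicates that the model predicts worse than simply using the mean of the dependent variable.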
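One caveat is easy to check numerically: if the line is forced through the origin and R^2 is still computed as 1 - SSres/SStot, the value can fall below zero. The data below are hypothetical and picked so that a through-origin line fits badly; this is a sketch of that special case, not a statement about fits with an intercept.

```python
import numpy as np

# Hypothetical data with a downward trend and a large intercept
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])

# Least-squares line forced through the origin: y_hat = b * x
b = np.sum(x * y) / np.sum(x * x)
y_hat = b * x

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_no_intercept = 1.0 - ss_res / ss_tot
print(r2_no_intercept)
```

Here the residual sum of squares exceeds the total sum of squares, which is exactly the situation in which predicting the mean of y would do better than the fitted line.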

5. What does a low coefficient of determination indicate?

A low coefficient of determination indicates that the independent variable(s) in the linear regression model do not explain much of the variation in the dependent variable. This could mean that the model is not a good fit for the data or that there are other factors influencing the dependent variable.
