R-squared statistic for goodness-of-fit

  • Context: Undergrad
  • Thread starter: EngWiPy
  • Tags: Statistic

Discussion Overview

The discussion centers around the r-squared statistic as a measure of goodness-of-fit in linear regression. Participants explore its definition, implications, and the relationship between r-squared, covariance, and correlation coefficients. The scope includes theoretical understanding and mathematical reasoning related to regression analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant defines r-squared and questions how it can equal zero, suggesting it indicates maximum variation around the mean.
  • Another participant prompts a simpler question about correlation and covariance, implying these concepts are foundational to understanding r-squared.
  • A later reply emphasizes that r-squared generalizes the correlation coefficient and connects it to covariance, suggesting that understanding zero covariance is key to grasping r-squared.
  • One participant references a Wikipedia page to explain that a better fit in linear regression results in r-squared values closer to 1, indicating an understanding of the relationship between the fitted line and the mean.
  • Another participant concludes that minimizing the sum of squared errors leads to the condition where r-squared could be zero, reinforcing the connection between r-squared and correlation coefficients.

Areas of Agreement / Disagreement

Participants express varying levels of understanding about the relationship between r-squared, covariance, and correlation coefficients. While some points of clarification are made, no consensus is reached on the implications of r-squared equaling zero.

Contextual Notes

Some participants rely on definitions and relationships that may not be fully explored, such as the conditions under which covariance can be zero and how this affects r-squared. The discussion does not resolve these dependencies or assumptions.

EngWiPy
Hello all,

While I was reading about linear regression, I stumbled on the R-squared statistic, which measures the goodness-of-fit of the line to the data points. It is defined as:

$$R^2 = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - E[y])^2}$$

where ##f(x_i)## is the fitted/predicted response value for ##x_i##, ##y_i## is the actual observed response variable, and ##E[y]## is the expected value of ##y_i##.

It is said that this statistic falls between 0 and 1. I can understand why ##R^2## could be 1 (it means that ##y_i = f(x_i)##, i.e., the line fits the data points exactly), but how could ##R^2## be 0? This implies, I think, that the maximum variation of ##y## is its variation around the mean, and thus the numerator cannot exceed that value. Is this true?
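The two boundary cases in the definition above can be checked numerically. This is a minimal sketch (not from the thread; the function and variable names are my own): a perfect fit gives ##R^2 = 1##, and predicting the sample mean for every point makes numerator and denominator equal, giving ##R^2 = 0##.

```python
def r_squared(y, f):
    """R^2 = 1 - SS_res / SS_tot for observations y and predictions f."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]

# Perfect fit: predictions equal the observations, so SS_res = 0 and R^2 = 1.
print(r_squared(y, y))                       # 1.0

# Predicting the mean everywhere: SS_res = SS_tot, so R^2 = 0.
mean_y = sum(y) / len(y)
print(r_squared(y, [mean_y] * len(y)))       # 0.0
```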

Thanks
 
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##
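The first two links of that chain can be illustrated with a small sketch (my own, not from the thread). A classic example of zero covariance is a symmetric nonlinear relationship: here ##y = x^2## depends on ##x## completely, yet the covariance, and hence the correlation, is zero, which is exactly the situation that drives ##R^2## to zero for a least-squares line.

```python
import statistics as st

def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = st.mean(x), st.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def rho(x, y):
    """Pearson correlation coefficient: cov scaled by the standard deviations."""
    return cov(x, y) / (st.stdev(x) * st.stdev(y))

x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]     # y = x^2: fully dependent on x, but symmetric about 0

print(cov(x, y))        # 0.0 -- positive and negative products cancel exactly
print(rho(x, y))        # 0.0 -- so the least-squares slope cov/var(x) is 0 too
```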
 
StoneTemplePython said:
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##

Thanks. I know the denominator is proportional to the variance of ##y##, but what about the numerator? I still don't see the connection. How are these related?
 
##R^2## generalizes the correlation coefficient.

https://en.wikipedia.org/wiki/Coefficient_of_determination

wikipedia said:
As squared correlation coefficient
In linear least squares regression with an estimated intercept term, R2 equals the square of the Pearson correlation coefficient between the observed y and modeled (predicted) f data values of the dependent variable.
If you understand covariance, and when you can get a zero there, then this leads directly to the answer to your question. You can generalize further from here as needed.
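The Wikipedia statement quoted above can be verified numerically. This sketch (my own; the data values are made up for illustration) fits a least-squares line with an intercept and checks that ##R^2## from the definition equals the squared Pearson correlation between ##x## and ##y##.

```python
import statistics as st

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

mx, my = st.mean(x), st.mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Least-squares line with intercept: slope = S_xy / S_xx.
slope = sxy / sxx
intercept = my - slope * mx
f = [intercept + slope * a for a in x]

# R^2 from the definition in the thread.
ss_res = sum((b - fb) ** 2 for b, fb in zip(y, f))
r2 = 1 - ss_res / syy

# Pearson correlation coefficient.
rho = sxy / (sxx ** 0.5 * syy ** 0.5)

print(abs(r2 - rho ** 2) < 1e-12)   # True: R^2 equals rho^2 here
```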
 
In the Wikipedia page it says:

The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of ##R^2## is to 1

I think I understand now why ##R^2## could be 0. In calculating the parameters of the line that fits the data, we minimized the sum of squared errors. This means that:

$$\sum_i (y_i - f(x_i))^2 \leq \sum_i (y_i - E[y])^2$$

where the worst case is when equality holds. I think the correlation coefficient is another way of understanding the coefficient of determination.
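That inequality can be demonstrated with a sketch (my own illustration): the horizontal line at the mean is itself one of the candidate lines being minimized over, so the least-squares fit can never do worse than it. With the zero-covariance data from earlier in the thread, the fitted slope is 0, equality holds, and ##R^2 = 0## even though ##y## depends on ##x##.

```python
import statistics as st

def fitted_sse(x, y):
    """Sum of squared errors of the least-squares line through (x, y)."""
    mx, my = st.mean(x), st.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]                 # zero covariance with x, so slope = 0

sse = fitted_sse(x, y)
sst = sum((b - st.mean(y)) ** 2 for b in y)

print(sse == sst)   # True: equality holds, hence R^2 = 1 - sse/sst = 0
```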
 
