R-squared statistic for goodness-of-fit

In summary, the r-squared statistic measures the goodness-of-fit of a line to data points in linear regression. It falls between 0 and 1, with 1 indicating a perfect fit and 0 indicating that the fitted line does no better than the mean of the observed values. The numerator of the fraction in the formula is the sum of squared residuals between the observed and predicted values, while the denominator is proportional to the variance of the observed values; the smaller the sum of squared errors, the closer r-squared is to 1.
  • #1
EngWiPy
Hello all,

While I was reading about linear regression, I stumbled on the concept of the r-squared statistic, which measures the goodness-of-fit of the line to the data points. It is defined as:

[tex]R^2 = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - E[y])^2}[/tex]

where f(x_i) is the fitted/predicted response value for x_i, y_i is the actual observed response value, and E[y] is the mean of the y_i.

It is said that this statistic falls between 0 and 1. I can understand why r-squared could be 1 (it means that y_i = f(x_i), i.e., the line fits the data points exactly), but how could r-squared be 0? This implies, I think, that the variation of y around the fitted line can never exceed its variation around the mean, so the numerator cannot exceed the denominator. Is this true?

Thanks
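
To make the definition concrete, here is a minimal sketch, assuming Python with numpy and toy data invented for illustration, that fits a least-squares line and computes R^2 directly from the formula above:

[code=python]
# Minimal sketch: fit a least-squares line and compute R^2 from the
# definition above. Toy data invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy linear data

slope, intercept = np.polyfit(x, y, 1)  # least-squares line fit
f = slope * x + intercept               # predicted values f(x_i)

sse = np.sum((y - f) ** 2)          # sum_i (y_i - f(x_i))^2
sst = np.sum((y - y.mean()) ** 2)   # sum_i (y_i - E[y])^2
r_squared = 1 - sse / sst
print(r_squared)  # close to 1 for strongly linear data
[/code]

For strongly linear data like this, SSE is small relative to SST, so R^2 comes out close to 1.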
 
  • #2
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##
 
  • #3
StoneTemplePython said:
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##

Thanks. I know the denominator is proportional to the variance of y, but what about the numerator? I still don't see the connection. How are they related?
 
  • #4
##R^2## generalizes the correlation coefficient.

https://en.wikipedia.org/wiki/Coefficient_of_determination

wikipedia said:
As squared correlation coefficient
In linear least squares regression with an estimated intercept term, ##R^2## equals the square of the Pearson correlation coefficient between the observed ##y## and modeled (predicted) ##f## data values of the dependent variable.
If you understand covariance, and when you can get a zero there, then this leads directly to getting the answer for your question. You can further generalize from here as needed.
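
As a quick numerical check of that statement, here is a sketch (assuming Python with numpy and made-up data):

[code=python]
# Sketch: for a least-squares line with an intercept, R^2 equals the
# squared Pearson correlation coefficient. Data invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x - 2.0 + rng.normal(scale=2.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept

r_squared = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
pearson_r = np.corrcoef(x, y)[0, 1]
print(r_squared, pearson_r ** 2)  # the two values agree
[/code]

The two printed values agree up to floating-point rounding, which is exactly the equivalence the Wikipedia passage describes.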
 
  • #5
In the Wikipedia page it says:

The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of ##R^2## is to 1

I think I understand now why ##R^2## could be 0. In calculating the parameters of the line to fit the data, we minimize the sum of squared errors, and the horizontal line f(x) = E[y] (slope zero) is one of the candidate lines, so the least-squares fit can never do worse than it. This means that:

[tex]\sum_i(y_i - f(x_i))^2 \leq \sum_i (y_i - E[y])^2[/tex]

with the worst case being when equality holds, i.e., when the best-fitting line is the mean itself and ##R^2 = 0##. I think the correlation coefficient is another way of understanding the coefficient of determination.
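
That worst case can be seen numerically. A sketch, assuming Python with numpy and x and y generated independently of each other:

[code=python]
# Sketch: when x carries no information about y, the least-squares line is
# nearly flat at the mean of y, so SSE ~ SST and R^2 ~ 0. Invented data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)
y = rng.normal(size=1000)  # generated independently of x

slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept

r_squared = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
print(slope, r_squared)  # slope near 0, R^2 near 0
[/code]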
 

1. What is the R-squared statistic for goodness-of-fit?

The R-squared statistic for goodness-of-fit is a measure of how well a statistical model fits the observed data. It is also known as the coefficient of determination and is represented by the symbol R^2. This statistic ranges from 0 to 1, with a higher value indicating a better fit between the model and the data.

2. How is the R-squared statistic calculated?

The R-squared statistic is calculated as one minus the ratio of the sum of squared errors (SSE) to the total sum of squares (SST), i.e., R^2 = 1 - (SSE/SST). The SSE is the sum of the squared differences between the observed data and the predicted values from the model. The SST is the sum of the squared differences between the observed data and the mean of the data.
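
For example, with numbers invented for illustration: if SSE = 2 and SST = 10, then R^2 = 1 - 2/10 = 0.8, meaning the model accounts for 80% of the variability of the data around its mean.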

3. What does a high R-squared value indicate?

A high R-squared value indicates that the model is a good fit for the data. This means that a large proportion of the variability in the data can be explained by the model. Generally, an R-squared value of 0.7 or higher is considered a good fit, but this can vary depending on the context and the type of data being analyzed.

4. Can the R-squared statistic be negative?

In the standard setting (linear least-squares regression with an intercept, evaluated on the same data used for fitting), R-squared lies between 0 and 1, with 0 indicating the model does no better than the mean of the data and 1 indicating a perfect fit. Outside that setting, however, the formula R^2 = 1 - (SSE/SST) can produce a negative value, for example for a model fitted without an intercept or evaluated on held-out data. A negative R-squared means the model's predictions are worse than simply using the mean of the data.
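
Here is a sketch of how a negative value can arise, assuming Python with numpy and deliberately bad, invented predictions:

[code=python]
# Sketch: applying R^2 = 1 - SSE/SST to predictions that are worse than
# the mean yields a negative value. Numbers invented for illustration.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])  # observed values
f = np.array([4.0, 3.0, 2.0, 1.0])  # deliberately bad "predictions"

sse = np.sum((y - f) ** 2)          # = 20
sst = np.sum((y - y.mean()) ** 2)   # = 5
print(1 - sse / sst)                # -3.0: worse than predicting the mean
[/code]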

5. Are there any limitations to using the R-squared statistic for goodness-of-fit?

Yes, there are some limitations to using the R-squared statistic. It only measures the strength of the linear relationship between the model and the data, and does not indicate whether the model itself is appropriate for the data. It can also be influenced by the number of data points, and it is sensitive to model complexity: adding more predictors can only increase R-squared, even when they have no real explanatory power, which is why adjusted R-squared is often reported alongside it. It is important to consider other measures and techniques when evaluating the overall goodness-of-fit of a statistical model.
