R-squared statistic for goodness-of-fit

In summary, the r-squared statistic measures the goodness-of-fit of a line to data points in linear regression. It falls between 0 and 1, with 1 indicating a perfect fit and 0 indicating that the fitted line does no better than the mean of the observed values. The numerator of the fraction in the formula is the sum of squared residuals between the observed and predicted values, while the denominator is proportional to the variance of the observed values; the smaller the sum of squared errors, the closer r-squared is to 1.
  • #1
EngWiPy
Hello all,

While I was reading about linear regression, I stumbled on the concept of the r-squared statistic, which measures the goodness-of-fit of the line to the data points. It is defined as:

[tex]R^2 = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - E[y])^2}[/tex]

where f(x_i) is the fitted/predicted response value for x_i, y_i is the actual observed response value, and E[y] is the mean of the y_i.

It is said that this statistic falls between 0 and 1. I can understand why r-squared could be 1 (it means that y_i = f(x_i), i.e., the line fits the data points exactly), but how could r-squared be 0? This implies, I think, that the variation of y around the fitted line can never exceed its variation around the mean, so the numerator cannot exceed the denominator. Is this true?

Thanks
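
To make the definition concrete, here is a minimal sketch, assuming Python with numpy and toy data invented for illustration, that fits a least-squares line and computes R^2 directly from the formula above:

[code=python]
# Minimal sketch: fit a least-squares line and compute R^2 from the
# definition above. Toy data invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy linear data

slope, intercept = np.polyfit(x, y, 1)  # least-squares line fit
f = slope * x + intercept               # predicted values f(x_i)

sse = np.sum((y - f) ** 2)          # sum_i (y_i - f(x_i))^2
sst = np.sum((y - y.mean()) ** 2)   # sum_i (y_i - E[y])^2
r_squared = 1 - sse / sst
print(r_squared)  # close to 1 for strongly linear data
[/code]

For strongly linear data like this, SSE is small relative to SST, so R^2 comes out close to 1.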
 
  • #2
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##
 
  • #3
StoneTemplePython said:
I'd start with a simpler question:

do you know what a correlation coefficient is? Or even more simply: do you know what covariance is? How can you have zero covariance? Note that these build on each other:

##cov(X,Y) \to \rho(X,Y) \to R^2_{X,Y,...}##

Thanks. I know the denominator is proportional to the variance of y, but what about the numerator? I still don't see the connection. How are they related?
 
  • #4
##R^2## generalizes the correlation coefficient.

https://en.wikipedia.org/wiki/Coefficient_of_determination

wikipedia said:
As squared correlation coefficient
In linear least squares regression with an estimated intercept term, ##R^2## equals the square of the Pearson correlation coefficient between the observed ##y## and modeled (predicted) ##f## data values of the dependent variable.
If you understand covariance, and when you can get a zero there, then this leads directly to getting the answer for your question. You can further generalize from here as needed.
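
As a quick numerical check of that statement, here is a sketch (assuming Python with numpy and made-up data):

[code=python]
# Sketch: for a least-squares line with an intercept, R^2 equals the
# squared Pearson correlation coefficient. Data invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x - 2.0 + rng.normal(scale=2.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept

r_squared = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
pearson_r = np.corrcoef(x, y)[0, 1]
print(r_squared, pearson_r ** 2)  # the two values agree
[/code]

The two printed values agree up to floating-point rounding, which is exactly the equivalence the Wikipedia passage describes.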
 
  • #5
In the Wikipedia page it says:

The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of ##R^2## is to 1

I think I understand now why ##R^2## could be 0. In calculating the parameters of the line to fit the data, we minimize the sum of squared errors, and the horizontal line f(x) = E[y] (slope zero) is one of the candidate lines, so the least-squares fit can never do worse than it. This means that:

[tex]\sum_i(y_i - f(x_i))^2 \leq \sum_i (y_i - E[y])^2[/tex]

with the worst case being when equality holds, i.e., when the best-fitting line is the mean itself and ##R^2 = 0##. I think the correlation coefficient is another way of understanding the coefficient of determination.
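
That worst case can be seen numerically. A sketch, assuming Python with numpy and x and y generated independently of each other:

[code=python]
# Sketch: when x carries no information about y, the least-squares line is
# nearly flat at the mean of y, so SSE ~ SST and R^2 ~ 0. Invented data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)
y = rng.normal(size=1000)  # generated independently of x

slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept

r_squared = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
print(slope, r_squared)  # slope near 0, R^2 near 0
[/code]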
 

1. What is the R-squared statistic for goodness-of-fit?

The R-squared statistic for goodness-of-fit is a measure of how well a statistical model fits the observed data. It is also known as the coefficient of determination and is represented by the symbol R^2. This statistic ranges from 0 to 1, with a higher value indicating a better fit between the model and the data.

2. How is the R-squared statistic calculated?

The R-squared statistic is calculated as one minus the ratio of the sum of squared errors (SSE) to the total sum of squares (SST), i.e., R^2 = 1 - (SSE/SST). The SSE is the sum of the squared differences between the observed data and the predicted values from the model. The SST is the sum of the squared differences between the observed data and the mean of the data.
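
For example, with numbers invented for illustration: if SSE = 2 and SST = 10, then R^2 = 1 - 2/10 = 0.8, meaning the model accounts for 80% of the variability of the data around its mean.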

3. What does a high R-squared value indicate?

A high R-squared value indicates that the model is a good fit for the data. This means that a large proportion of the variability in the data can be explained by the model. Generally, an R-squared value of 0.7 or higher is considered a good fit, but this can vary depending on the context and the type of data being analyzed.

4. Can the R-squared statistic be negative?

In the standard setting (linear least-squares regression with an intercept, evaluated on the same data used for fitting), R-squared lies between 0 and 1, with 0 indicating the model does no better than the mean of the data and 1 indicating a perfect fit. Outside that setting, however, the formula R^2 = 1 - (SSE/SST) can produce a negative value, for example for a model fitted without an intercept or evaluated on held-out data. A negative R-squared means the model's predictions are worse than simply using the mean of the data.
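
Here is a sketch of how a negative value can arise, assuming Python with numpy and deliberately bad, invented predictions:

[code=python]
# Sketch: applying R^2 = 1 - SSE/SST to predictions that are worse than
# the mean yields a negative value. Numbers invented for illustration.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])  # observed values
f = np.array([4.0, 3.0, 2.0, 1.0])  # deliberately bad "predictions"

sse = np.sum((y - f) ** 2)          # = 20
sst = np.sum((y - y.mean()) ** 2)   # = 5
print(1 - sse / sst)                # -3.0: worse than predicting the mean
[/code]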

5. Are there any limitations to using the R-squared statistic for goodness-of-fit?

Yes, there are some limitations to using the R-squared statistic. It only measures the strength of the linear relationship between the model and the data, and does not indicate whether the model itself is appropriate for the data. It can also be influenced by the number of data points, and it is sensitive to model complexity: adding more predictors can only increase R-squared, even when they have no real explanatory power, which is why adjusted R-squared is often reported alongside it. It is important to consider other measures and techniques when evaluating the overall goodness-of-fit of a statistical model.
