Comparing SD of data with RMSE of regression line


Homework Help Overview

The discussion centers around comparing the standard deviation of a data set with the root mean square error (RMSE) of a regression line to assess the reliability of the regression model. The participants explore the implications of these statistical measures within the context of regression analysis.

Discussion Character

  • Conceptual clarification, Assumption checking, Problem interpretation

Approaches and Questions Raised

  • Participants discuss the relationship between the standard deviation of the data points and the RMSE of the regression line, questioning how these metrics reflect the reliability of the regression model. There are attempts to clarify the distinction between the standard deviation of the observed values and the standard deviation of residuals.

Discussion Status

The conversation is ongoing, with some participants providing clarifications and others expressing confusion about the relationship between the standard deviation and RMSE. There is an acknowledgment of the complexity of the topic, and some participants are considering waiting for further guidance from the lecturer.

Contextual Notes

Participants note that the course is non-mathematical and that specific equations for RMSE and standard deviation were not provided, which may contribute to the confusion in the discussion.

phosgene

Homework Statement



I'm being asked to compare the standard deviation of a data set with the root mean square error of the regression line used to model the data, in order to determine the reliability of the regression line.

Homework Equations



Mean squared error = variance + bias squared

The Attempt at a Solution



I've done some googling and found that the MSE equals the variance + bias squared. So if I'm understanding this correctly, if the regression line is reliable, the bias should be low and hence the RMSE should be approximately equal to the SD?

My main hang-up with this is that if the data just happen to fall exactly on a straight line, they can have a non-zero standard deviation, but the line used to model the data will describe them perfectly and hence have an RMSE of zero.

My head is really spinning at this..
 
phosgene said:
if the data just happens to fall upon a straight line, it can have a non-zero standard deviation, but the line used to model the data will describe it perfectly and hence have a RMSE of zero.
That's a different standard deviation you're referring to there - the standard deviation of the Y values (the observed values of the dependent variable). The standard deviation (or variance) in the formula you quote above is the standard deviation of the residuals, not of the Y values. The residual is defined as the difference between the observed value and the estimated value. The estimated value is the value on the line for a given value of X, i.e. ##\beta_1X+\beta_0##, where ##\beta_1## and ##\beta_0## are the fitted coefficients - the slope and Y intercept. If all the values lie on a non-horizontal line, the standard deviation of the residuals will be zero but the standard deviation of the Y values will not be.
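The distinction can be seen numerically. The following is a minimal sketch, assuming made-up data that lie exactly on the line y = 2x + 1 and using the population (divide by n) convention for both quantities; the helper name `fit_line` is hypothetical.

```python
import math
import statistics

# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a straight line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value on the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

sd_y = statistics.pstdev(ys)  # spread of the Y values themselves
rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))

print(f"SD of Y values: {sd_y:.3f}")
print(f"RMSE of the fitted line: {rmse:.3f}")
```

Here the Y values are spread out only because the X values are, so their standard deviation is clearly non-zero while the residuals all vanish.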
 
andrewkirk said:
That's a different standard deviation you're referring to there - the standard deviation of the Y values (the observed values of the dependent variable). The standard deviation (or variance) in the formula you quote above is the standard deviation of the residuals, not of the Y values. The residual is defined as the difference between the observed value and the estimated value. The estimated value is the value on the line for a given value of X, i.e. ##\beta_1X+\beta_0##, where ##\beta_1## and ##\beta_0## are the fitted coefficients - the slope and Y intercept. If all the values lie on a non-horizontal line, the standard deviation of the residuals will be zero but the standard deviation of the Y values will not be.

Thanks for the reply. Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line? The question is specifically referring to that SD. And it's very similar to the value for the RMSE of the regression line. As far as I can tell, this will say nothing about the reliability of the regression line. I can picture good fits and bad fits for which the SD of the data and RMSE of the regression line are similar..
 
phosgene said:
Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line?
No, they are related by the correlation coefficient ##r## estimated by the regression. See http://statweb.stanford.edu/~susan/courses/s60/split/node60.html.
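For simple linear regression the relation in question is usually written ##\text{RMSE} = \sqrt{1-r^2}\,SD_y##. A rough numerical check is sketched below, assuming made-up noisy data and assuming that both the RMSE and the SD are computed with the same divisor n; if software uses n-1 or n-2 instead, the two sides will differ slightly.

```python
import math
import statistics

# Hypothetical noisy data, invented purely for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 5.2, 4.8, 7.1, 6.9]

n = len(xs)
mx, my = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Least-squares line and sample correlation coefficient
slope = sxy / sxx
intercept = my - slope * mx
r = sxy / math.sqrt(sxx * syy)

# Both quantities use the population convention (divide by n)
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
rmse = math.sqrt(sse / n)
sd_y = math.sqrt(syy / n)

predicted_rmse = math.sqrt(1.0 - r * r) * sd_y
print(f"RMSE from residuals:  {rmse:.6f}")
print(f"sqrt(1 - r^2) * SD_y: {predicted_rmse:.6f}")
```

The two printed numbers agree up to floating-point rounding, so a strong correlation (r close to ±1) makes the RMSE much smaller than the SD of the Y values, while r near zero leaves them almost equal.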
 
Ah, I think I got it now. Thanks!
 
phosgene said:
Thanks for the reply. Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line? The question is specifically referring to that SD. And it's very similar to the value for the RMSE of the regression line. As far as I can tell, this will say nothing about the reliability of the regression line. I can picture good fits and bad fits for which the SD of the data and RMSE of the regression line are similar..

The SD of your y-values comes from two sources: (1) variations in the x-values, leading to different y-values; and (2) random errors. Both of these are present, in general.

Even if you have a perfect fit (with no residual errors) there will be an SD in the y-values, just because they are spread out as a result of the x-values being spread out. In general, though, the SD will be larger than that arising solely from the spread of the x-values, because there are also some residual errors.

Just looking at some measure of the magnitude of the residual errors does not tell you if you have an appropriate "fit": the errors can be large just because there is a lot of noise in the data. To test if the model is appropriate, you need to actually compare two different models to see if one is significantly better than the other. For example, if you have a theoretical model of the form ##Y = f(x) + \epsilon## and a corresponding data set ##y_i = f(x_i) + \epsilon_i, i=1,2, \ldots, n## your aim might be to discover the form of the unknown function ##f(x)##. Perhaps you want to see whether a linear fit ##f(x) = \alpha + \beta x## is OK, or whether you need a quadratic fit ##f(x) = \alpha + \beta x + \gamma x^2##. One way would be to perform two regressions---one for the linear function and one for the quadratic---then compare their rms errors, etc. Another way would be to perform a significance test on the coefficients of the quadratic fit, to see if the quadratic term is genuine or merely an artifact of the random errors.
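The first approach, fitting both models and comparing their RMS errors, might look like the sketch below. It assumes made-up data with visible curvature, and the helpers `polyfit` and `rmse` are hypothetical names written here for illustration (a plain normal-equations solver, not a library routine).

```python
import math

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def rmse(xs, ys, coeffs):
    """Root mean square of the residuals of a fitted polynomial."""
    sse = sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))

# Hypothetical data: roughly y = 1 + 0.5 x^2 with small perturbations
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.4, 3.05, 5.45, 9.1, 13.4]

rmse_linear = rmse(xs, ys, polyfit(xs, ys, 1))
rmse_quadratic = rmse(xs, ys, polyfit(xs, ys, 2))

print(f"linear RMSE:    {rmse_linear:.3f}")
print(f"quadratic RMSE: {rmse_quadratic:.3f}")
```

Because the linear model is a special case of the quadratic one, the quadratic RMSE can never be larger; only a large drop, or a significance test on the quadratic coefficient as described above, suggests the extra term is genuine.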
 
Ok, I went back and plugged the numbers into the relation andrewkirk pointed me to:

##\text{RMSE} = \sqrt{1-r^2}\,SD_y##

The value for the RMSE is ~16.5, whereas the value for SD of the data is ~16.7. The correlation coefficient r is given as 0.6516. With these numbers, the right side of the equation evaluates to about 12.7.

Something seems very wrong. Since this is quickly becoming way more complicated than I think it should be (it's a non-mathematical course on business analytics - the equation for RMSE wasn't given and I'm pretty sure that the SD wasn't either), I'm going to leave it for now and wait for the lecturer to get back to me.
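The arithmetic in this post can be reproduced with the short sketch below. It uses only the rounded figures quoted above (RMSE ~16.5, SD ~16.7, r = 0.6516) and also works backwards to the correlation that the quoted RMSE and SD would imply; because the inputs are rounded, that implied value is only approximate.

```python
import math

# Figures quoted in the post (rounded values)
rmse_reported = 16.5
sd_y = 16.7
r_reported = 0.6516

# RMSE predicted by the relation RMSE = sqrt(1 - r^2) * SD_y
rmse_predicted = math.sqrt(1.0 - r_reported ** 2) * sd_y

# Correlation magnitude that would make the reported RMSE and SD consistent
r_implied = math.sqrt(1.0 - (rmse_reported / sd_y) ** 2)

print(f"predicted RMSE from r = {r_reported}: {rmse_predicted:.2f}")
print(f"|r| implied by RMSE/SD: {r_implied:.3f}")
```

The predicted RMSE of about 12.7 matches the figure in the post, and the reported RMSE and SD would correspond to a much weaker correlation than 0.6516. That suggests the three reported numbers were not all computed from the same variables or with the same conventions, though the thread does not say which.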
 
