# Comparing SD of data with RMSE of regression line

1. Mar 17, 2016

### phosgene

1. The problem statement, all variables and given/known data

I'm being asked to compare the standard deviation of a data set with the root mean square error of the regression line used to model the data, in order to determine the reliability of the regression line.

2. Relevant equations

Mean squared error = variance + bias squared

3. The attempt at a solution

I've done some googling and found that the MSE equals the variance + bias squared. So if I'm understanding this correctly, if the regression line is reliable, the bias should be low and hence the RMSE should be approximately equal to the SD?

My main hang-up with this is that if the data just happens to fall on a straight line, it can have a non-zero standard deviation, but the line used to model the data will describe it perfectly and hence have an RMSE of zero.

My head is really spinning at this.

2. Mar 17, 2016

### andrewkirk

That's a different standard deviation you're referring to there: the standard deviation of the Y values (the observed values of the dependent variable). The standard deviation (or variance) in the formula you quote above is the standard deviation of the residuals, not of the Y values. The residual is defined as the difference between the observed value and the estimated value. The estimated value is the value on the line for a given value of X, i.e. $\beta_1X+\beta_0$, where $\beta_1$ and $\beta_0$ are the fitted coefficients (the slope and the Y intercept). If all the values lie on a non-horizontal line, the standard deviation of the residuals will be zero but the standard deviation of the Y values will not be.
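To make the distinction concrete, here is a minimal sketch using made-up points that lie exactly on the line $Y = 2X + 1$. The data, the helper name `fit_line`, and the choice of the population SD (divisor $n$) are assumptions for illustration only, not anything from the original problem.

```python
import statistics

# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a straight line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value on the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

sd_y = statistics.pstdev(ys)             # spread of the Y values
sd_resid = statistics.pstdev(residuals)  # spread of the residuals

print(f"SD of Y values:  {sd_y:.3f}")
print(f"SD of residuals: {sd_resid:.3f}")
```

The Y values are spread out only because the X values are, so their SD is non-zero, while every residual is zero because the line passes through every point.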

3. Mar 18, 2016

### phosgene

Thanks for the reply. Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line? The question is specifically referring to that SD, and its value is very similar to the RMSE of the regression line. As far as I can tell, this says nothing about the reliability of the regression line. I can picture good fits and bad fits for which the SD of the data and the RMSE of the regression line are similar.

4. Mar 18, 2016

### andrewkirk

No, they are related by the correlation coefficient $r$ estimated by the regression. See this page.
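For a straight-line least-squares fit, the relation is $RMSE = \sqrt{1-r^2}\,SD_y$, the same formula quoted later in this thread. The sketch below checks it numerically on made-up data. It assumes both the RMSE and the SD of Y use the divisor $n$; if a source uses $n-2$ for the RMSE or $n-1$ for the SD, the two sides differ by a constant factor that depends only on $n$.

```python
import math
import statistics

# Made-up, noisy, roughly linear data (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 2.9, 4.8, 5.1, 7.4, 7.0]

mx, my = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Least-squares line and the sample correlation coefficient
slope = sxy / sxx
intercept = my - slope * mx
r = sxy / math.sqrt(sxx * syy)

n = len(xs)
# RMSE of the fitted line and SD of the Y values, both with divisor n
rmse = math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n)
sd_y = math.sqrt(syy / n)

rhs = math.sqrt(1 - r ** 2) * sd_y
print(f"RMSE              = {rmse:.6f}")
print(f"sqrt(1 - r^2)*SD_y = {rhs:.6f}")
```

The two printed values agree up to floating-point rounding, which is why a weak correlation leaves the RMSE close to the SD of Y and a strong one pushes it towards zero.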

5. Mar 18, 2016

### phosgene

Ah, I think I got it now. Thanks!

6. Mar 18, 2016

### Ray Vickson

The SD of your y-values comes from two sources: (1) variations in the x-values, leading to different y-values; and (2) random errors. Both of these are present, in general.

Even if you have a perfect fit (with no residual errors) there will be an SD in the y-values, just because they are spread out as a result of the x-values being spread out. In general, though, the SD will be larger than that arising solely from the spread of the x-values; that is, there are some residual errors.

Just looking at some measure of the magnitude of the residual errors does not tell you if you have an appropriate "fit": the errors can be large just because there is a lot of noise in the data. To test if the model is appropriate, you need to actually compare two different models to see if one is significantly better than the other. For example, if you have a theoretical model of the form $Y = f(x) + \epsilon$ and a corresponding data set $y_i = f(x_i) + \epsilon_i, i=1,2, \ldots, n$ your aim might be to discover the form of the unknown function $f(x)$. Perhaps you want to see whether a linear fit $f(x) = \alpha + \beta x$ is OK, or whether you need a quadratic fit $f(x) = \alpha + \beta x + \gamma x^2$. One way would be to perform two regressions---one for the linear function and one for the quadratic---then compare their rms errors, etc. Another way would be to perform a significance test on the coefficients of the quadratic fit, to see if the quadratic term is genuine or merely an artifact of the random errors.
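The first approach, fitting both models and comparing their RMS errors, might be sketched as follows. The data are invented, and `polyfit` and `rmse` are hypothetical helper names written here with only the standard library (least squares via the normal equations); a numerical library would normally be used instead.

```python
import math

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    # Back substitution
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = s / m[i][i]
    return coef

def rmse(xs, ys, coef):
    """Root mean square of the residuals of the fitted polynomial (divisor n)."""
    resid = [y - sum(c * x ** k for k, c in enumerate(coef)) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in resid) / len(resid))

# Invented data that curve upwards, roughly like 1 + x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 5.2, 9.8, 17.1, 26.0]

rmse_linear = rmse(xs, ys, polyfit(xs, ys, 1))
rmse_quadratic = rmse(xs, ys, polyfit(xs, ys, 2))

print(f"linear RMSE:    {rmse_linear:.3f}")
print(f"quadratic RMSE: {rmse_quadratic:.3f}")
```

Because the linear model is a special case of the quadratic one, the quadratic fit can never have a larger in-sample RMSE. What matters is whether the reduction is large compared with the noise, which is where the significance test on the quadratic coefficient comes in.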

7. Mar 18, 2016

### phosgene

Ok, I went back and plugged the numbers into the relation andrewkirk pointed me to:

$RMSE=\sqrt{1-r^2}SD_{y}$

The value for the RMSE is ~16.5, whereas the value for SD of the data is ~16.7. The correlation coefficient r is given as 0.6516. With these numbers, the right side of the equation evaluates to about 12.7.
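The arithmetic can be reproduced with a short check. The three inputs are the rounded figures quoted above, so the results are only approximate; `implied_r` is a hypothetical name for the correlation that the relation would require if the quoted RMSE and SD were both taken at face value.

```python
import math

# Rounded values quoted in the post
r = 0.6516
sd_y = 16.7
reported_rmse = 16.5

# Right-hand side of RMSE = sqrt(1 - r^2) * SD_y
predicted_rmse = math.sqrt(1 - r ** 2) * sd_y

# Correlation that the same relation would need to give the reported RMSE
implied_r = math.sqrt(1 - (reported_rmse / sd_y) ** 2)

print(f"predicted RMSE from r and SD_y: {predicted_rmse:.2f}")
print(f"r implied by reported RMSE:     {implied_r:.3f}")
```

This confirms the right-hand side of about 12.7, and shows that an RMSE of 16.5 against an SD of 16.7 would correspond to a correlation of only about 0.15 under this relation, so the three quoted numbers are not mutually consistent with it as stated.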

Something seems very wrong. Since this is quickly becoming way more complicated than I think it should be (it's a non-mathematical course on business analytics - the equation for RMSE wasn't given and I'm pretty sure that the SD wasn't either), I'm going to leave it for now and wait for the lecturer to get back to me.