Comparing SD of data with RMSE of regression line

In summary, the homework statement asks if the standard deviation of the data points is related to the root mean square error of the regression line used to model the data.
  • #1
phosgene

Homework Statement



I'm being asked to compare the standard deviation of a data set with the root mean square error of the regression line used to model the data, in order to determine the reliability of the regression line.

Homework Equations



Mean squared error = variance + bias squared, i.e. ##\text{MSE} = \text{Var} + \text{Bias}^2##

The Attempt at a Solution



I've done some googling and found that the MSE equals the variance + bias squared. So if I'm understanding this correctly, if the regression line is reliable, the bias should be low and hence the RMSE should be approximately equal to the SD?
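
As a quick numerical check of that decomposition, here is a minimal Python sketch (the residual values are made up, purely illustrative): for any set of residuals, the mean squared error equals the population variance of the residuals plus the square of their mean (the bias).

[code]
import numpy as np

# Toy residuals (observed - predicted); illustrative values only
residuals = np.array([1.2, -0.7, 0.3, 2.1, -1.5, 0.6])

mse = np.mean(residuals**2)       # mean squared error
bias = np.mean(residuals)         # average residual
variance = np.var(residuals)      # population (1/n) variance of the residuals

# MSE = variance + bias^2 holds exactly with the 1/n variance
print(mse, variance + bias**2)    # both print the same number
[/code]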

My main hang up with this is that if the data just happens to fall upon a straight line, it can have a non-zero standard deviation, but the line used to model the data will describe it perfectly and hence have a RMSE of zero.

My head is really spinning over this...
 
  • #2
phosgene said:
if the data just happens to fall upon a straight line, it can have a non-zero standard deviation, but the line used to model the data will describe it perfectly and hence have a RMSE of zero.
That's a different standard deviation you're referring to there - the standard deviation of the Y values (the observed values of the dependent variable). The standard deviation (or variance) in the formula you quote above is the standard deviation of the residuals, not of the Y values. The residual is defined as the difference between the observed value and the estimated value. The estimated value is the value on the line for a given value of X, i.e. ##\beta_1 X+\beta_0##, where ##\beta_1## and ##\beta_0## are the fitted coefficients - the slope and Y intercept. If all the values lie on a non-horizontal line, the standard deviation of the residuals will be zero but the standard deviation of the Y values will not be.
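
A small Python sketch (hypothetical data, not from the thread) makes the distinction concrete: points lying exactly on a non-horizontal line have a nonzero standard deviation of the y-values but zero standard deviation of the residuals.

[code]
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                    # points lie exactly on the line y = 2x + 1

beta1, beta0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
residuals = y - (beta1 * x + beta0)

print(np.std(y))          # SD of the y-values: clearly nonzero (about 2.83)
print(np.std(residuals))  # SD of the residuals: zero, up to rounding error
[/code]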
 
  • #3
andrewkirk said:
That's a different standard deviation you're referring to there - the standard deviation of the Y values (the observed values of the dependent variable). ...

Thanks for the reply. Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line? The question is specifically referring to that SD, and its value is very similar to the RMSE of the regression line. As far as I can tell, that comparison says nothing about the reliability of the regression line. I can picture both good fits and bad fits for which the SD of the data and the RMSE of the regression line are similar...
 
  • #4
phosgene said:
Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line?
No, they are related by the correlation coefficient ##r## estimated by the regression. See http://statweb.stanford.edu/~susan/courses/s60/split/node60.html.
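
A minimal Python sketch checking that relation numerically (toy data; the identity ##RMSE=\sqrt{1-r^2}\,SD_y## is the one worked out later in the thread, using the population 1/n convention for both the RMSE and the SD):

[code]
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=4.0, size=x.size)   # noisy linear toy data

beta1, beta0 = np.polyfit(x, y, 1)        # simple linear regression
residuals = y - (beta1 * x + beta0)

rmse = np.sqrt(np.mean(residuals**2))     # root mean square error of the fit
sd_y = np.std(y)                          # SD of the y-values (1/n convention)
r = np.corrcoef(x, y)[0, 1]               # correlation coefficient

print(rmse, np.sqrt(1 - r**2) * sd_y)     # the two numbers agree (up to rounding)
[/code]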
 
  • #5
Ah, I think I got it now. Thanks!
 
  • #6
phosgene said:
Thanks for the reply. Does this mean the standard deviation of my data points isn't related to the RMSE of the regression line? ...

The SD of your y-values comes from two sources: (1) variations in the x-values, leading to different y-values; and (2) random errors. Both of these are present, in general.

Even if you have a perfect fit (with no residual errors) there will be an SD in the y-values, just because they are spread out as a result of the x-values being spread out. In general, though, the SD will be larger than that arising solely from the spread of the x-values; that is, there are some residual errors.

Just looking at some measure of the magnitude of the residual errors does not tell you if you have an appropriate "fit": the errors can be large just because there is a lot of noise in the data. To test if the model is appropriate, you need to actually compare two different models to see if one is significantly better than the other. For example, if you have a theoretical model of the form ##Y = f(x) + \epsilon## and a corresponding data set ##y_i = f(x_i) + \epsilon_i, i=1,2, \ldots, n## your aim might be to discover the form of the unknown function ##f(x)##. Perhaps you want to see whether a linear fit ##f(x) = \alpha + \beta x## is OK, or whether you need a quadratic fit ##f(x) = \alpha + \beta x + \gamma x^2##. One way would be to perform two regressions---one for the linear function and one for the quadratic---then compare their rms errors, etc. Another way would be to perform a significance test on the coefficients of the quadratic fit, to see if the quadratic term is genuine or merely an artifact of the random errors.
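
A rough Python illustration of the first approach (hypothetical data; it only compares the RMSEs of the two fits and does not perform the significance test):

[code]
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=1.0, size=x.size)  # truly quadratic data

for degree in (1, 2):                      # linear fit vs quadratic fit
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rmse = np.sqrt(np.mean(residuals**2))
    print(degree, rmse)                    # the quadratic RMSE is noticeably smaller
[/code]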
 
  • #7
Ok, I went back and plugged the numbers into the relation andrewkirk pointed me to:

[itex]RMSE=\sqrt{1-r^2}SD_{y}[/itex]

The value for the RMSE is ~16.5, whereas the value for SD of the data is ~16.7. The correlation coefficient r is given as 0.6516. With these numbers, the right side of the equation evaluates to about 12.7.
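
For reference, the right-hand side can be re-evaluated in a couple of lines of Python using the numbers quoted above:

[code]
import math

r, sd_y = 0.6516, 16.7
print(math.sqrt(1 - r**2) * sd_y)   # about 12.7, not the ~16.5 observed RMSE
[/code]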

Something seems very wrong. Since this is quickly becoming way more complicated than I think it should be (it's a non-mathematical course on business analytics - the equation for RMSE wasn't given and I'm pretty sure that the SD wasn't either), I'm going to leave it for now and wait for the lecturer to get back to me.
 

What is the difference between SD of data and RMSE of regression line?

The SD (Standard Deviation) of data is a measure of how spread out the data points are from the mean. It is calculated by taking the square root of the average of the squared differences from the mean. On the other hand, RMSE (Root Mean Square Error) of a regression line is a measure of how well the regression line fits the data. It is calculated by taking the square root of the average of the squared differences between the actual data points and the predicted values from the regression line.
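
A minimal Python sketch of both calculations side by side (toy data, purely illustrative):

[code]
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # hypothetical observations

# SD of the data: spread of y around its own mean
sd_y = np.sqrt(np.mean((y - y.mean())**2))

# RMSE of the regression line: spread of y around the fitted line
beta1, beta0 = np.polyfit(x, y, 1)
rmse = np.sqrt(np.mean((y - (beta1 * x + beta0))**2))

print(sd_y, rmse)    # the RMSE is much smaller because the line tracks the data
[/code]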

Which one should be used to evaluate the accuracy of a regression model?

RMSE is typically used to evaluate the accuracy of a regression model, as it takes into account both the bias and variability of the model. It gives a better indication of how well the model is fitting the data compared to just looking at the SD of the data alone.

Can the SD of data and RMSE of regression line be compared directly?

No, the SD of data and RMSE of regression line cannot be compared directly as they are measuring different things. The SD of data measures the spread of the actual data points, while the RMSE of regression line measures the accuracy of the predicted values from the regression line.

What does a high SD of data or RMSE of regression line indicate?

A high SD of data indicates that the data points are spread out from the mean, suggesting high variability. On the other hand, a high RMSE of regression line indicates that the predicted values from the regression line are far from the actual data points, suggesting a poor fit of the model to the data.

Is it possible for a regression model to have a low RMSE but a high SD of data?

Yes, it is possible for a regression model to have a low RMSE but a high SD of data. This happens when the data have a strong trend, so the y-values are spread widely around their mean (high SD), while the regression line follows that trend closely, so the residuals around the line are small (low RMSE). Far from indicating a problem, this combination means the regression line is explaining most of the variation in the data.
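
A minimal Python sketch of such a case (made-up numbers): strongly trending data with little scatter about the trend has a large SD of the y-values but a small RMSE.

[code]
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 5.0 * x + rng.normal(scale=0.5, size=x.size)   # strong trend, little noise

beta1, beta0 = np.polyfit(x, y, 1)
rmse = np.sqrt(np.mean((y - (beta1 * x + beta0))**2))

print(np.std(y))   # large: dominated by the trend in x
print(rmse)        # small: close to the noise scale
[/code]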
