Error on regression line slope

In summary, there are standard formulas for determining the error on the slope of a regression line and the y-intercept if the errors are normally distributed with mean 0 and a common variance. These formulas can be found in sources such as "Data Reduction and Error Analysis in the Physical Sciences" and "Numerical Recipes in C." If the errors are not normally distributed, resampling methods such as bootstrapping or jackknifing may be necessary. However, in this case, the poster is looking for simple expressions that apply to normally distributed errors with unequal variances, which can be found in Kirchner's work.
  • #1
h_user
I'm currently trying to determine the error on the slope of a regression line and the y-intercept.

My data:

y value        y error       x value
27.44535013    0.03928063    136
29.78207524    0.07836946     44
27.4482858     0.0385213     143
27.27481069    0.02117426    153


I'd like to code the solution and have attempted to do so in Python. So far I have generated different data sets by adding or subtracting the y error to get all the possible regression lines within the errors, then determined the slope and y-intercept of each to find the maximum and minimum slope and y-intercept, and hence the error. I'm not sure this is the correct method, though, and when I apply it to a larger data set the number of regression lines I have to calculate is so large the code breaks. Is there a simpler solution, or an equation I'm missing that takes account of the y error in the error on the slope?
 
  • #2
Hi h, welcome to PF :)

Kirchner (Berkeley) gives a derivation and the expressions here

[edit] His eqn (16) looks terrible in my PDF reader (Adobe XI), so I render what I can deduce, since the expressions are needed for ##s_a##:
$$e_i = Y_i - \hat Y_i$$
$$SSE = \sum e_i^2$$
$$MSE = s_{Y {\bf\cdot} X}^2={SSE\over n-2}=Var(Y)\;(1-r^2)\;{n-1\over n-2}$$
$$RMSE = s_{Y {\bf\cdot} X}=\sqrt{SSE\over n-2}=S_Y\;\sqrt{1-r^2}\;\sqrt{n-1\over n-2}$$
And here ##S_Y## is not the square root of his ##SS_Y##, but the square root of his ##SS_Y/3##, i.e. ##SS_Y/(n-1)## with ##n=4##. Very tricky.

As you can guess, I did some work here. You do yours too and we'll compare if you want. Friday at the earliest, I'm afraid.

PS Ten digits is a bit much for this kind of scatter. They must be calculation results? Of what?

[edit2] From the last expression above you can see that in fact you don't need the ##\sum e_i##, SSE, or MSE, since ##SS_Y## and ##r^2## are enough!
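
As a quick illustration, here is a minimal Python sketch of these quantities for the data in post #1 (my own sketch, assuming numpy is available):

```python
# Minimal sketch of the unweighted quantities above (assumes numpy).
import numpy as np

x = np.array([136.0, 44.0, 143.0, 153.0])
y = np.array([27.44535013, 29.78207524, 27.4482858, 27.27481069])

n = len(x)
b, a = np.polyfit(x, y, 1)   # ordinary least-squares slope b, intercept a
e = y - (a + b * x)          # residuals e_i
SSE = np.sum(e**2)
MSE = SSE / (n - 2)          # = s_{Y.X}^2
RMSE = np.sqrt(MSE)
```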
 
  • #3
Hey thanks for the reply.

Does this solution take into account the error on the y value? Should I be drawing a line of best fit that is weighted for the errors?
 
  • #4
h_user said:
Hey thanks for the reply.

Does this solution take into account the error on the y value? Should I be drawing a line of best fit that is weighted for the errors?

Look in either:

Bevington: "Data Reduction and Error Analysis in the Physical Sciences" (most good college libraries have this)
Press, et al. "Numerical Recipes in C"

You need to look up "weighted least squares" in these sources. The result is similar to what BvU has above.

You can also see:

http://en.wikipedia.org/wiki/Least_squares (Section 6 talks about weighted least squares)
http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd143.htm
http://elsa.berkeley.edu/eml/ra_reader/14-wls.pdf
http://www.stat.ncsu.edu/people/bloomfield/courses/st430-514/slides/MandS-ch09-sec04-04.pdf
 
  • #5
You can give weights to the measurements: instead of, e.g., ##\sum y_i## you use ##\sum {y_i\over \sigma_i^2}##, etc. And instead of dividing by N you divide by ##\sum {1\over \sigma_i^2}##.
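
As a sketch of that idea in Python (the helper and its name are mine, not from the thread):

```python
# Sketch of the weighting idea above: replace plain sums by
# 1/sigma^2-weighted sums, and divide by the total weight instead of N.
import numpy as np

def weighted_mean(v, sigma):
    """Weighted average of v with weights 1/sigma^2."""
    w = 1.0 / np.asarray(sigma, float)**2
    return np.sum(w * np.asarray(v, float)) / np.sum(w)
```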

In your case it doesn't make much difference (*): the point with deviating weight lies so far away from the other points that its leverage makes the line go through it anyway.

PS How accurate are your ##x_i##? The whole analysis is based on ##\sigma_x \ll \sigma_y##.

[edit] (*) Have to withdraw that: uncertainties come out twice as high
 
  • #6
h_user said:
I'm currently trying to determine the error on the slope of a regression line and the y-intercept.

My data:

y value        y error       x value
27.44535013    0.03928063    136
29.78207524    0.07836946     44
27.4482858     0.0385213     143
27.27481069    0.02117426    153


I'd like to code the solution and have attempted to do so in Python. So far I have generated different data sets by adding or subtracting the y error to get all the possible regression lines within the errors, then determined the slope and y-intercept of each to find the maximum and minimum slope and y-intercept, and hence the error. I'm not sure this is the correct method, though, and when I apply it to a larger data set the number of regression lines I have to calculate is so large the code breaks. Is there a simpler solution, or an equation I'm missing that takes account of the y error in the error on the slope?

You are doing it the hard way. If the errors are normally distributed (with mean 0 and a common, but unknown, variance), there are standard formulas for confidence intervals on the slope, intercept and predicted y(x) value. See, e.g., http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis , especially the later section called "Confidence Intervals in Simple Linear Regression". It has all the needed formulas and works through the details on some examples.

If your errors are not normally distributed you may need to resort to "resampling methods", such as bootstrapping, jackknifing, etc. See, e.g.,
http://wise.cgu.edu/downloads/Introduction%20to%20Resampling%20Techniques%20110901.pdf
for an introduction to the concepts.
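
For the equal-variance case described here, scipy already implements the standard formulas; a minimal sketch (assumes a scipy recent enough to expose intercept_stderr on the result):

```python
# Equal-variance case: scipy's linregress returns the standard errors
# from the usual formulas (sketch; intercept_stderr needs scipy >= 1.6).
from scipy import stats

x = [136, 44, 143, 153]
y = [27.44535013, 29.78207524, 27.4482858, 27.27481069]

res = stats.linregress(x, y)
print("slope     =", res.slope, "+/-", res.stderr)
print("intercept =", res.intercept, "+/-", res.intercept_stderr)
```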
 
  • #7
Ray Vickson said:
You are doing it the hard way. If the errors are normally distributed (with mean 0 and a common, but unknown, variance), there are standard formulas for confidence intervals on the slope, intercept and predicted y(x) value. See, e.g., http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis , especially the later section called "Confidence Intervals in Simple Linear Regression". It has all the needed formulas and works through the details on some examples.

If your errors are not normally distributed you may need to resort to "resampling methods", such as bootstrapping, jackknifing, etc. See, e.g.,
http://wise.cgu.edu/downloads/Introduction%20to%20Resampling%20Techniques%20110901.pdf
for an introduction to the concepts.
What the poster wants are the simple expressions for the case where the errors are normally distributed but not equal for all points. They exist, and they look much like Kirchner's.
(Don't have time now; perhaps tomorrow.)
Resampling and the like are overkill here.
 
  • #8
Hope we haven't lost h. But I promised something, so here goes:

Dear h,

Let
$$
\quad \overline Y = \sum {y_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2}, \\
\quad \overline X = \sum {x_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2},\\
\quad \overline {XY} = \sum {x_i y_i\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}, \\
\quad \overline {Y^2} = \sum {y_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2} \\
\quad \overline {X^2} = \sum {x_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}
$$
Then let
$$
\quad SS_X = \overline {X^2} - \overline {X}^2,\qquad SS_Y = \overline {Y^2} - \overline {Y}^2,\qquad SS_{XY} = \overline {XY} - \overline {X}\;\overline {Y}
$$
These are Kirchner's (11)-(13) but divided by n, so from here on we can use his expressions as long as we keep the same power of n in numerator and denominator.

For the record:
$$
\quad r^2 = {SS_{XY}^2 \over SS_X\;SS_Y} \\
\quad b = {SS_{XY} \over SS_X} \qquad \left ( \; = r\,\sqrt{SS_Y \over SS_X}\ \right )\\
\quad a = \overline Y - b\, \overline X
$$
And now come the all-important ##\sigma##:
$$
\qquad\sigma_b^2 = {SS_Y/SS_X - b^2\over n-2}\\ \ \\
\qquad\sigma_a^2 = \sigma_b^2 \left ( SS_X +\overline {X}^2 \right ) = \sigma_b^2\;\overline {X^2}
$$
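A direct transcription of these formulas into Python might look like the sketch below (my own illustration, not the poster's actual code; numpy assumed):

```python
# Sketch: weighted linear fit following the formulas above (not the
# poster's actual code). sigma holds the per-point errors on y.
import numpy as np

def weighted_fit(x, y, sigma):
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(sigma, float)**2
    W = np.sum(w)
    n = len(x)

    def wmean(v):                       # weighted average, as defined above
        return np.sum(w * v) / W

    Xbar, Ybar = wmean(x), wmean(y)
    SS_X  = wmean(x**2) - Xbar**2
    SS_Y  = wmean(y**2) - Ybar**2
    SS_XY = wmean(x * y) - Xbar * Ybar

    b = SS_XY / SS_X                    # slope
    a = Ybar - b * Xbar                 # intercept
    var_b = (SS_Y / SS_X - b**2) / (n - 2)
    var_a = var_b * (SS_X + Xbar**2)    # = var_b * wmean(x**2)
    return a, b, np.sqrt(var_a), np.sqrt(var_b)

a, b, s_a, s_b = weighted_fit(
    [136, 44, 143, 153],
    [27.44535013, 29.78207524, 27.4482858, 27.27481069],
    [0.03928063, 0.07836946, 0.0385213, 0.02117426])
print(f"y = ({a} +/- {s_a}) + ({b} +/- {s_b}) x")
```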
The Bevington reference by QD (mine is from the seventies :) ) is really excellent: it has everything, even clearer and more extensive (at various levels), plus a Fortran (!) listing that shouldn't be too difficult to re-engineer in Python.

Again: if possible, use physics sense: your y values look like calculation results; systematic errors don't average out, so be sure to keep common factors separate. You can even analyze the results for this: if the weighted and unweighted fits give quite different results, there might be something wrong with the error estimates.

Your results don't really need ten-digit accuracy. And you have to ask whether your sigmas are really distinct: the relative accuracy of a sigma based on averaging k measurements is around ##1/\sqrt k##. The 0.02 differs considerably from the 0.08; there might be an experimental reason for that.

To top it all off, I attach a few pictures of what I got using your data.
The red dot is the unweighted center of gravity, the green one the weighted. The unweighted result is identical to Excel's linear trend. Dashed lines are the fit (middle one) and the same +/- the uncertainty in the predicted ##y_i## (Kirchner (20)).

I'll be glad to share the numerical results (in the sense of sharing: you show your working and then I'll do the same :) ). I am also interested in the context: what are x and y, and how did the y and ##\sigma_i## come about?

Oh, and: anyone with comments/corrections: very welcome!
 

Attachments

  • LinLSQ_Unweighted.jpg
  • LinLSQ_Weighted.jpg

What is the meaning of "error on regression line slope"?

When performing a regression analysis, the slope of the regression line measures the relationship between the independent and dependent variables. The error on the slope is the uncertainty in the slope estimate: it indicates how much the slope could differ if the analysis were repeated with different data.

How is the error on the slope calculated?

The error on the slope is calculated as the standard error of the slope, which takes into account the variability of the data and the sample size. It is calculated by dividing the standard deviation of the residuals (the differences between the actual and predicted values, with n-2 degrees of freedom) by the square root of the sum of squared differences between the independent variable and its mean.
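
As an illustration, a sketch of the computation just described for an unweighted fit (the function name is mine):

```python
# Sketch of the standard-error formula described above, unweighted fit.
import numpy as np

def slope_standard_error(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))     # std. dev. of residuals
    return s / np.sqrt(np.sum((x - np.mean(x))**2))
```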

What does a high error on the slope indicate?

A high error on the slope suggests that there is a large amount of variation in the data and that the regression line may not accurately represent the relationship between the variables. This could be due to outliers, non-linear relationships, or other factors that are not accounted for in the model.

How does the error on the slope impact the interpretation of the regression line?

The error on the slope provides a measure of uncertainty in the slope estimate, so it is important to consider this when interpreting the regression line. A larger error on the slope means that the slope estimate is less reliable, and the relationship between the variables may not be as strong as initially thought.

What can be done to reduce the error on the slope?

To reduce the error on the slope, one can increase the sample size, which will decrease the standard error of the slope. Additionally, checking for outliers and ensuring that the data follows a linear relationship can also help improve the accuracy of the slope estimate. It is also important to carefully select the appropriate regression model for the data being analyzed.
