Error on regression line slope

In summary, there are standard formulas for determining the error on the slope of a regression line and the y-intercept if the errors are normally distributed with mean 0 and a common variance. These formulas can be found in sources such as "Data Reduction and Error Analysis in the Physical Sciences" and "Numerical Recipes in C." If the errors are not normally distributed, resampling methods such as bootstrapping or jackknifing may be necessary. However, in this case, the poster is looking for simple expressions that apply to normally distributed errors with unequal variances, which can be found in Kirchner's work.
  • #1
h_user
I'm currently trying to determine the error on the slope of a regression line and the y-intercept.

My data:

y value        y error       x value
27.44535013    0.03928063    136
29.78207524    0.07836946     44
27.4482858     0.0385213     143
27.27481069    0.02117426    153


I'd like to code the solution and have attempted to do so in Python. So far I have generated different data sets by adding or subtracting the y error to get all the possible regression lines within the errors, then determined the slope and y-intercept of each to find the maximum and minimum slope and y-intercept, and hence the error. I'm not sure this is the correct method, though, and when I apply it to a larger data set the number of regression lines I have to calculate is so large the code breaks. Is there a simpler solution, or an equation I'm missing that takes account of the y error in the error on the slope?
 
  • #2
Hi h, welcome to PF :)

Kirchner (Berkeley) gives a derivation and the expressions here

[edit] His eqn (16) looks terrible in my PDF reader (Adobe XI), so I render what I can deduce, since the expressions are needed for ##s_a##:
$$e_i = Y_i - \hat Y_i$$
$$SSE = \sum e_i^2$$
$$MSE = s_{Y {\bf\cdot} X}^2={SSE\over n-2}=Var(Y)\;(1-r^2)\;{n-1\over n-2}$$
$$RMSE = s_{Y {\bf\cdot} X}=\sqrt{SSE\over n-2}=S_Y\;\sqrt{1-r^2}\;\sqrt{n-1\over n-2}$$
And here ##S_Y## is not the square root of his ##SS_Y##, but the square root of his ##SS_Y/3##, i.e. ##SS_Y/(n-1)## with ##n=4##. Very tricky.

As you can guess, I did some work here. You do yours too and we'll compare if you want. Friday at the earliest, I'm afraid.

PS Ten digits is a bit much for this kind of scatter. They must be calculation results? Of what?

[edit2] From the last expression above you can see that in fact you don't need the ##\sum e_i##, SSE, or MSE, since ##SS_Y## and ##r^2## are enough!
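
As a quick illustration, here is a minimal Python sketch of these quantities for the data in post #1 (my own sketch, assuming numpy is available):

```python
# Minimal sketch of the unweighted quantities above (assumes numpy).
import numpy as np

x = np.array([136.0, 44.0, 143.0, 153.0])
y = np.array([27.44535013, 29.78207524, 27.4482858, 27.27481069])

n = len(x)
b, a = np.polyfit(x, y, 1)   # ordinary least-squares slope b, intercept a
e = y - (a + b * x)          # residuals e_i
SSE = np.sum(e**2)
MSE = SSE / (n - 2)          # = s_{Y.X}^2
RMSE = np.sqrt(MSE)
```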
 
  • #3
Hey thanks for the reply.

Does this solution take into account the error on the y value? Should I be drawing a line of best fit that is weighted for the errors?
 
  • #4
h_user said:
Hey thanks for the reply.

Does this solution take into account the error on the y value? Should I be drawing a line of best fit that is weighted for the errors?

Look in either:

Bevington: "Data Reduction and Error Analysis in the Physical Sciences" (most good college libraries have this)
Press, et al. "Numerical Recipes in C"

You need to look up "weighted least squares" in these sources. The result is similar to what BvU has above.

You can also see:

http://en.wikipedia.org/wiki/Least_squares (Section 6 talks about weighted least squares)
http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd143.htm
http://elsa.berkeley.edu/eml/ra_reader/14-wls.pdf
http://www.stat.ncsu.edu/people/bloomfield/courses/st430-514/slides/MandS-ch09-sec04-04.pdf
 
  • #5
You can give weights to the measurements: instead of, e.g., ##\sum y_i## you use ##\sum {y_i\over \sigma_i^2}##, etc. And instead of dividing by N you divide by ##\sum {1\over \sigma_i^2}##.
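
As a sketch of that idea in Python (the helper and its name are mine, not from the thread):

```python
# Sketch of the weighting idea above: replace plain sums by
# 1/sigma^2-weighted sums, and divide by the total weight instead of N.
import numpy as np

def weighted_mean(v, sigma):
    """Weighted average of v with weights 1/sigma^2."""
    w = 1.0 / np.asarray(sigma, float)**2
    return np.sum(w * np.asarray(v, float)) / np.sum(w)
```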

In your case it doesn't make much difference (*): the point with deviating weight lies so far away from the other points that its leverage makes the line go through it anyway.

PS How accurate are your ##x_i##? The whole analysis is based on ##\sigma_x \ll \sigma_y##.

[edit] (*) Have to withdraw that: uncertainties come out twice as high
 
  • #6
h_user said:
I'm currently trying to determine the error on the slope of a regression line and the y-intercept.

My data:

y value        y error       x value
27.44535013    0.03928063    136
29.78207524    0.07836946     44
27.4482858     0.0385213     143
27.27481069    0.02117426    153


I'd like to code the solution and have attempted to do so in Python. So far I have generated different data sets by adding or subtracting the y error to get all the possible regression lines within the errors, then determined the slope and y-intercept of each to find the maximum and minimum slope and y-intercept, and hence the error. I'm not sure this is the correct method, though, and when I apply it to a larger data set the number of regression lines I have to calculate is so large the code breaks. Is there a simpler solution, or an equation I'm missing that takes account of the y error in the error on the slope?

You are doing it the hard way. If the errors are normally distributed (with mean 0 and a common, but unknown, variance), there are standard formulas for confidence intervals on the slope, intercept and predicted y(x) value. See, e.g., http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis , especially the later section called "Confidence Intervals in Simple Linear Regression". It has all the needed formulas and works through the details on some examples.

If your errors are not normally distributed you may need to resort to "resampling methods", such as bootstrapping, jackknifing, etc. See, e.g.,
http://wise.cgu.edu/downloads/Introduction%20to%20Resampling%20Techniques%20110901.pdf
for an introduction to the concepts.
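
For the equal-variance case described here, scipy already implements the standard formulas; a minimal sketch (assumes a scipy recent enough to expose intercept_stderr on the result):

```python
# Equal-variance case: scipy's linregress returns the standard errors
# from the usual formulas (sketch; intercept_stderr needs scipy >= 1.6).
from scipy import stats

x = [136, 44, 143, 153]
y = [27.44535013, 29.78207524, 27.4482858, 27.27481069]

res = stats.linregress(x, y)
print("slope     =", res.slope, "+/-", res.stderr)
print("intercept =", res.intercept, "+/-", res.intercept_stderr)
```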
 
  • #7
Ray Vickson said:
You are doing it the hard way. If the errors are normally distributed (with mean 0 and a common, but unknown, variance), there are standard formulas for confidence intervals on the slope, intercept and predicted y(x) value. See, e.g., http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis , especially the later section called "Confidence Intervals in Simple Linear Regression". It has all the needed formulas and works through the details on some examples.

If your errors are not normally distributed you may need to resort to "resampling methods", such as bootstrapping, jackknifing, etc. See, e.g.,
http://wise.cgu.edu/downloads/Introduction%20to%20Resampling%20Techniques%20110901.pdf
for an introduction to the concepts.
What the poster wants are the simple expressions for the case where the errors are normally distributed but not equal for all points. They exist, and they look much like Kirchner's.
(Don't have time now; perhaps tomorrow.)
Resampling and the like are overkill here.
 
  • #8
Hope we haven't lost h. But I promised something, so here goes:

Dear h,

Let
$$
\quad \overline Y = \sum {y_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2}, \\
\quad \overline X = \sum {x_i\over \sigma_i^2} \Big / \sum {1\over \sigma_i^2},\\
\quad \overline {XY} = \sum {x_i y_i\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}, \\
\quad \overline {Y^2} = \sum {y_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2} \\
\quad \overline {X^2} = \sum {x_i^2\over \sigma_i^2} \Big/ \sum {1\over \sigma_i^2}
$$
Then let
$$
\quad SS_X = \overline {X^2} - \overline {X}^2,\qquad SS_Y = \overline {Y^2} - \overline {Y}^2,\qquad SS_{XY} = \overline {XY} - \overline {X}\;\overline {Y}
$$
These are Kirchner's (11)-(13) but divided by n, so from here on we can use his expressions as long as we keep the same power of n in numerator and denominator.

For the record:
$$
\quad r^2 = {SS_{XY}^2 \over SS_X\;SS_Y} \\
\quad b = {SS_{XY} \over SS_X} \qquad \left ( \; = r\,\sqrt{SS_Y \over SS_X}\ \right )\\
\quad a = \overline Y - b\, \overline X
$$
And now come the all-important ##\sigma##:
$$
\qquad\sigma_b^2 = {SS_Y/SS_X - b^2\over n-2}\\ \ \\
\qquad\sigma_a^2 = \sigma_b^2 \left ( SS_X +\overline {X}^2 \right ) = \sigma_b^2\;\overline {X^2}
$$
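A direct transcription of these formulas into Python might look like the sketch below (my own illustration, not the poster's actual code; numpy assumed):

```python
# Sketch: weighted linear fit following the formulas above (not the
# poster's actual code). sigma holds the per-point errors on y.
import numpy as np

def weighted_fit(x, y, sigma):
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(sigma, float)**2
    W = np.sum(w)
    n = len(x)

    def wmean(v):                       # weighted average, as defined above
        return np.sum(w * v) / W

    Xbar, Ybar = wmean(x), wmean(y)
    SS_X  = wmean(x**2) - Xbar**2
    SS_Y  = wmean(y**2) - Ybar**2
    SS_XY = wmean(x * y) - Xbar * Ybar

    b = SS_XY / SS_X                    # slope
    a = Ybar - b * Xbar                 # intercept
    var_b = (SS_Y / SS_X - b**2) / (n - 2)
    var_a = var_b * (SS_X + Xbar**2)    # = var_b * wmean(x**2)
    return a, b, np.sqrt(var_a), np.sqrt(var_b)

a, b, s_a, s_b = weighted_fit(
    [136, 44, 143, 153],
    [27.44535013, 29.78207524, 27.4482858, 27.27481069],
    [0.03928063, 0.07836946, 0.0385213, 0.02117426])
print(f"y = ({a} +/- {s_a}) + ({b} +/- {s_b}) x")
```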
The Bevington reference by QD (mine is from the seventies :) ) is really excellent: it has everything, even clearer and more extensive (at various levels), plus a Fortran (!) listing that shouldn't be too difficult to re-engineer in Python.

Again: if possible, use physics sense: your y values look like calculation results; systematic errors don't average out, so be sure to keep common factors separate. You can even analyze the results for this: if the weighted and unweighted fits give quite different results, there might be something wrong with the error estimates.

Your results don't really need ten-digit accuracy. And you have to ask whether your sigmas are really distinct: the relative accuracy of a sigma based on averaging k measurements is around ##1/\sqrt k##. The 0.02 differs considerably from the 0.08; there might be an experimental reason for that.

To top it all off, I attach a few pictures of what I got using your data.
The red dot is the unweighted center of gravity, the green one the weighted. The unweighted result is identical to Excel's linear trend. Dashed lines are the fit (middle one) and the same +/- the uncertainty in the predicted ##y_i## (Kirchner (20)).

I'll be glad to share the numerical results (in the sense of sharing: you show your working and then I'll do the same :) ). I am also interested in the context: what are x and y, and how did the y and ##\sigma_i## come about?

Oh, and: anyone with comments/corrections: very welcome!
 

Attachments

  • LinLSQ_Unweighted.jpg
  • LinLSQ_Weighted.jpg

What is the meaning of "error on regression line slope"?

When performing a regression analysis, the slope of the regression line measures the relationship between the independent and dependent variables. The error on the slope is the uncertainty in the slope estimate: it indicates how much the slope could differ if the analysis were repeated with different data.

How is the error on the slope calculated?

The error on the slope is calculated as the standard error of the slope, which takes into account the variability of the data and the sample size. It is calculated by dividing the standard deviation of the residuals (the differences between the actual and predicted values, with n-2 degrees of freedom) by the square root of the sum of squared differences between the independent variable and its mean.
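
As an illustration, a sketch of the computation just described for an unweighted fit (the function name is mine):

```python
# Sketch of the standard-error formula described above, unweighted fit.
import numpy as np

def slope_standard_error(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))     # std. dev. of residuals
    return s / np.sqrt(np.sum((x - np.mean(x))**2))
```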

What does a high error on the slope indicate?

A high error on the slope suggests that there is a large amount of variation in the data and that the regression line may not accurately represent the relationship between the variables. This could be due to outliers, non-linear relationships, or other factors that are not accounted for in the model.

How does the error on the slope impact the interpretation of the regression line?

The error on the slope provides a measure of uncertainty in the slope estimate, so it is important to consider this when interpreting the regression line. A larger error on the slope means that the slope estimate is less reliable, and the relationship between the variables may not be as strong as initially thought.

What can be done to reduce the error on the slope?

To reduce the error on the slope, one can increase the sample size, which will decrease the standard error of the slope. Additionally, checking for outliers and ensuring that the data follows a linear relationship can also help improve the accuracy of the slope estimate. It is also important to carefully select the appropriate regression model for the data being analyzed.
