# Linear regression, including uncertainties

My problem in short:

I have a set of data, and I want to calculate the linear regression and the uncertainty of the slope of the regression line, based on the uncertainties of the variables.

My problem in detail:

My data is from an experiment and the uncertainties (errors) are from experimental imprecision.

In my case I am comparing these two variables:
x = a reading on a pressure meter,
y = a number on a counter.

Every time the pressure meter passed a multiple of 100 (100, 200, 300, etc.), I noted down the values of x and y (pressure meter and counter).
I estimate the error of my reading on the pressure meter to be 10, and the error of my reading on the counter to be 1.

So some points from my data could look like this:

x1 = 100 ± 10,  y1 = 4 ± 1
x2 = 200 ± 10,  y2 = 7 ± 1
x3 = 300 ± 10,  y3 = 13 ± 1

So I say that the error for every x value is ± 10,
and the error for every y value is ± 1.

My goal is to find the slope (or the formula) of the linear regression line through these data points and through the point (0,0) (intercept = 0). This is the easy part, though.

Most of all I'd like to find the uncertainty of the slope of the line, based on the uncertainties of the x and y values.

I have tried various programs, including Excel, Graphical Analysis, Prism and pro Fit, without luck. Does anyone know of a program that does this, or of a mathematical method I could use?

regards
Frímann Kjerúlf

You could try Excel again.
The function LINEST could solve your problem, but some interpretation of the results may be necessary.
See the Excel help for full details on LINEST. The syntax is as follows:

LINEST(known_y's,known_x's,const,stats)

The parameter const is TRUE or FALSE depending on whether or not you include a constant in the regression (y=ax+b or y=ax).

The stats parameter is very important for you.
If you set it to TRUE, Excel will also return the regression "statistics".
These include the standard error (uncertainty) of the slope.
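For anyone who wants to see the arithmetic outside Excel, here is a minimal Python sketch of a fit through the origin with a standard error on the slope. It uses the textbook least-squares formulas, which I assume correspond to what LINEST reports with const = FALSE and stats = TRUE. The function name `fit_through_origin` is made up for illustration, and the data are the three example points from the question.

```python
import math

def fit_through_origin(xs, ys):
    """Least-squares slope for y = a*x and its standard error from the scatter."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Residual sum of squares; one fitted parameter, so n - 1 degrees of freedom.
    ssr = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    dof = len(xs) - 1
    std_err = math.sqrt(ssr / dof / sxx)
    return slope, std_err

# Example points from the question (reading errors are NOT used here).
xs = [100.0, 200.0, 300.0]
ys = [4.0, 7.0, 13.0]
slope, std_err = fit_through_origin(xs, ys)
print(f"slope = {slope:.4f} +/- {std_err:.4f}")
```

Note that this standard error comes only from how much the points scatter about the line; the ± 10 and ± 1 reading errors never enter the calculation.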

See the help, or see http://www.colby.edu/chemistry/PChem/notes/linest.pdf for an example.

Warning: this advice may be a bit too optimistic. You will see that linest(y,x,...) does not give the same result as linest(x,y,...). You should think about why that is so; it is related to the uncertainties on both x and y.
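A small Python sketch of that asymmetry, using the three example points from the question and a fit through the origin. The helper name `slope_through_origin` is hypothetical. Regressing y on x puts all the error in y, while regressing x on y puts all the error in x, so the two give different estimates of the same slope.

```python
def slope_through_origin(us, vs):
    """Least-squares slope of v = a*u, minimising the errors in v only."""
    return sum(u * v for u, v in zip(us, vs)) / sum(u * u for u in us)

xs = [100.0, 200.0, 300.0]
ys = [4.0, 7.0, 13.0]

# y regressed on x: assumes x is exact and all the error is in y.
a_yx = slope_through_origin(xs, ys)
# x regressed on y, then inverted to get dy/dx: assumes y is exact.
a_xy = 1.0 / slope_through_origin(ys, xs)

print(f"y on x: {a_yx:.5f}")
print(f"x on y: {a_xy:.5f} (expressed as dy/dx)")
```

When both variables carry errors, as here, neither assumption is strictly right, which is why the uncertainties on x and y both need to enter the fit.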

To understand the principles,
you could read "Numerical Recipes" here: http://www.nrbook.com/a/bookcpdf.php

After reading that, there will be many options for you.
You might check if the results provided by linest can be fully exploited according to the theory.
Alternatively, you could program what is written in this chapter. In Excel it would be easy too.
You might also go further:
- assuming a given correlation line, what is the probability of observing your experimental data?
- choosing the line that matches the highest probability (likelihood)
- trying to find the probability distribution for the slope
- this you could do by simulating experimental points around your regression and calculating the slope each time for this set of simulated data
- ... a lot of fun if you want
- you could also read about parameter estimation theory; "Statistics" in the McGraw-Hill collection should give the formula for the uncertainty on the slope
- Numerical Recipes, formula 15.2.19, could be of interest to you, but you would need to modify it to account for uncertainties on both x and y, which is not so difficult I think
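The simulation idea from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ± 10 and ± 1 are treated as one-standard-deviation Gaussian errors, the nominal x values and the fitted line are taken as the "truth", and the function names are made up.

```python
import random
import statistics

def slope_through_origin(xs, ys):
    """Least-squares slope for y = a*x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def simulate_slopes(xs, ys, sigma_x, sigma_y, n_trials=20000, seed=1):
    """Refit many simulated data sets; return best slope, mean and spread."""
    rng = random.Random(seed)
    best = slope_through_origin(xs, ys)
    slopes = []
    for _ in range(n_trials):
        # Add Gaussian reading errors to both coordinates of every point.
        sim_x = [rng.gauss(x, sigma_x) for x in xs]
        sim_y = [rng.gauss(best * x, sigma_y) for x in xs]
        slopes.append(slope_through_origin(sim_x, sim_y))
    return best, statistics.mean(slopes), statistics.stdev(slopes)

xs = [100.0, 200.0, 300.0]
ys = [4.0, 7.0, 13.0]
best, mean_slope, spread = simulate_slopes(xs, ys, sigma_x=10.0, sigma_y=1.0)
print(f"fitted slope        = {best:.4f}")
print(f"mean of simulations = {mean_slope:.4f}")
print(f"spread (std dev)    = {spread:.4f}")
```

The spread of the simulated slopes is then an estimate of the slope uncertainty that includes both the x and the y reading errors.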

You might try to generalise formula 15.2.19 from numerical recipes, following the lines of chapter 15.3. This should not be too difficult, intuitively. Read around formula 15.3.5.
More importantly, you could proceed numerically, by calculating the sensitivity of the chi² to small changes in the slope.
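Here is a rough Python sketch of that numerical approach. It assumes the "effective variance" form of chi² for a line through the origin, chi²(a) = Σ (y - a·x)² / (σy² + a²·σx²), which is my reading of the treatment around formula 15.3.5, and it takes the one-sigma range of the slope to be where chi² rises by 1 above its minimum. The helper names and the search range 0 to 0.1 are choices made for this example.

```python
def chi2(a, xs, ys, sigma_x, sigma_y):
    """Chi-square for y = a*x with errors on both axes (effective variance)."""
    var = sigma_y ** 2 + (a * sigma_x) ** 2
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / var

def minimise(f, lo, hi, iters=200):
    """Ternary search; assumes f has a single minimum between lo and hi."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def crossing(f, target, inside, outside, iters=100):
    """Bisection for f(a) = target between a point below and a point above it."""
    for _ in range(iters):
        mid = 0.5 * (inside + outside)
        if f(mid) < target:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)

xs = [100.0, 200.0, 300.0]
ys = [4.0, 7.0, 13.0]
f = lambda a: chi2(a, xs, ys, sigma_x=10.0, sigma_y=1.0)

a_best = minimise(f, 0.0, 0.1)
target = f(a_best) + 1.0  # delta chi-square = 1 marks the one-sigma range
a_low = crossing(f, target, a_best, 0.0)
a_high = crossing(f, target, a_best, 0.1)
print(f"slope = {a_best:.4f}, one-sigma range {a_low:.4f} to {a_high:.4f}")
```

Because the variance itself depends on the slope, the range need not be exactly symmetric about the best value.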

michel

Hi

I took a look at LINEST in Excel, and it seems to me that this method only calculates the error from the scatter of the points, and does not take into account any uncertainty of the individual points.

I also looked at the book you pointed me to. It seems like this is exactly the info I need, though the math seems a little hard and would take me some time to figure out. From my first look, it seemed that these formulas only work for uncertainties on y, and assume that x is always exact. I might be wrong though. In my case I need to calculate the slope uncertainty from both the x and y uncertainties.

I have an idea though.

What if I take the first and last x values in my dataset and, based on the uncertainties of x and y, calculate the slope of the "worst line" through these two points? Then I would take the difference between that slope and the slope of the regression line through the dataset, and use that as my uncertainty.

Something like this:

Using Excel I get a formula for the regression line, which might be:
y=10 * x

From that I know that the slope for the best line (regression line) is 10.

I estimate the uncertainty of x to be ± 2
And the uncertainty of y to be ± 10

So now I have:

Δx = ± 2              (uncertainty of x)
Δy = ± 10             (uncertainty of y)
x1 = 33               (first x in the data set)
x2 = 113              (last x in the data set)
a1 = 10               (slope of the linear regression line y = a1 * x)
y1 = a1 * x1 = 330    (value of the regression line at x1)
y2 = a1 * x2 = 1130   (value of the regression line at x2)

Now I take the worst line through these two points to be the steepest line that still lies within the uncertainties of the two points.
See the picture for a better explanation:

http://213.213.137.96/~terminal/uncertainty.jpg

Now using the end points of the worst line ( X1 , Y1 ) and ( X2 , Y2 ) I calculate the slope of the worst line

X1 = x1 + Δx = 35
Y1 = y1 - Δy = 320

X2 = x2 - Δx = 111
Y2 = y2 + Δy = 1140

So the slope for the "worst line" would be:

a2 = ( Y2 - Y1 ) / ( X2 - X1 ) = 820 / 76 ≈ 10.8

Now subtracting a1 from a2 to get the difference of the slopes:

a2 - a1 ≈ 0.8

Could I use that difference as the uncertainty of the slope, based on the uncertainty of the data set?

So the slope of the regression line would be: 10 ± 0.8

Would this work?
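For reference, the arithmetic above can be written as a short Python sketch. The function name `extreme_slopes` is made up, and the shallowest line is added here only for comparison; it is not part of the calculation above.

```python
def extreme_slopes(x1, y1, x2, y2, dx, dy):
    """Steepest and shallowest lines that still pass through both error boxes."""
    steepest = ((y2 + dy) - (y1 - dy)) / ((x2 - dx) - (x1 + dx))
    shallowest = ((y2 - dy) - (y1 + dy)) / ((x2 + dx) - (x1 - dx))
    return steepest, shallowest

a1 = 10.0               # slope of the regression line y = a1 * x
x1, x2 = 33.0, 113.0    # first and last x in the data set
steepest, shallowest = extreme_slopes(
    x1, a1 * x1, x2, a1 * x2, dx=2.0, dy=10.0
)
print(f"steepest   = {steepest:.3f} (a1 + {steepest - a1:.3f})")
print(f"shallowest = {shallowest:.3f} (a1 - {a1 - shallowest:.3f})")
```

The two differences are not equal, so a single ± value from this method is only approximate, and it uses just the two end points rather than the whole data set.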

Really hope I got this right :)
regards
Frímann Kjerúlf

I forgot to add that the correlation coefficient for the dataset is 0.999, and I would say that this method only works when the correlation coefficient is very close to 1.

dreamspy,

It is clear that with the small number of points in the data set (4 points, including (0,0)), looking at the various lines that can be drawn gives you a few possible slopes. Therefore, you can easily give a range for the estimated slope.

In addition, I now understand that your point (x,y)=(0,0) has no error on it.
Therefore you are looking for a regression without constant term: y= a*x (and not y=a*x+b).

In this case, you only need to calculate the slope from each of your three data points, as well as the uncertainty on each of these slopes:

s1 = y1/x1,  standard deviation d1
s2 = y2/x2,  standard deviation d2
s3 = y3/x3,  standard deviation d3

Above, d1 is given by the relation d1² = (dy1²·x1² + dx1²·y1²) / x1⁴, assuming uncorrelated Gaussian distributions for x1 and y1.
The formulas for d2 and d3 are similar.
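The propagation formula above is easy to evaluate directly. A minimal Python sketch, using the three example points from the question with dx = 10 and dy = 1 (the function name `slope_and_sigma` is just for illustration):

```python
import math

def slope_and_sigma(x, y, dx, dy):
    """Slope s = y/x of a single point and its propagated standard deviation."""
    s = y / x
    d = math.sqrt((dy ** 2 * x ** 2 + dx ** 2 * y ** 2) / x ** 4)
    return s, d

points = [(100.0, 4.0), (200.0, 7.0), (300.0, 13.0)]
results = [slope_and_sigma(x, y, dx=10.0, dy=1.0) for x, y in points]
for i, (s, d) in enumerate(results, start=1):
    print(f"point {i}: s = {s:.3f}, d = {d:.4f}")
```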

You can then calculate the most probable slope and the uncertainty on this most probable slope.
In this most probable slope, the slope calculated from each point carries a weight.
The weight is greater for the more precise evaluations.
Therefore, point P3 = (300, 13) will be the most important.
The information provided by the points P1 and P2 will probably play a smaller role.

You need to look in a statistics book to see how s1, d1, s2, d2, s3, and d3 can be combined to get the most probable estimate and its uncertainty: s and d.
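One standard way to do this combination, which I am assuming is what is meant here, is the inverse-variance weighted mean: each slope gets weight 1/d², and the combined uncertainty is 1/sqrt(sum of weights). It relies on the three slope estimates being independent and roughly Gaussian. A Python sketch with made-up function names:

```python
import math

def slope_and_sigma(x, y, dx, dy):
    """Slope s = y/x of a single point and its propagated standard deviation."""
    s = y / x
    d = math.sqrt((dy ** 2 * x ** 2 + dx ** 2 * y ** 2) / x ** 4)
    return s, d

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard deviation."""
    weights = [1.0 / sig ** 2 for sig in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, 1.0 / math.sqrt(total)

points = [(100.0, 4.0), (200.0, 7.0), (300.0, 13.0)]
pairs = [slope_and_sigma(x, y, dx=10.0, dy=1.0) for x, y in points]
s, d = weighted_mean([p[0] for p in pairs], [p[1] for p in pairs])
print(f"combined slope s = {s:.4f}, uncertainty d = {d:.4f}")
```

The most precise point automatically dominates the result, because its weight 1/d² is the largest.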

There could be a little bit more to look at in statistics.
Indeed, it may be possible that s1 and d1 are in contradiction with s2 and d2, for example.
This should not be the case with your data, but it can happen sometimes.
Generally it is important to check whether different data are compatible.
Look in the "variance analysis" chapter of a statistics book.

Michel

Postscriptum:

I got these slopes and uncertainties from the three data points:

point   slope   uncertainty
1       0.040   0.0108
2       0.035   0.0053
3       0.043   0.0036

You can see that point 3 indeed provides the best data.
You can also see that point 2 is nearly inconsistent with the other data, depending on the probability tolerance. Indeed, random errors have little chance of explaining such a large difference from point 3. To be checked.
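One simple way to do that check is to express the difference between two slope estimates in units of their combined standard deviation. This Python sketch assumes the two estimates are independent and Gaussian; the function names are made up.

```python
import math

def slope_and_sigma(x, y, dx, dy):
    """Slope s = y/x of a single point and its propagated standard deviation."""
    s = y / x
    d = math.sqrt((dy ** 2 * x ** 2 + dx ** 2 * y ** 2) / x ** 4)
    return s, d

def pull(s_a, d_a, s_b, d_b):
    """Difference of two independent estimates in combined standard deviations."""
    return abs(s_a - s_b) / math.sqrt(d_a ** 2 + d_b ** 2)

s2, d2 = slope_and_sigma(200.0, 7.0, dx=10.0, dy=1.0)
s3, d3 = slope_and_sigma(300.0, 13.0, dx=10.0, dy=1.0)
n_sigma = pull(s2, d2, s3, d3)
print(f"points 2 and 3 differ by {n_sigma:.2f} combined standard deviations")
```

With these numbers the difference comes out at about 1.3 combined standard deviations; whether that counts as inconsistent depends on the tolerance you choose.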

Am. J. Phys. Paper on Uncertainty of Slope (Best Fit)

This paper might be of interest to you about the uncertainty
in slope after regression analysis has been performed.

Michael J. Ruiz
UNC-Asheville

American Journal of Physics -- February 1991 -- Volume 59, Issue 2, pp. 184-185

Uncertainty in the linear regression slope
Jack Higbie
Department of Physics, University of Queensland, Brisbane 4072, Australia

(Received 12 December 1989; accepted 28 January 1990)

©1991 American Association of Physics Teachers

doi:10.1119/1.16607
PACS: 06.50.Mk, 02.60.Ed

Thanks for your answer. Is this paper available online? I did a quick library search here in Iceland and didn't find a copy.

regards
frímann

The Paper on Slope Uncertainty

Hi,

I would first try to see if your school library has the
hard copy of the journal: American Journal of Physics.
Then, check if your library has a subscription to it -
many schools do. If that does not work, then go
to the journal web site, but I believe you will have to pay a
nominal fee to download it. It is a very short
paper.

The key formula is this: the uncertainty
sigma(slope) = |slope| tan[arccos(R)]/sqrt(N-2)
where R is the correlation coefficient
R = cov(x,y)/sqrt[var(x)var(y)]

and N - 2 refers to the number of degrees of
freedom in the data - where 2 have been lost to fit the
slope and intercept.
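Here is a small Python sketch of that formula, applied to the three example points from the start of the thread purely as an illustration. Two details are my own additions rather than part of the formula as quoted: the absolute value of R (so the result stays positive for a negative correlation) and the clamp to 1 (to guard against rounding). Note that this is for a fit with an intercept, and that it uses only the scatter of the points, not the individual reading errors.

```python
import math

def slope_uncertainty(xs, ys):
    """Slope of y = a*x + b, correlation R, and sigma(slope) from the formula above."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    # abs() and min() are added safeguards, not part of the quoted formula.
    angle = math.acos(min(1.0, abs(r)))
    sigma = abs(slope) * math.tan(angle) / math.sqrt(n - 2)
    return slope, r, sigma

xs = [100.0, 200.0, 300.0]
ys = [4.0, 7.0, 13.0]
slope, r, sigma = slope_uncertainty(xs, ys)
print(f"slope = {slope:.4f}, R = {r:.4f}, sigma(slope) = {sigma:.4f}")
```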

I am now studying this area of statistics - I am
not an expert. I am still searching on the internet
for an equivalent discussion and might find one. By
the way, the paper refers to Mathews and Walker -
Mathematical Methods of Physics - second edition for some
related analysis. I hope this helps.

Mike