Linear regression, including uncertainties

  1. My problem in short:

    I have a set of data, and I want to calculate the linear regression line and the uncertainty of its slope, based on the uncertainties of the variables.


    My problem in detail:

    My data is from an experiment and the uncertainties (errors) are from experimental imprecision.

    In my case I am comparing these two variables
    x= a reading on a pressure meter,
    y= a number on a counter.

    Every time the pressure meter passed a multiple of 100 (100, 200, 300, etc.), I noted down the values of x and y (pressure meter and counter).
    I estimate the error of my reading on the pressure meter to be 10, and the error of my reading on the counter to be 1.

    So some points from my data could look like this:

    x1 = 100 ± 10,  y1 = 4 ± 1
    x2 = 200 ± 10,  y2 = 7 ± 1
    x3 = 300 ± 10,  y3 = 13 ± 1

    So I say that the error for every x value is ± 10
    and the error for every y value is ± 1

    My goal is to find the slope (or the formula) of the linear regression line through these data points and through the point (0,0) (intercept = 0). That part is easy, though.

    Most of all I'd like to find the uncertainty of the slope of the line, based on the uncertainties of the X and Y values.

    I have tried various programs, including Excel, Graphical Analysis, Prism and pro Fit, without luck. Does anyone know of a program to do this, or the mathematical method I could use?

    regards
    Frímann Kjerúlf
     
  3. You could try Excel again.
    The function linest could solve your problem. But some interpretation of the results may be necessary.
    See the Excel help for full details on linest. The syntax is as follows:

    LINEST(known_y's,known_x's,const,stats)

    The parameter const is true or false, depending on whether you include a constant in the regression (y = ax + b) or not (y = ax).

    The stats parameter is very important for you.
    If you set it to true, Excel will provide the regression "statistics".
    These include the uncertainties on the slopes.

    See the help or see this web page for example.

    Warning: this advice may be a bit too optimistic. You will see that linest(y, x, ...) does not give the same result as linest(x, y, ...). You should think about why that is so; it is related to the uncertainties on both x and y.
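To see that asymmetry concretely, here is a small Python sketch standing in for the two LINEST calls (the x/y values are the example readings from the question; everything else is my choice of illustration):

```python
from statistics import mean

# Example readings from the question: pressure meter (x) and counter (y)
x = [100.0, 200.0, 300.0]
y = [4.0, 7.0, 13.0]

def ols_slope(u, v):
    """Ordinary least-squares slope of v against u (v = a*u + b).
    Minimises vertical residuals only, i.e. assumes all error is in v."""
    ub, vb = mean(u), mean(v)
    return sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v)) / \
           sum((ui - ub) ** 2 for ui in u)

slope_yx = ols_slope(x, y)        # like LINEST(y, x, ...)
slope_xy = 1.0 / ols_slope(y, x)  # like 1 / LINEST(x, y, ...)

print(slope_yx)  # 0.045
print(slope_xy)  # ~0.04667 -- a different line
```

The two fits disagree because each one puts all the error on one axis; with uncertainties on both x and y, neither is quite right on its own.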

    To understand the principles
    you could read "numerical recipes" there: http://www.nrbook.com/a/bookcpdf.php
    Download the necessary plugin and jump to chapter 15.3: "Straight-Line Data with Errors in Both Coordinates".

    After reading that, there will be many options for you.
    You might check if the results provided by linest can be fully exploited according to the theory.
    Alternatively, you could program what is written in this chapter. In Excel it would be easy too.
    You might also go further:
    - assuming a given regression line, what is the probability of observing your experimental data
    - choosing the line that maximises this probability (maximum likelihood)
    - trying to find the probability distribution for the slope
    - this you could do by simulating experimental points around your regression line and calculating the slope each time for this set of simulated data
    - ... a lot of fun if you want
    - you could also read about parameter estimation theory; "Statistics" in the McGraw-Hill collection should give the formula for the uncertainty on the slope
    - Numerical Recipes formula 15.2.19 could be of interest to you, but you would need to modify it to account for uncertainties on both x and y, not so difficult I think
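The simulation idea in the list above can be sketched in a few lines of Python. The data and the reading uncertainties are from the question; the Gaussian noise model, the fixed seed and the 10,000 trials are my assumptions:

```python
import random
from statistics import stdev

x = [100.0, 200.0, 300.0]
y = [4.0, 7.0, 13.0]
dx, dy = 10.0, 1.0  # stated reading uncertainties

def slope_through_origin(u, v):
    # Least-squares slope of v = a*u (no intercept): a = sum(u*v) / sum(u^2)
    return sum(ui * vi for ui, vi in zip(u, v)) / sum(ui * ui for ui in u)

random.seed(1)
slopes = []
for _ in range(10000):
    # Jitter each point within its (assumed Gaussian) uncertainties
    xs = [xi + random.gauss(0.0, dx) for xi in x]
    ys = [yi + random.gauss(0.0, dy) for yi in y]
    slopes.append(slope_through_origin(xs, ys))

print(slope_through_origin(x, y))  # best-fit slope on the raw data
print(stdev(slopes))               # spread of the simulated slopes
```

The standard deviation of the simulated slopes is then a direct numerical estimate of the slope uncertainty.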

    An additional approach
    You might try to generalise formula 15.2.19 from numerical recipes, following the lines of chapter 15.3. This should not be too difficult, intuitively. Read around formula 15.3.5.
    More importantly, you could proceed numerically, by calculating the sensitivity of the chi² to small changes in the slope.
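A minimal sketch of that numerical chi² approach, assuming the "effective variance" form of chapter 15.3 (for a line through the origin each point contributes (y - b*x)² / (dy² + b²*dx²)); the brute-force grid and the delta-chi² = 1 rule for the 1-sigma interval are my choices here, not from the thread:

```python
x = [100.0, 200.0, 300.0]
y = [4.0, 7.0, 13.0]
dx, dy = 10.0, 1.0

def chi2(b):
    # Effective variance: the x-uncertainty is projected onto y via the slope b
    return sum((yi - b * xi) ** 2 / (dy ** 2 + (b * dx) ** 2)
               for xi, yi in zip(x, y))

# Brute-force scan for the minimising slope
grid = [i * 1e-5 for i in range(1, 20000)]  # candidate slopes 0.00001 .. 0.2
b_best = min(grid, key=chi2)

# Rough 1-sigma interval: slopes where chi2 rises by 1 above its minimum
inside = [b for b in grid if chi2(b) <= chi2(b_best) + 1.0]
print(b_best, inside[0], inside[-1])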

    Enjoy your crunch,

    michel
     
    Last edited: Feb 16, 2007
  4. Hi

    Thanks for your detailed answer :)

    I took a look at linest in Excel, and it seems to me that this method only calculates the error from the scatter of the points, but does not take into account the stated uncertainties of the points.

    I also looked at the book you pointed me to. It seems like exactly the info I need, though the math looks a little hard and would take me some time to figure out. From my first look it seemed that these formulas only work for uncertainties on y, and assume that x is always exact. I might be wrong though. In my case I need to calculate the slope uncertainty from both the x and y uncertainties.

    I have an idea though.

    What if I use the first and last x values in my dataset and, based on the uncertainties of x and y, calculate the slope of the "worst line" through these two points? Then I subtract that slope from the slope of the regression line through the dataset, and use the difference as my uncertainty.

    Something like this:

    Using excel I get a formula for the regression line which might be:
    y=10 * x

    From that I know that the slope for the best line (regression line) is 10.

    I estimate the uncertainty of x to be ± 2
    And the uncertainty of y to be ± 10

    So now I have:

    Δx = ± 2 uncertainty of x
    Δy = ± 10 uncertainty of y
    x1 = 33 first x in the data set
    x2 = 113 last x in the data set
    a1 = 10 slope of the linear regression line y = a1 * x
    y1 = 330 calculated values of the endpoints in the regression line
    y2 = 1130 from the equation y = a1 * x


    Now I assume that the worst line through these two points is the steepest line that still stays within the uncertainties of the two points.
    See the picture for a better explanation:

    [image: sketch of the regression line and the steepest line that stays within the error bars of the two endpoints]

    Now using the end points of the worst line ( X1 , Y1 ) and ( X2 , Y2 ) I calculate the slope of the worst line

    X1 = x1 + Δx = 35
    Y1 = y1 - Δy = 320

    X2 = x2 - Δx = 111
    Y2 = y2 + Δy = 1140

    So the slope for the "worst line" would be:

    a2 = ( Y2 - Y1 ) / ( X2 - X1 ) = 10.8

    Now subtracting a1 from a2 to get the difference of the slopes:

    a2 - a1 = 0.8

    Could I use that difference as the uncertainty of the slope, based on the uncertainty of the data set?

    So the slope of the regression line would be: 10 ± 0.8

    Would this work?
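For reference, the "worst line" arithmetic above can be checked with a short Python sketch (the numbers are the hypothetical values from this post, not the pressure-meter data):

```python
dx, dy = 2.0, 10.0    # assumed uncertainties of x and y
a1 = 10.0             # slope of the regression line y = a1 * x
x1, x2 = 33.0, 113.0  # first and last x in the data set
y1, y2 = a1 * x1, a1 * x2  # endpoints on the regression line

# Steepest line still compatible with the error bars of the endpoints:
# move the first point right and down, the last point left and up.
X1, Y1 = x1 + dx, y1 - dy
X2, Y2 = x2 - dx, y2 + dy
a2 = (Y2 - Y1) / (X2 - X1)

print(a2)       # ~10.79
print(a2 - a1)  # ~0.79, the proposed slope uncertainty
```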



    Really hope I got this right :)
    regards
    Frímann Kjerúlf
     
  5. I forgot to add that the correlation coefficient for the dataset is 0.999, and I would say that this method only works when the correlation coefficient is very close to 1
     
  6. dreamspy,

    It is clear that with the small number of points in the data set (4 points, including (0,0)), looking at the various lines that can be drawn gives you a few possible slopes. Therefore, you can easily give a range for the estimated slope.

    In addition, I now understand that your point (x,y)=(0,0) has no error on it.
    Therefore you are looking for a regression without constant term: y= a*x (and not y=a*x+b).

    In this case, you only need to calculate the slope based on each of your three data points, as well as the uncertainty on each of these slopes:

    s1 = y1/x1, standard deviation d1
    s2 = y2/x2, standard deviation d2
    s3 = y3/x3, standard deviation d3

    Above, d1 is given by the relation d1² = (dy1²*x1² + dx1²*y1²)/x1^4, assuming uncorrelated Gaussian distributions for x1 and y1.
    Similar formulas for d2 and d3.

    You can then calculate the most probable slope and the uncertainty on this most probable slope.
    In this most probable slope, each of the slope calculated from each given point will have a weight.
    This weight will be greater for the most precise evaluations.
    Therefore, point P3=(300,13) will be the most important.
    Probably the information provided by the points P1 and P2 will play a smaller role.

    You need to look in a statistics book how s1, d1, s2, d2, s3, and d3 can be combined to get the most probable estimate and its uncertainty: s and d.

    There could be a little more to look at in statistics.
    Indeed, it may be possible that s1 and d1 are in contradiction with s2 and d2, for example.
    This should not be the case with your data, but it can happen sometimes.
    Generally it is important to check whether different data are compatible.
    Look in the "variance analysis" chapter of a statistics book.

    Michel

    Postscriptum:

    I got these slopes and uncertainties from the three data points:

    point   slope   uncertainty
    P1      0.040   0.0108
    P2      0.035   0.0053
    P3      0.043   0.0036

    You can see that point 3 indeed provides the best data.
    You can also see that point 2 is nearly inconsistent with the other data, depending on the probability tolerance. Indeed, random errors have little chance of explaining such a large difference with point 3. To be checked.
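The per-point slopes and the usual inverse-variance weighted combination can be sketched as follows. The data and the error-propagation formula are from this thread; the combination rule (weights 1/d²) is the standard one for independent estimates, which is my assumption about what the statistics book would give:

```python
import math

pts = [(100.0, 4.0), (200.0, 7.0), (300.0, 13.0)]
dx, dy = 10.0, 1.0

slopes, sigmas = [], []
for xi, yi in pts:
    s = yi / xi
    # Propagation for s = y/x with uncorrelated Gaussian errors:
    # d^2 = (dy^2 * x^2 + dx^2 * y^2) / x^4
    d = math.sqrt((dy**2 * xi**2 + dx**2 * yi**2) / xi**4)
    slopes.append(s)
    sigmas.append(d)
    print(f"s = {s:.3f} +- {d:.4f}")

# Inverse-variance weighted mean and its uncertainty
w = [1.0 / d**2 for d in sigmas]
s_comb = sum(wi * si for wi, si in zip(w, slopes)) / sum(w)
d_comb = 1.0 / math.sqrt(sum(w))
print(f"combined: {s_comb:.4f} +- {d_comb:.4f}")
```

The per-point printout reproduces the table above, and the combined slope is dominated by point 3, as predicted.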
     
    Last edited: Feb 17, 2007
  7. Am. J. Phys. Paper on Uncertainty of Slope (Best Fit)

    This paper might be of interest to you about the uncertainty
    in slope after regression analysis has been performed.

    Michael J. Ruiz
    UNC-Asheville

    American Journal of Physics -- February 1991 -- Volume 59, Issue 2, pp. 184-185

    Uncertainty in the linear regression slope
    Jack Higbie
    Department of Physics, University of Queensland, Brisbane 4072, Australia

    (Received 12 December 1989; accepted 28 January 1990)

    ©1991 American Association of Physics Teachers

    doi:10.1119/1.16607
    PACS: 06.50.Mk, 02.60.Ed
     
  8. Thanks for your answer. Is this paper available online? I did a quick library search here in Iceland and didn't find a copy.

    regards
    frímann
     
  9. The Paper on Slope Uncertainty

    Hi,

    I would try first to see if your school library has a
    hard copy of the journal American Journal of Physics.
    Then, check if your library has a subscription to it -
    many schools do. If that does not work, then go
    to the journal web site, but you will have to pay a
    nominal fee to download it, I believe. It is a very short
    paper.

    The key formula is this: the uncertainty is
    sigma(slope) = |slope| * tan[arccos(R)] / sqrt(N - 2)
    where R is the correlation coefficient
    R = cov(x,y) / sqrt[var(x) * var(y)]

    and N - 2 is the number of degrees of
    freedom in the data, 2 having been lost to fitting the
    slope and intercept.
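In Python, applied to the three pressure/counter readings from earlier in the thread (N = 3, so only one degree of freedom; whether to also include (0,0) as a data point is a choice I leave out here):

```python
import math
from statistics import mean

x = [100.0, 200.0, 300.0]
y = [4.0, 7.0, 13.0]
N = len(x)

xb, yb = mean(x), mean(y)
sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
sxx = sum((xi - xb) ** 2 for xi in x)
syy = sum((yi - yb) ** 2 for yi in y)

slope = sxy / sxx               # ordinary least-squares slope
R = sxy / math.sqrt(sxx * syy)  # correlation coefficient

# Higbie's formula for the slope uncertainty
sigma_slope = abs(slope) * math.tan(math.acos(R)) / math.sqrt(N - 2)

print(slope, R, sigma_slope)
```

Note that this formula derives the uncertainty from the scatter of the points around the line, not from stated per-point error bars, so it answers a slightly different question than errors-in-both-coordinates fitting.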

    I am now studying this area of statistics - I am
    not an expert. I am still searching on the internet
    for an equivalent discussion and might find one. By
    the way, the paper refers to Mathews and Walker,
    Mathematical Methods of Physics, second edition, for some
    related analysis. I hope this helps.

    Mike
     