Estimating error in slope of a regression line

1. Oct 29, 2007

Signifier

OK, I have a question I have no idea how to answer (and all my awful undergrad stats books are useless on the matter). Say I make a number of pairs of measurements (x,y). I plot the data, and it looks strongly positively correlated. I do a linear regression and get an equation for a line of best fit, say y = 0.3x + 0.1 or something. The Pearson coefficient is very close to one, i.e., 0.9995 or so.

Now, say that the quantity I am interested in is the slope of this line, that is (for the above equation) 0.3. I take all my measurements, get the line of best fit, find its slope, and the slope is something I want. For example, with the photoelectric effect, maybe I measure stopping potential vs. frequency of light; the slope can be related to Planck's constant. Or something similar.

The question I have is: how do I estimate the error (uncertainty) in this slope value I get? My professor said to use the "standard deviation in the slope," which doesn't sound sensible to me. I thought to myself: well, maybe it has to do with using the uncertainty in x and the uncertainty in y. But how would you combine these uncertainties to find the uncertainty in dy/dx?

How does one estimate the error range for a parameter obtained from the slope of a line of best fit on a set of (x,y) data?

Thank you so much, this one seems really important and I'm a bit disturbed I haven't the slightest idea what to do.

2. Oct 29, 2007

EnumaElish

The standard assumption is that there is no uncertainty in x. y is the random variable.

If you run a regression in Excel (or any other more sophisticated statistics package) it will display the standard errors for both parameters.

3. Oct 29, 2007

Signifier

OK, that's a good assumption in my case. It would mean that the uncertainty in the slope is equal to the uncertainty in y, right?

Unfortunately I don't have Excel and I'm doing this all by hand, heh. How do I calculate the standard errors for both parameters by hand?

4. Nov 2, 2007

EnumaElish

I'll assume some familiarity with the linear algebra notation.

The estimated parameter vector is $\hat \beta = (X'X)^{-1}X'y$ where X = [1 x] is the n x 2 data matrix.

Substitute $X\beta + \epsilon$ for y.

Calculate $Var\left[{\hat \beta}^2\right]=Var\left[(\beta+(X'X)^{-1}X'\epsilon)^2\right]$.

Last edited: Nov 3, 2007
5. Nov 8, 2007

EnumaElish

EDIT: The last line of the last post should have been:

Calculate $Var\left[{\hat \beta}\right]=Var\left[\beta+(X'X)^{-1}X'\epsilon\right]$.
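For readers who want to see this matrix algebra in action, here is a minimal Python/NumPy sketch. The data below are made up for illustration (true slope 0.3, intercept 0.1, Gaussian noise); nothing here comes from an actual measurement in this thread:

```python
import numpy as np

# Hypothetical data: x is fixed, y = 0.3*x + 0.1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 20)
y = 0.3 * x + 0.1 + rng.normal(scale=0.05, size=x.size)

# Design matrix X = [1 x], so beta_hat = (X'X)^{-1} X'y.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual variance estimate (n - 2 degrees of freedom), then
# Var[beta_hat] = sigma^2 (X'X)^{-1}; the square roots of its
# diagonal are the standard errors of the intercept and slope.
resid = y - X @ beta_hat
sigma2 = resid @ resid / (len(y) - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
se_intercept, se_slope = np.sqrt(np.diag(cov_beta))
print(beta_hat, se_intercept, se_slope)
```

These are the same standard errors a statistics package would report for the two parameters.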

6. Feb 15, 2010

mdmann00

If the dependent or independent variables in your regression have error bars, then one way to estimate the error of the slope (and intercept) is via a Monte Carlo simulation. Let's say you are doing a linear fit of 10 experimental points. The 10 x-values each have some standard deviation, and the 10 y-values each have some standard deviation. Rather than doing a single linear regression, you do many regressions in which you create simulated data where the experimental points have a Gaussian distribution about their nominal x- and y-values according to the standard deviations. Let's say you generate 100 sets of 10 experimental points. You would then get 100 different linear regression results (100 slopes and 100 intercepts). You can then calculate the standard deviations of these slopes and intercepts to give you an estimate of their errors that takes into account the measurement errors on the experimental points. You may have to do more than 100 simulations. I usually vary this number to see where I get very little change in the answer.

It can be computationally intensive. I've been told there are other ways to do this, but I don't know what they are. If you try to blindly apply simple error propagation techniques, you will get absurd numbers, so don't try that.
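The simulation described above can be sketched as follows. The data points and the error bars are hypothetical, and the fit uses NumPy's polyfit rather than any particular package mentioned in the thread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measured points with per-point standard deviations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = 0.3 * x + 0.1 + rng.normal(scale=0.05, size=x.size)
sx = np.full_like(x, 0.05)   # assumed x uncertainties
sy = np.full_like(y, 0.05)   # assumed y uncertainties

n_sim = 2000
slopes = np.empty(n_sim)
intercepts = np.empty(n_sim)
for i in range(n_sim):
    # Jitter each point by a Gaussian draw matching its error bar,
    # then refit; polyfit(..., 1) returns [slope, intercept].
    xs = rng.normal(x, sx)
    ys = rng.normal(y, sy)
    slopes[i], intercepts[i] = np.polyfit(xs, ys, 1)

# The spread of the refit parameters estimates their uncertainties.
slope_err = slopes.std(ddof=1)
intercept_err = intercepts.std(ddof=1)
```

As the post notes, it is worth increasing n_sim until slope_err stops changing appreciably.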

7. Feb 15, 2010

In simple linear regression the standard deviation of the slope can be estimated as

$$\sqrt{\frac{\frac 1 {n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (x_i - \overline x)^2}}$$
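This formula is easy to evaluate by hand or in code. Here is a small Python/NumPy sketch with made-up data, computing the slope, the fitted values $\hat{y}_i$, and the standard error exactly as written above:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1.0, 13.0)               # hypothetical predictor values
y = 0.3 * x + 0.1 + rng.normal(scale=0.1, size=x.size)

# Least-squares fit: slope = Sxy/Sxx, intercept from the means.
xbar, ybar = x.mean(), y.mean()
slope = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()
intercept = ybar - slope * xbar
y_hat = intercept + slope * x          # fitted values, the y-hat terms

# Standard error of the slope, term by term as in the formula above.
n = len(x)
se_slope = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2)
                   / ((x - xbar) ** 2).sum())
```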

In comparison to post 6: rather than regenerating random data each time, you can carry out a bootstrap simulation using your original data, and obtain an estimate of the distribution of the slope. You can carry out the work for fixed or random predictors (slightly different setups in the calculations).

However, you'd either have to write the code yourself to use it in Excel, or get some software with real statistics capability: R (or S), SAS, or Minitab (little work is required with Minitab, too).

8. Feb 15, 2010

mdmann00

Aloha statdad! Thanks for your reply. Can you say some more about your bootstrap simulation approach? I don't quite understand it, and I am looking for a simpler way to estimate slope and intercept error bars than Monte Carlo, if such exists. In particular: 1) what is the y-hat term? and 2) I'm not seeing how the error estimates in your x- and y-values are taken into account in this approach (perhaps the y-hat term does this, but what about x? x-bar does not encapsulate measurement error in x). I have access to IDL and the Advanced Math and Statistics package, but that doesn't really help if I can't figure out how to utilize that functionality properly. I get Monte Carlo, though it is decidedly brute force. Any clarifying points you can provide would be much appreciated.

9. Feb 16, 2010

Mapes

The bootstrap approach is itself a Monte Carlo technique. It involves resampling your n data points over and over with replacement. Each time, you recalculate the slope of the best-fit line, building up a long list of slopes. The standard deviation of the list, multiplied by $\sqrt{n/(n-1)}$, is an estimator for the standard error of the original slope.

For example, if your data points are (1,10), (2,9), (3,7), (4,6), a few bootstrap samples (where you sample with replacement) might be (2,9), (2,9), (3,7), (4,6) (i.e., the first data point was omitted and the second picked twice), or (1,10), (3,7), (3,7), (3,7), or (1,10), (2,9), (3,7), (4,6). You do this a lot of times (perhaps thousands), fitting a slope to each sample, until the standard deviation of the slopes has converged to your desired accuracy. A caveat: the bootstrap technique works better with a larger original data set. Four points wouldn't cut it.
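The pair-resampling procedure just described can be sketched in Python (NumPy). The data set here is hypothetical, and deliberately larger than four points per the caveat:

```python
import numpy as np

rng = np.random.default_rng(3)

# A hypothetical data set (true slope 0.3, intercept 0.1, noise).
x = np.linspace(0.0, 10.0, 30)
y = 0.3 * x + 0.1 + rng.normal(scale=0.2, size=x.size)

n = len(x)
n_boot = 2000
slopes = np.empty(n_boot)
for b in range(n_boot):
    # Resample the (x, y) pairs with replacement and refit the line.
    idx = rng.integers(0, n, size=n)
    slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]

# The standard deviation of the bootstrap slopes estimates the
# standard error of the slope fitted to the original data.
se_slope_boot = slopes.std(ddof=1)
```

Increase n_boot until se_slope_boot converges to your desired accuracy.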

(Sorry to butt in here, statdad, but I discovered this technique last year and have been using it often in my own research and excitedly telling my colleagues about its usefulness in handling non-Gaussian data. Please let me know if I've made any errors in this explanation.)

10. Feb 16, 2010

mdmann00

Hmmm...very interesting, Mapes. Thanks for the response! I'm curious, though, it seems this approach would potentially overestimate the error in the slope by a fair amount, since replacing the point (2,9) with the point (3,7) may greatly exceed the actual error in the measurement of the point (2,9). Are there any general rules for how one does this replacement to minimize the chance of gross overestimation? Also, is there a formal name for this approach, such that I can try to find some references to read up on the technique? I don't want to keep bothering you guys when I can get answers on my own, but I don't know where to look for something like this.

11. Feb 16, 2010

Mapes

But that's the meaning of standard error of the slope; when taking data, you might just as well have measured (3,7) instead of (2,9). Lacking additional data, the bootstrap approach simulates additional data by sampling existing data. It might be helpful to try an example with normally distributed data and check that it matches analytical results from equations that assume a Gaussian distribution.

Chernick's Bootstrap Methods: A Practitioner's Guide is very clear.

12. Feb 16, 2010

mdmann00

OK, so if I understood you correctly, you're saying that *if* you don't have data to suggest what the actual x and y measurement errors are, this technique allows you to get *some* kind of estimate of the regression errors from the available data. If so, that makes sense.

I will take your advice and see how the bootstrap technique, in the absence of error data, compares with a Monte Carlo simulation *with* error data.

Thanks for the reference and the help! It is much appreciated.

13. Feb 16, 2010

Mapes

Exactly - good luck!

14. May 2, 2010

d3t3rt

A good reference for bootstrapping is Efron & Tibshirani (1993) An Introduction to the Bootstrap. Efron introduced the bootstrap. That said, I wish to address the inappropriateness of using a bootstrap to find the standard error of the slope and intercept of a simple linear regression.

The bootstrap is a sophisticated statistical procedure that is frequently used when one wishes to understand the variability and distributional form of some function (e.g., nonlinear combination) of sample estimates. Usually, this function of estimates has an unknown density. The slope and intercept of a simple linear regression have known distributions, and closed forms of their standard errors exist. Therefore, why complicate estimates of standard errors? If one were fitting a Bayesian model, then I could understand the use of MCMC methods. I highly doubt this is the case.

Also, inferences for the slope and intercept of a simple linear regression are robust to violations of normality. Unless the histogram of residuals evidences a strong departure from Normality, I would not be concerned with non-Normal errors. I would be more concerned about homogeneous (equal) variances.

If people lack software to compute standard errors of LS-regression estimates, I recommend using R. It is freeware available at www.r-project.org. This is not a point-and-click interface, but there is sufficient documentation to guide new users. The function lm() should be used for a linear regression. As a statistician, I despise the use of Excel for any statistical analysis!

15. May 2, 2010

mdmann00

Aloha d3t3rt,

If closed forms of the standard errors in linear regression exist, are these not what are used to estimate the standard errors of the slope and intercept in Excel? And if so, why should one not use that tool to do the calculation?

Thanks for the second reference.

16. May 2, 2010

d3t3rt

Here is a website outlining many of Excel's shortcomings:

http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf

I am very suspicious of the algorithms that Excel uses to calculate statistics. Very simple statistical summaries have been calculated incorrectly by Excel (e.g., the sample standard deviation).

With respect to computer estimation of b0 and b1, statistics programs usually calculate these through an iterative algorithm. For example, when estimating the mean of a Normally distributed random variable, the maximum likelihood estimate is the sample mean; a computer may nevertheless calculate this estimate with an iterative algorithm like Newton-Raphson or a golden-section search. Generally, there is a one-to-one correspondence between the computer estimates of standard errors and their "brute-force" hand calculations.

Last edited by a moderator: May 4, 2017
17. May 3, 2010

statdad

"Also, inferences for the slope and intercept of a simple linear regression are robust to violations of normality. Unless the histogram of residuals evidences a strong departure from Normality, I would not be concerned with non-Normal errors. I would be more concerned about homogeneous (equal) variances."

The inferences are not robust to violations of normality - that fact is one of the reasons for the development of non-parametric and robust methods. Histograms themselves can be misleading: the shape is easily influenced by the number of bins, for instance, and for small sample sizes histograms are virtually worthless, so their use in outlier detection is minimal - even if you do graph the residuals.

Further, because high leverage points can control the entire fit, they will not be detected as outliers, since they do not have large residuals. Graph the data and residuals several ways, not just the quickest way.

"The slope and intercept of a simple linear regression have known distributions, and closed forms of their standard errors exist."

These distributions are exact only when normality applies perfectly (which is never), and are convenient asymptotic descriptions otherwise. Using them when data are significantly non-normal isn't a good idea.

"I would be more concerned about homogeneous (equal) variances."
I wouldn't say more concerned, but of equal concern.

"The bootstrap is a sophisticated statistical procedure that is frequently used when one wishes to understand the variability and distributional form of some function (e.g., nonlinear combination) of sample estimates."
It can be used with non-linear statistics, but it is not limited to them, and it works very well with regression.

"As a statistician, I despise the use of Excel for any statistical analysis!"
Best point. It was a long struggle at our school to convince the business group to dump Excel for its courses. Many years ago I was optimistic that the group inside Microsoft responsible for Excel would address the complaints; I gave up that hope not long after I formed it.

18. May 3, 2010

d3t3rt

Statdad, thank you for fixing my statement about known standard errors and distributional forms for the sample slope and intercept.

19. Sep 3, 2010

Salish99

You can find the formula in most statistics texts. $\hat y_i$ is the $i$th predicted value of $y$.