Let's say a student does a simple experiment where she conducts 10 trials at each x value (at each value of the independent variable). She collects data over 30 x values, giving her 300 total trials. For each of the 30 x values, she averages the 10 y values and calculates the standard deviation of those 10 y values. She makes a plot of average y vs. x in Excel and uses the standard deviations for y error bars. Assume the plot is linear. Assume there's no error/uncertainty in the individual x values. Perhaps also assume the individual y errors are all uniformly distributed and equal. I want this to be the simplest possible case.

Next, the student (a) adds a linear trendline in Excel, (b) has Excel calculate the slope of her line of best fit, and (c) has Excel calculate the standard error in the slope. I have three questions:

1. Why doesn't the standard error in the slope use the individual y uncertainties as inputs to its calculation?
2. What is the theoretical basis for using a standard error calculation that ignores the individual y measurement errors/uncertainties?
3. Does the standard error in the slope fail to represent something crucial about the uncertainty in the slope, given that the standard error calculation ignores the individual y measurement uncertainties?

Assume the student uses the equation from the top line (below) to calculate the slope of her best-fit line manually. She then uses the rules of uncertainty propagation to propagate the individual y measurement uncertainties through this equation, obtaining a value for the uncertainty in the slope. How would this value compare to the Excel-calculated standard error in the slope?
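For concreteness, here is a rough numerical sketch of the setup (Python, with made-up numbers; the slope and standard-error formulas are the standard OLS ones that I believe Excel's trendline/LINEST uses):

```python
# Hypothetical sketch of the experiment described above: 10 trials at each of
# 30 x values, equal y noise, then an OLS fit to the 30 averaged points.
# All numbers (true slope, intercept, sigma) are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_slope, true_intercept, sigma_y = 2.0, 1.0, 0.5

x = np.arange(1, 31, dtype=float)                    # 30 x values
trials = true_intercept + true_slope * x[:, None] \
         + rng.normal(0.0, sigma_y, size=(30, 10))   # 10 trials per x value
y_bar = trials.mean(axis=1)                          # averaged y at each x
y_sd = trials.std(axis=1, ddof=1)                    # per-point std devs (the error bars)

# OLS on the averaged points (what Excel's trendline does)
n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y_bar - y_bar.mean())) / sxx
intercept = y_bar.mean() - slope * x.mean()

residuals = y_bar - (intercept + slope * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # standard error of the regression
se_slope = s / np.sqrt(sxx)                          # standard error of the slope
```

Note that `y_sd` is computed but never enters the fit — which is exactly the behavior the questions above are about.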
There is a method that uses the uncertainties in the individual data points: a weighted least-squares fit. See here, for example, and scroll down to page 8 ("Weighted Least Squares Straight Line Fitting"): https://www.che.udel.edu/pdf/FittingData.pdf
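To make the linked method concrete, here is a minimal sketch of that weighted fit (Python; the data and sigmas are made up, and the formulas follow the usual ##w_i = 1/\sigma_i^2## weighted least-squares expressions as in the linked notes):

```python
# Minimal weighted least-squares straight-line fit, y = a + b*x,
# with weights w_i = 1/sigma_i^2. Data and sigmas are made up.
import numpy as np

def wls_line(x, y, sigma):
    """Weighted least-squares intercept a, slope b, and slope uncertainty sb."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(sigma, float) ** 2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx ** 2
    b = (S * Sxy - Sx * Sy) / delta        # slope
    a = (Sxx * Sy - Sx * Sxy) / delta      # intercept
    sb = np.sqrt(S / delta)                # slope uncertainty from the sigmas
    return a, b, sb

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
a, b, sb = wls_line(x, y, sigma=np.full(5, 0.1))
```

With equal sigmas, the slope and intercept reduce to the ordinary unweighted OLS values; only `sb` still depends on the assumed sigma.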
Thanks for the reply! As I understand it, a weighted least-squares fit is used only if the y errors (and x errors, if using an orthogonal regression) differ among the data points. Under the simplest circumstances, the individual y errors are all the same, and a weighted least-squares fit simplifies into a regular unweighted regression. What I'm interested in is the simplest case, where weighting is unnecessary.

Even in this simplest case, I think there is a theoretical reason for seemingly ignoring the individual y measurement uncertainty when calculating the standard error of the slope. I think the correct reason is this: if the individual y error/uncertainty is the same for all data points, then--across the entire data set--the y values will fluctuate within that fixed y uncertainty. The standard error of the regression, also called the standard error of the estimate, uses the residuals to estimate the average y error in the data set. Hence, the individual y errors are not ignored in the standard error of the regression. Rather, the standard error of the regression represents the average amount of y error in each measurement in either direction (i.e., not distinguishing between positive error, where the y measurement is above the true value, and negative error, where the y measurement is below it).

If this is correct, then it would answer question #1 from my original post, since I believe the standard error of the slope is calculated from the standard error of the regression. I believe questions #2 and #3 in my original post are still unanswered. Any guidance is much appreciated!
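For reference, the relations I mean (the standard OLS formulas, writing ##\hat{y}_i## for the fitted values and ##n## for the number of points) are:

$$s = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}}, \qquad SE_{slope} = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}$$

so the standard error of the slope is indeed computed from the standard error of the regression ##s##.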
Please also correct any improper conflation of the terms "residual" and "error" in my posts, along with any other incorrect usage of statistics terms.
I think the point is that in most situations you don't know the standard error of y but you have to estimate it from your data. Linear regression does exactly this.
Thanks! Many physicists use the standard error in the slope (which I believe is calculated from the standard error of the regression or the SEM) as the uncertainty in the slope. This practice is what I'm interested in, particularly since in simple manipulations of data the uncertainty is propagated through the operation. In regression, the approach of propagating uncertainty appears to be abandoned, despite the fact that an operation is being performed to calculate the slope from other values which have uncertainty. I'm trying to better understand the justification for abandoning the uncertainty propagation rules in favor of the standard deviation value, which is calculated from the residuals.

NOTE: If I stated previously that the individual y measurement errors are KNOWN, I misspoke. In my original example, I intend for the individual y standard deviations (uncertainties) to be known and for the individual y errors to be unknown prior to the regression.
Because Excel fits the slope to the inputs ##\{(x, \bar{y})\}## using a simple least squares method. The 'error bar' input is only used for drawing error bars on the chart; it is not used for any statistical analysis.

You are assuming that there is some quantifiable underlying concept of "uncertainty in the slope" independent of the statistical method that is used to estimate the slope - there isn't.

That's one form of the equation for OLS (Ordinary Least Squares) regression, which is what Excel uses.

It would differ by the difference between the method she has used and OLS, which is what Excel uses. Is there any statistical basis for her calculation?

If you want a more sophisticated analysis of your data, you need to use a more sophisticated tool than OLS - either weighted least squares or some better method.
Thanks so much MrAchovy! This is very helpful to me. I am unaware of the statistical basis for the rules of uncertainty propagation. Here are the rules I'm thinking of (pasted in from the Wikipedia article):

If I understand you correctly, the equation from my original post for calculating the slope is the version of OLS that Excel uses. According to that equation, the estimated slope of the best-fit line is a function of the individual x and y measurements.

Consider that the student used Excel to calculate the slope and the standard error of the slope. The student then used the uncertainty propagation rules pictured above to calculate the uncertainty in the slope based upon the uncertainty in the individual x and y measurements. How would the standard error of the slope (as reported by Excel) compare to the uncertainty in the slope (as calculated by the student using the error propagation rules)? If the two values are different, then my follow-up questions are: (a) why is there a discrepancy, and (b) which of the two values is better--the standard error of the slope or the uncertainty generated by the propagation rules cited above in this post?
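To make the comparison concrete, here is a sketch (Python, made-up data). Since the OLS slope is a linear combination ##b = \sum_i c_i y_i## with ##c_i = (x_i-\bar{x})/\sum_k (x_k-\bar{x})^2##, propagating an equal, known uncertainty ##\sigma## through it gives ##\sigma_b = \sigma/\sqrt{\sum_i (x_i-\bar{x})^2}##, which is the same formula as the OLS standard error except with the residual-based estimate ##s## in place of ##\sigma##:

```python
# Sketch of the comparison (made-up data; assumes no x uncertainty and an
# equal, known y uncertainty sigma, as in the original example).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.1, 4.1, 4.8, 6.1, 6.9])
sigma = 0.1                        # assumed known, equal y uncertainty

xc = x - x.mean()
sxx = np.sum(xc ** 2)
c = xc / sxx                       # slope b = sum(c_i * y_i), since sum(c_i) = 0
slope = np.sum(c * y)

# (a) propagating the known sigma through the slope formula:
#     sigma_b = sqrt(sum(c_i^2 * sigma^2)) = sigma / sqrt(sxx)
sigma_b_prop = np.sqrt(np.sum(c ** 2) * sigma ** 2)

# (b) OLS-style standard error of the slope: the same formula, but with the
#     residual-based standard error of the regression s in place of sigma
intercept = y.mean() - slope * x.mean()
resid = y - (intercept + slope * x)
s = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))
se_slope = s / np.sqrt(sxx)
```

So the two values differ only insofar as ##s## differs from the true ##\sigma##; with many points and a genuinely linear relationship they should be close.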
I don't think the independence conditions of that rule apply to the OLS calculation. Think about it: the more points you add, the more confident you should become in the fit (unless they are outliers), but in that equation the more points you add, the greater ##s_f## becomes. If you want to measure the goodness of fit to all the data, search for "linear regression goodness of fit"; I think an F-test is probably a better place to start than the regression coefficient.
How do you estimate the individual uncertainties? Probably as ##\sum_j(y_{ij}-\bar{y}_i)^2/(n_i-1)##. However, the ##\bar{y}_i## are not independent of each other, as they are bound to lie on the regression line. So you need to solve the regression equation first. Also, by assumption, the variances are equal, so you can get a better estimate by using the combined estimate from the linear regression.
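A quick sketch of the estimate in that formula (Python, simulated data; here 8 trials at each of 10 x values, with a true ##\sigma = 0.5##, all made up for illustration):

```python
# Per-point and pooled within-group variance estimates, as in the formula
# sum_j (y_ij - ybar_i)^2 / (n_i - 1). Simulated data: 8 trials at each
# of 10 x values, true sigma = 0.5 (so true variance = 0.25).
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 11, dtype=float)
trials = 1.0 + 2.0 * x[:, None] + rng.normal(0.0, 0.5, size=(10, 8))

y_bar = trials.mean(axis=1)
var_i = np.sum((trials - y_bar[:, None]) ** 2, axis=1) / (8 - 1)  # per x value
pooled_var = var_i.mean()   # combined estimate, since variances are assumed equal
```

The per-point estimates `var_i` scatter widely around the true variance with only 7 degrees of freedom each, which is one reason a combined estimate is preferable when the variances are assumed equal.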