Simple overestimate of slope uncertainty in regression

In summary, a high-school teacher asks whether a simple shortcut can give students an overestimate of the uncertainty on a regression slope, using the uncertainties of the individual data points: take the largest relative uncertainty delta_y_i/y_i among the points, divide by sqrt(N-2), and use the result as the relative uncertainty on the slope from the linear regression.
  • #1
Bjarke Nicolaisen
Hi all,

I am a science educator in high school. I have been thinking about how to make a simple estimate that 1st and maybe 2nd year students can follow for the propagation of error to the uncertainty of the slope in linear regression. The problem is typically that they make some measurements (x_i, y_i) of, say, time and distance, and then use linear regression to find the slope. For the uncertainty of the slope they can use the standard inferred empirical one, which assumes constant and equal variance in the y-data, etc. But this estimate does not rely at all on the measured uncertainty on each data point, delta_y_i. It only uses the scatter of the data around the regression line. Which is of course fine, but if I want to teach the students to propagate their errors on each y_i, it gets very messy. So I thought maybe the following estimate would be an idea, to start out with:

- Find the relative uncertainty on each data point, delta_y_i/y_i. Take the maximal value (overestimate).
- Divide by sqrt(N-2) for linear regression, or sqrt(N - #parameters) in general.
- This relative uncertainty is then a simple overestimate of the relative uncertainty on the result, in my example the slope of the linear regression.

I am more than a little unsure whether this always ensures an overestimate. It does not need to be bulletproof; it is more to give the students an awareness of uncertainty propagation. I also know that we are mixing two different methods of linear analysis here: one where we assume no knowledge of the individual y_i errors (the empirical approach), and one where we know the errors. But still, I find it an interesting idea. What do you guys think? Maybe you have a similar but more correct estimate, or simply an opinion.
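To make the comparison concrete, here is a minimal sketch of what I have in mind (Python with NumPy; the data points and the per-point uncertainties are made up purely for illustration). It computes both the standard empirical slope uncertainty and my proposed overestimate:

```python
import numpy as np

# Hypothetical measurements of time (s) and distance (m), with guessed
# per-point uncertainties delta_y -- all numbers made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
delta_y = np.array([0.2, 0.2, 0.3, 0.3, 0.4])
N = len(x)

# Ordinary least-squares fit of y = a*x + b.
a, b = np.polyfit(x, y, 1)

# Standard empirical slope uncertainty (assumes equal variance in y):
# s_a = sqrt( sum(residuals^2) / (N - 2) / sum((x - xbar)^2) )
residuals = y - (a * x + b)
s_a = np.sqrt(np.sum(residuals**2) / (N - 2) / np.sum((x - x.mean())**2))

# Proposed shortcut: largest relative uncertainty among the data points,
# divided by sqrt(N - 2), read as a relative uncertainty on the slope.
rel_max = np.max(delta_y / np.abs(y))
proposed_rel = rel_max / np.sqrt(N - 2)

print(f"slope a = {a:.3f}")
print(f"empirical: delta_a = {s_a:.3f} ({s_a / abs(a):.1%} relative)")
print(f"proposed:  delta_a = {proposed_rel * abs(a):.3f} ({proposed_rel:.1%} relative)")
```

Trying it on a few made-up data sets at least shows quickly whether the shortcut lands above or below the empirical number, which is really what I am unsure about.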
 
Last edited:
  • #2
There is no such thing as a perfect error analysis. Searching for one is a fool's errand.

These students are 14 or 15 years old. They will want to write down 40.00002761 cm because "that's what the calculator said". Getting them past that is your first step.

A proper error analysis requires calculus. They haven't taken that, so it is now doubly a fool's errand.

If you got them to draw a graph and to estimate what values of the slope and intercept are reasonably consistent with the data, I'd call that a win. I think the hardest thing will be to get them to draw a graph at all, and if they do, to make it larger than a postage stamp. If they fuss over the meaning of consistent, tell them "a line that's consistent goes through 2/3 to 3/4 of the data points".

If you could get them to do this much, I'd take the win.
 
  • Like
Likes jim mcnamara and Bjarke Nicolaisen
  • #3
Vanadium 50 said:
There is no such thing as a perfect error analysis. Searching for one is a fool's errand.

These students are 14 or 15 years old. They will want to write down 40.00002761 cm because "that's what the calculator said". Getting them past that is your first step.

A proper error analysis requires calculus. They haven't taken that, so it is now doubly a fool's errand.

If you got them to draw a graph and to estimate what values of the slope and intercept are reasonably consistent with the data, I'd call that a win. I think the hardest thing will be to get them to draw a graph at all, and if they do, to make it larger than a postage stamp. If they fuss over the meaning of consistent, tell them "a line that's consistent goes through 2/3 to 3/4 of the data points".

If you could get them to do this much, I'd take the win.
hehe, I am certainly a fool, no doubt about that! But I don't think it is out of the question to let them think about stuff like this. Students in my country range from 15 to 20 years of age between the beginning and the end of high school.

My proposed estimate should work even though they haven't yet had calculus; that is the point of it. I understand you can make the point that having them even think about uncertainties in high school is too much (it is not required syllabus). But my experience is that a lot of experimental work lacks meaning without it. You have measured the gravitational acceleration, good, now what is your conclusion? That g is really 9.953496 m/s^2? Or that your experiment "went well" because your result is close to the accepted value? I think this teaches students a completely wrong attitude towards experiments. It first becomes interesting (in my eyes) when you have an estimate of uncertainty to go with your result. Then you can discuss what caused the uncertainty, how to improve the design of the experiment to minimize uncertainties and errors, and you can say "we measured g = 9.9 +- 0.6 m/s^2", and then discuss whether accepted values are within the uncertainty interval or not, and why that may be. If you have a testable hypothesis going into the experiment, say, "g is equal to 9.82 +- 0.02 m/s^2", how can you even draw a conclusion about your hypothesis without an idea of the uncertainty of your result?

But OK, I am not blind to the fact that more fundamental problems probably take precedence when teaching these young students. I maybe just choose not to look too closely :)
 
  • #4
Do the students have access to computers? You could have students repeatedly add random noise to the data and fit a line to the noisy data points to see how the fit parameters fluctuate. It would be straightforward to do using a spreadsheet, and I think easy for students to get the basic idea of what's going on.
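For example, something along these lines (a minimal Python sketch with made-up numbers; a spreadsheet version would follow the same steps, with a column of random noise and the built-in SLOPE function):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "measured" data and an assumed per-point noise level
# (all numbers made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma_y = 0.3

# Repeatedly perturb the y-values with random noise and refit a line,
# collecting the fitted slopes.
slopes = []
for _ in range(1000):
    y_noisy = y + rng.normal(0.0, sigma_y, size=y.shape)
    slope, intercept = np.polyfit(x, y_noisy, 1)
    slopes.append(slope)

slopes = np.array(slopes)
print(f"mean slope = {slopes.mean():.3f}, spread (std) = {slopes.std():.3f}")
```

The spread of the fitted slopes then answers "how uncertain is the slope?" directly, without any formula.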
 
  • Like
Likes Bjarke Nicolaisen and berkeman
  • #5
vela said:
Do the students have access to computers? You could have students repeatedly add random noise to the data and fit a line to the noisy data points to see how the fit parameters fluctuate. It would be straightforward to do using a spreadsheet, and I think easy for students to get the basic idea of what's going on.

Good idea! Thanks for the suggestion, might be a fun exercise.
 
  • #6
I do not like replacing real measurements with computer simulations. I think that this can be valuable as a demonstrator or a motivator, but there should be a firewall between what is measured and what is simulated.

As far as "doing it right" goes, one can ask how well the "pros" do. It is known that the true value (i.e., the combination of later, more accurate measurements) is within 1 standard deviation of the quoted early measurements more often than 68% of the time. (An exception is rate measurements from first observations, which tend to be high.) However, the fraction of the time that results are highly discrepant is also higher than Gaussian errors would suggest.

In short, if the "pros" don't get the right answer, how can we expect teenagers to? You can't teach them the correct recipe because the correct recipe does not work. It is a mistake to treat error analysis as anything beyond a reasonable estimate.
 
  • #7
Vanadium 50 said:
I do not like replacing real measurements with computer simulations. I think that this can be valuable as a demonstrator or a motivator, but there should be a firewall between what is measured and what is simulated.

I think that for the average student, simulating 1000 data draws and seeing what the distribution of betas looks like would be a great educational exercise. I don't understand this firewall thing you're complaining about.
 
  • #8
Vanadium 50 said:
I do not like replacing real measurements with computer simulations. I think that this can be valuable as a demonstrator or a motivator, but there should be a firewall between what is measured and what is simulated.

As far as "doing it right" goes, one can ask how well the "pros" do. It is known that the true value (i.e., the combination of later, more accurate measurements) is within 1 standard deviation of the quoted early measurements more often than 68% of the time. (An exception is rate measurements from first observations, which tend to be high.) However, the fraction of the time that results are highly discrepant is also higher than Gaussian errors would suggest.

In short, if the "pros" don't get the right answer, how can we expect teenagers to? You can't teach them the correct recipe because the correct recipe does not work. It is a mistake to treat error analysis as anything beyond a reasonable estimate.
Interesting point about the 1 standard deviation "problem"; I have not heard about this before and am surprised. Care to give a reference?

I think we are maybe in some agreement on teaching these teenagers about uncertainties. My suggestion is indeed an attempt at a "reasonable estimate" of error propagation, one that has some of the trademark behaviour of accepted methods, like the 1/sqrt(number of data points) behaviour, but is not rigorously correct. Although you seem to think this approach is already too demanding, I am not so sure of that - but I might change my opinion after exposing my students to it :)

Your last paragraph seems highly controversial to me. Say there really is this problem that you mention with error analysis in science, and "the correct recipe does not work". Surely the accepted methods are still our best choice of practice? To what extent don't they work? It seems to me like your argument boils down to "error analysis is flawed anyway, so we might as well just estimate uncertainties loosely/however we want". But what does showing the existence of the Higgs boson to 6 sigma then mean, or how can we claim to know so many digits of G, or...?
In my experience in educational practice we often resort to using simplified statistical models for our experiments (such as using results from linear regression assuming zero uncertainty in x), but that is of course not a problem with the statistical methods, rather a problem for us in interpreting our results. To say that there is an intrinsic problem with statistical methods in practice is mind-boggling to me, and implies that the precision quoted on scientific results such as the mass or existence of the Higgs boson is quantitatively meaningless.
 
  • #9
Simulation is a good thing. Measurements are a good thing. But they aren't the same good thing.

Mixing the two is apt to be confusing, or worse, especially for beginners.

One unfortunate trend is that of replacing (the proponents would say "supplementing") real labs with computer simulations. It's cheaper and easier. But it's not real.

99.99% of the students will never become physicists. But 100% will need to understand the difference between what is real and what is simulated.
 
  • Like
Likes gleem and hutchphd
  • #10
Bjarke Nicolaisen said:
Interesting point about the 1 standard deviation "problem", I have not heard about this before and am surprised. Care to reference?
The study I have in mind was done as a meta-analysis by a pair of Harvard (IIRC) grad students. It's on the arXiv, but I am not in a position where I can easily hunt it down. I found it annoying, because the papers they reference admit as much; that's what "conservative" errors mean: the authors are confident that the quoted errors are more likely too big than too small.

I think the more interesting dataset is the Particle Data Group's plots of various quantities over time. Lots of structure, and lots of different things going on, but what is seldom seen is a gradually improving 1 sigma walk towards the accepted value. Sure, it happens, but it is more the exception than the rule.

To restate where I am coming from: I think there is value in students understanding uncertainties. I think there is value in distinguishing a 10% from a 20% uncertainty. I think there is little value in distinguishing a 10% from an 11% uncertainty - our experience shows that we aren't very good at this, and usually there is no meaning in this difference.
 
  • Like
Likes Bjarke Nicolaisen
  • #11
I think a reasonable set of things for 14 year olds to do is:
1. Know to combine independent errors by adding in quadrature.
2. Recognize that the relative error of the difference of two numbers can be large.
3. Understand that the relative error in x^2 is twice the relative error in x (points 1 and 3 are easy to check numerically; see the sketch after this post).
4. Plot points and their errors and draw a line going through most of the points. Estimate the slope and intercept uncertainties by seeing how much the slope and intercept can change and still have the line mostly go through the points.
5. Identify the most important source(s) of uncertainty in a given measurement.

As a bonus, I would hope they know why writing down 100 +/- 51 is highly unlikely to be correct.

I do not think they need to do a full error analysis, even if they had the tools (which they don't), or that there is value in trying to get the One True Estimate of Uncertainty. I think the time it would take to do this well is far better spent on other things.
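If you want a quick numerical sanity check of points 1 and 3 above (as a demonstration for yourself, not something the students would write), here is a short Python sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # number of simulated repetitions

# Point 1: independent errors add in quadrature.
# If A has uncertainty 0.3 and B has uncertainty 0.4, the uncertainty
# of A + B should come out near sqrt(0.3**2 + 0.4**2) = 0.5.
A = rng.normal(10.0, 0.3, n)
B = rng.normal(20.0, 0.4, n)
print("std of A + B:", (A + B).std(), " expected:", np.hypot(0.3, 0.4))

# Point 3: the relative error of x^2 is twice the relative error of x.
# A 1% spread in x should give roughly a 2% spread in x^2.
x = rng.normal(5.0, 0.05, n)  # 1% relative uncertainty
rel_x = x.std() / x.mean()
rel_x2 = (x**2).std() / (x**2).mean()
print("relative error of x:", rel_x, " of x^2:", rel_x2)
```

With that many samples the printed values come out close to 0.5 and to twice the relative error in x, up to simulation noise.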
 
Last edited:
  • Like
Likes Bjarke Nicolaisen
  • #12
Even more fundamental is to make them realize that a number without an estimate of uncertainty is absolutely meaningless. The details after that are important but secondary.
 
  • Like
Likes Bjarke Nicolaisen, gleem and jim mcnamara

