Scaling covariance for an error calculation

AI Thread Summary
The discussion revolves around the use of the scale_covar parameter in fitting programs like Python's lmfit and scipy for error calculations in linear fits. Setting scale_covar to True adjusts the covariance matrix to achieve a chi-square of 1, while setting it to False allows the errors on parameters to vary with changes in y errors. The user is uncertain whether to report larger errors from scale_covar=True or smaller ones from scale_covar=False, especially given their confidence in the data and fit quality. It is suggested that if the chi-square value is significantly high, it may indicate issues with the data or model, and using scaled uncertainties should be approached cautiously. Ultimately, if the errors are deemed too small relative to the data spread, a conservative approach is to report the larger uncertainties.
Malamala
Hello! I just discovered (maybe a bit late) that most fitting programs (Python's lmfit or scipy, for example) have a parameter, on by default, that scales the covariance matrix when calculating the errors (usually called scale_covar or something similar). After some reading I figured out (hopefully correctly) that turning that parameter on (scale_covar=True) basically means adjusting the errors on the data until the reduced chi-square is 1, and reporting the errors on the parameters using these adjusted values. I have noticed that, as a result, if you scale all the y errors by the same amount, the errors on the fit parameters don't change. On the other hand, if I set the parameter off (scale_covar=False), scaling the errors on y changes the errors on the fit parameters too.
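This behavior can be checked with scipy's curve_fit, whose absolute_sigma flag plays the inverse role of lmfit's scale_covar (absolute_sigma=True corresponds to scale_covar=False). A minimal sketch with made-up data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy data (made up for illustration): roughly y = 2x with small errors
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])
yerr = np.array([0.2, 0.2, 0.3, 0.2, 0.3])

def line(x, intercept, slope):
    return intercept + slope * x

def param_errors(sigma, absolute_sigma):
    """1-sigma parameter errors from a weighted linear fit."""
    _, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=absolute_sigma)
    return np.sqrt(np.diag(pcov))

# Scaled covariance (absolute_sigma=False, like lmfit scale_covar=True):
# multiplying every y error by 3 leaves the parameter errors unchanged.
scaled_1 = param_errors(yerr, absolute_sigma=False)
scaled_3 = param_errors(3 * yerr, absolute_sigma=False)

# Unscaled covariance (absolute_sigma=True, like scale_covar=False):
# the parameter errors track the y errors and triple with them.
abs_1 = param_errors(yerr, absolute_sigma=True)
abs_3 = param_errors(3 * yerr, absolute_sigma=True)
```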

In my case I need to do a linear fit to some data. If I use scale_covar=True (which is the default) I get something around (ignoring decimals) ##25 \pm 1##. If I set it to False I get half the error: ##25 \pm 0.5##. I am quite confident about the errors on my points, and the fit looks good. Which value should I report? And in general, for any fit, when should I set this parameter to True and when to False?

Lastly, I don't really remember reading any experimental physics paper where they actually say which of these two methods they use to get the errors from a fit. They just state the errors on the parameters. Is there a generally agreed-upon way of doing this, such that everyone sets that parameter (in their fitting program) to True or False? And if so, what is the convention?

In short: if I were to publish my data in a journal (say PRL) without discussing this covariance scaling (as no one seems to), should I use (in my linear fit) 0.5 or 1 for the error?

Thank you!
 
If that scaling is significant something went wrong with your uncertainties.
The particle data group scales up the uncertainty if the measurements are incompatible (not sure what exactly their threshold is) and points out that it did so in the few cases where this is necessary.
 
mfb said:
If that scaling is significant something went wrong with your uncertainties.
The particle data group scales up the uncertainty if the measurements are incompatible (not sure what exactly their threshold is) and points out that it did so in the few cases where this is necessary.
Thank you for your reply! So should I use scaling when I am not sure that my errors are correct? In my case, though, I am not sure what would be incompatible with what: I just have 5 data points with errors on them, and I want to fit a straight line to them. I don't have two sets of measurements to compare, so I can't say that something is incompatible. But regardless of that, could you please explain when one should use scaling and when not? I am not sure I understand that. Thank you!
 
Maybe a straight line is a bad assumption or you underestimate your uncertainties on these points.
If in doubt be conservative and give the larger uncertainty.
 
mfb said:
Maybe a straight line is a bad assumption or you underestimate your uncertainties on these points.
If in doubt be conservative and give the larger uncertainty.
But I am still not sure I understand when I should use scaling and when not. Could you explain that a bit (or point me towards some readings)? Thank you!
 
If it is significant you should first try to find the errors in your data points or your model to describe the data points. You might fit nonsense. If you do not then the scaling shouldn't be a factor 2.
Check Anscombe's quartet. 4 datasets that all give the same straight line fit - but in three cases that fit is clearly the wrong approach. Here are more examples.

If you can't find any reason why your ##\chi^2/ndf## is so bad then it might be acceptable to use the scaled up uncertainties to get something, but you should discuss this explicitly because it means something went wrong somewhere.
 
mfb said:
If it is significant you should first try to find the errors in your data points or your model to describe the data points. You might fit nonsense. If you do not then the scaling shouldn't be a factor 2.
Check Anscombe's quartet. 4 datasets that all give the same straight line fit - but in three cases that fit is clearly the wrong approach. Here are more examples.

If you can't find any reason why your ##\chi^2/ndf## is so bad then it might be acceptable to use the scaled up uncertainties to get something, but you should discuss this explicitly because it means something went wrong somewhere.
Thank you again for your reply! Here is my actual data that I am trying to fit:
##x = [-0.312, -0.217, -0.081, 0., 0.211]##
##y = [-8.050, -5.278 , -3.510, 0., 5.521]##
##y_{err} = [0.121, 0.218, 0.421, 0.115, 0.305]##

I also attached the plot with the fit I get. (I am using the lmfit package in python). When I use scaled covariance I am getting these parameters for the fit:
line1intercept: -0.02108874 +/- 0.20201635
line1slope: 25.6603677 +/- 0.95263756
when I don't use scaled covariance I am getting this:
line1intercept: -0.02108874 +/- 0.09520060
line1slope: 25.6603677 +/- 0.44893231

The reduced chi squared is 4.50291419. Is there something I am doing wrong? To me the fit looks pretty good (and there are some theoretical motivations for a straight fit, too). Thank you, again, for the help.
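For reference, these numbers can be reproduced without lmfit; a sketch using scipy's curve_fit on the data above (absolute_sigma=True corresponds to lmfit's scale_covar=False):

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([-0.312, -0.217, -0.081, 0.0, 0.211])
y = np.array([-8.050, -5.278, -3.510, 0.0, 5.521])
yerr = np.array([0.121, 0.218, 0.421, 0.115, 0.305])

def line(x, intercept, slope):
    return intercept + slope * x

# Unscaled errors: take yerr at face value (lmfit scale_covar=False)
popt, pcov_abs = curve_fit(line, x, y, sigma=yerr, absolute_sigma=True)
# Scaled errors: covariance rescaled by chi2/ndf (lmfit scale_covar=True)
_, pcov_scaled = curve_fit(line, x, y, sigma=yerr, absolute_sigma=False)

pulls = (y - line(x, *popt)) / yerr
redchi = np.sum(pulls**2) / (len(x) - len(popt))

err_unscaled = np.sqrt(np.diag(pcov_abs))   # ~[0.095, 0.449]
err_scaled = np.sqrt(np.diag(pcov_scaled))  # ~[0.202, 0.953]
# The two differ exactly by sqrt(reduced chi-square) ~ sqrt(4.50) ~ 2.12
```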

[Attachment: Screen Shot 2019-12-16 at 02.35.22.png — plot of the data points with the linear fit]
 
The central data point is a bit over 3 standard deviations away from the fit. If you think this is a good data point and nothing went wrong with it use the scaled up uncertainties. Most likely something went wrong there, so it is better to be conservative.
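The 3-standard-deviation figure can be checked directly from the data and the fit values quoted above; a quick sketch:

```python
import numpy as np

x = np.array([-0.312, -0.217, -0.081, 0.0, 0.211])
y = np.array([-8.050, -5.278, -3.510, 0.0, 5.521])
yerr = np.array([0.121, 0.218, 0.421, 0.115, 0.305])

# Best-fit line quoted earlier in the thread (intercept, slope)
a, b = -0.02108874, 25.6603677
pulls = (y - (a + b * x)) / yerr  # residuals in units of the quoted errors
# The central point sits ~3.35 sigma from the line; all the others are < 1.5.
```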
 
mfb said:
The central data point is a bit over 3 standard deviations away from the fit. If you think this is a good data point and nothing went wrong with it use the scaled up uncertainties. Most likely something went wrong there, so it is better to be conservative.
Thank you for this! I dug a bit deeper into my problem, specifically into where my errors come from. I have a counting measurement, and I fit the counts with a curve. From that fit I extract the centers of the peaks for different measurements, and the plot I showed previously shows the differences between such centers. So the errors there come from the errors on the peak centers, i.e. from the errors reported by the fitting program (the same one as before, lmfit). I attach such a counting fit below. I have noticed that I have exactly the same problem here: scaling my errors gives me twice the error on the parameters (mainly the peak center) compared to when I don't scale them. So the issue I mentioned above is, I think, just a propagation of the issue from here. In the plot below, the errors are just Poisson, i.e. the square root of the number of counts, and the fit is motivated on theoretical grounds. Do you know why I have this problem here in the first place, when in principle both the fit and the errors should be right? Thank you so much for your help!
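One way to sanity-check the Poisson-error assumption is a toy simulation: when the model is right and the errors really are sqrt(N), the reduced chi-square comes out near 1 and covariance scaling is essentially a no-op. A sketch with simulated counts (the peak shape and all numbers are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def peak(x, amp, center, width, bg):
    """Gaussian peak on a flat background (made-up model for illustration)."""
    return bg + amp * np.exp(-0.5 * ((x - center) / width) ** 2)

# Simulate a counting experiment whose errors really are Poisson
x = np.linspace(-5.0, 5.0, 60)
counts = rng.poisson(peak(x, 200.0, 0.0, 1.0, 30.0)).astype(float)
yerr = np.sqrt(counts)  # sqrt(N) errors, as in the post above

popt, pcov = curve_fit(peak, x, counts, p0=[150.0, 0.2, 1.5, 20.0],
                       sigma=yerr, absolute_sigma=True)
redchi = np.sum(((counts - peak(x, *popt)) / yerr) ** 2) / (len(x) - 4)
# With correct errors and the right model, redchi lands near 1, so scaled
# and unscaled parameter errors nearly coincide; a redchi of ~4.5 means the
# points scatter by more than sqrt(N), or the model is off.
```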

[Attachment: Screen Shot 2019-12-16 at 20.22.42.png — counting spectrum with the peak fit]
 
Your uncertainties are clearly smaller than the spread of the measurements. Something makes them spread by more than the square root of the counts. In addition, the fit doesn't do a good job on the two larger peaks, and it overestimates the flat area to the right of the last peak.
 