# Extract error from simulated data

• I
• kelly0303
kelly0303
Hello! Say I have some measurements ##y_i = f(x_i|a_1,a_2,...,a_n)## for different values of ##x_i##. Here ##a_i##'s are the parameters of the function I want to fit for. For example for a linear function I would just have ##y_i=ax_i+b=f(x_i|a,b)##. I want to see how the errors on the ##y_i##'s are reflected into errors on the parameters of the function. What is the best way to simulate this? I was thinking to generate some values of ##y_i## based on the function, and then replace each ##y_i## by a randomly generated number, ##y_i'##, from a Gaussian distribution with mean ##y_i## and standard deviation ##\sigma_{y_i}##. Then I would do a least square fit to ##y_i'=f(x_i)##, including the errors on ##y_i'## i.e. ##\sigma_{y_i}## and from there I would get a value for my parameters and an associated error i.e. ##a_i\pm \sigma_{a_i}##. Then I would change the values of ##\sigma_{y_i}## and repeat all this. In this way I would basically have something of the form ##\sigma_{y_i}## vs ##\sigma_{a_i} ##. I am not totally sure tho if this is the best approach. Can someone tell me how to do it? Thank you!

Homework Helper
Gold Member
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?

Gold Member
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.

kelly0303
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?
I don't have any measured data. I just want to simulate some mock data and fit to it. I can have any number of ##(x,y)## pairs. It depends on the problem I am trying to solve, but definitely more than 2.

kelly0303
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.

Gold Member
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.
Gotcha

Is this the procedure you have in mind?
• choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
• generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b## where a and b are values that you choose in the simulation
• do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means estimate on ##a## as opposed to the value to you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit which does Levenburg-Marquardt fitting and returns uncertainty based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/

Gold Member
Also, this thread is somewhat relevant. It specifically pertains to the case where there is error on the x-axis ("measurement error" is the statistician's term for it).

kelly0303
Gotcha

Is this the procedure you have in mind?
• choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
• generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b## where a and b are values that you choose in the simulation
• do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means estimate on ##a## as opposed to the value to you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit which does Levenburg-Marquardt fitting and returns uncertainty based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/
Thank you! Yes, that's exactly what I had in mind. For the linear fit, are you saying that doing a fit or doing error propagations are equivalent? I assume that doing a fit is faster given that I can just use scipy (and 3 lines of code), no?

Why can't I use scipy curve fit for a non-linear function? Their program returns the correlation matrix for the parameters. Isn't that what I am looking for?

• Twigg
Gold Member
For the linear fit, are you saying that doing a fit or doing error propagations are equivalent?
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion

Why can't I use scipy curve fit for a non-linear function? Their program returns the correlation matrix for the parameters. Isn't that what I am looking for?
Scipy curve fit looks perfect. I'm just a beginner at scipy kelly0303
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion

Scipy curve fit looks perfect. I'm just a beginner at scipy Thank you! So for the case of non-linear fit (or where I can't solve the problem analytically). How should I proceed exactly? For example for a given assumed error on the measurement, say ##1kHz## (assume same error on all data points), I would do the steps above and obtain some estimates and errors for the parameters. Then I would repeat this procedure (same error on the measurements) for a big number of time, say 1000, and get 1000 estimates and errors on the parameters. Then I would use basics statistics to calculate the mean and error from all these 1000 values. Then I would change the error on the measurement (say 100 Hz) and redo the 1000 samples again and so on. Is this what you meant?

• Twigg
Homework Helper
Gold Member
A modern regression analysis would only include terms that show some statistically significant improvement of the estimate. That makes the relationship between the errors of the y values and the errors of the parameter estimates very complicated and discontinuous. I doubt that there is a good analytical approach.

• Twigg
Gold Member
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with inverse-variance weighting scheme given by the known error (1000kHz in your first example) squared (1e12 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3 I'd bet) or it'll make the Levenburg-Marquardt iteration diverge. I forget which but worth a try!

kelly0303
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with inverse-variance weighting scheme given by the known error (1000kHz in your first example) squared (1e12 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3 I'd bet) or it'll make the Levenburg-Marquardt iteration diverge. I forget which but worth a try!
But if I use the exact values for ##y_i##, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?

Also, what do you mean by just one iteration? You mean not do that 1000 times for each error value on ##y_i##? Would that work? For some reason I thought that in Monte Carlo approach (not sure if this counts as MC tho), I need lots of samples and then average over them. Would just one sample for each error give me the same result as many samples and averaging?

• Twigg
Gold Member
Would just one sample for each error give me the same result as many samples and averaging?
One sample wouldn't give you the full story, as you say.

But if I use the exact values for yi, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?
Yes, it will be smaller. That's why I threw out that factor of less than 3ish (just a total guess on my part).

The idea was if you were in a hurry to get a number, just a rough guess, you could do a single iteration like I was saying. It should be same order as the result you would get with your procedure, just a little small. You have the right idea

not sure if this counts as MC
I probably overuse the term! Something long those lines though.