Extract error from simulated data

  • #1
kelly0303
Hello! Say I have some measurements ##y_i = f(x_i|a_1,a_2,...,a_n)## for different values of ##x_i##. Here the ##a_i##'s are the parameters of the function I want to fit for. For example, for a linear function I would just have ##y_i=ax_i+b=f(x_i|a,b)##. I want to see how the errors on the ##y_i##'s are reflected in the errors on the parameters of the function. What is the best way to simulate this? I was thinking of generating some values of ##y_i## based on the function, and then replacing each ##y_i## by a randomly generated number, ##y_i'##, from a Gaussian distribution with mean ##y_i## and standard deviation ##\sigma_{y_i}##. Then I would do a least-squares fit to ##y_i'=f(x_i)##, including the errors on ##y_i'##, i.e. ##\sigma_{y_i}##, and from there I would get a value for my parameters and an associated error, i.e. ##a_i\pm \sigma_{a_i}##. Then I would change the values of ##\sigma_{y_i}## and repeat all this. In this way I would basically have something of the form ##\sigma_{y_i}## vs ##\sigma_{a_i}##. I am not totally sure, though, if this is the best approach. Can someone tell me how to do it? Thank you!
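Here is roughly what I have in mind in code, as a rough sketch using numpy and scipy; the linear model, the x grid, the "true" parameter values, and the noise level are just placeholders I picked for illustration:

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    # linear model y = a*x + b; any f(x | a_1, ..., a_n) could go here
    return a * x + b

rng = np.random.default_rng(seed=1)

# illustrative "true" parameters, x grid, and measurement error
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 10.0, 25)
sigma_y = 0.5

# noiseless y_i, then replace each by a Gaussian draw y_i' ~ N(y_i, sigma_y)
y_true = f(x, a_true, b_true)
y_prime = rng.normal(y_true, sigma_y)

# weighted least-squares fit; absolute_sigma=True treats sigma_y as an absolute error bar
popt, pcov = curve_fit(f, x, y_prime,
                       sigma=np.full_like(x, sigma_y), absolute_sigma=True)
sigma_a, sigma_b = np.sqrt(np.diag(pcov))
print(popt, sigma_a, sigma_b)
[/CODE]

Repeating this for several values of ##\sigma_y## would build up the ##\sigma_{y_i}## vs ##\sigma_{a_i}## relation I described.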
 
  • #2
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?
 
  • #3
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.
 
  • #4
FactChecker said:
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?
I don't have any measured data. I just want to simulate some mock data and fit to it. I can have any number of ##(x,y)## pairs. It depends on the problem I am trying to solve, but definitely more than 2.
 
  • #5
Twigg said:
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.
 
  • #6
kelly0303 said:
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.
Gotcha

Is this the procedure you have in mind?
  • choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
  • generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b##, where ##a## and ##b## are values that you choose in the simulation
  • do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means the estimate of ##a##, as opposed to the value you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.
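For instance, in the special case where every point carries the same known error ##\sigma_y## (just an illustrative assumption), propagating it through the ordinary least-squares formulas for ##y = \alpha + \beta x## gives

$$s_{\hat{\beta}} = \frac{\sigma_y}{\sqrt{\sum_i (x_i-\bar{x})^2}}, \qquad s_{\hat{\alpha}} = \sigma_y \sqrt{\frac{\sum_i x_i^2}{N \sum_i (x_i-\bar{x})^2}},$$

so in that case the parameter errors simply scale linearly with ##\sigma_y##.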

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit, which does Levenberg-Marquardt fitting and returns uncertainties based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/
 
  • #7
Also, this thread is somewhat relevant. It specifically pertains to the case where there is error on the x-axis ("measurement error" is the statistician's term for it).
 
  • #8
Twigg said:
Gotcha

Is this the procedure you have in mind?
  • choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
  • generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b##, where ##a## and ##b## are values that you choose in the simulation
  • do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means the estimate of ##a##, as opposed to the value you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit, which does Levenberg-Marquardt fitting and returns uncertainties based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/
Thank you! Yes, that's exactly what I had in mind. For the linear fit, are you saying that doing a fit and doing error propagation are equivalent? I assume that doing a fit is faster, given that I can just use scipy (and 3 lines of code), no?

Why can't I use scipy curve fit for a non-linear function? Their program returns the covariance matrix for the parameters. Isn't that what I am looking for?
 
  • #9
kelly0303 said:
For the linear fit, are you saying that doing a fit and doing error propagation are equivalent?
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so that you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing, really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion.

kelly0303 said:
Why can't I use scipy curve fit for a non-linear function? Their program returns the covariance matrix for the parameters. Isn't that what I am looking for?
Scipy curve fit looks perfect. I'm just a beginner at scipy o:)
 
  • #10
Twigg said:
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so that you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing, really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion.

Scipy curve fit looks perfect. I'm just a beginner at scipy o:)
Thank you! So for the case of a non-linear fit (or where I can't solve the problem analytically), how should I proceed exactly? For example, for a given assumed error on the measurements, say 1 kHz (assume the same error on all data points), I would do the steps above and obtain some estimates and errors for the parameters. Then I would repeat this procedure (same error on the measurements) a large number of times, say 1000, and get 1000 estimates and errors for the parameters. Then I would use basic statistics to calculate the mean and error from all these 1000 values. Then I would change the error on the measurement (say 100 Hz) and redo the 1000 samples, and so on. Is this what you meant?
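In code, the loop I'm describing might look something like this sketch (again using scipy's curve_fit; the linear model and the specific error levels are only placeholders for whatever model and errors apply):

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    # placeholder model; swap in the actual line shape being fit
    return a * x + b

rng = np.random.default_rng(seed=2)
x = np.linspace(0.0, 10.0, 25)      # illustrative x grid
a_true, b_true = 2.0, 1.0           # illustrative "true" parameters
y_true = f(x, a_true, b_true)

n_trials = 1000
for sigma_y in (1.0, 0.5, 0.1):     # illustrative error levels (stand-ins for e.g. 1 kHz, 100 Hz)
    sigma_a_samples = []
    for _ in range(n_trials):
        y_prime = rng.normal(y_true, sigma_y)           # add noise
        popt, pcov = curve_fit(f, x, y_prime,
                               sigma=np.full_like(x, sigma_y),
                               absolute_sigma=True)
        sigma_a_samples.append(np.sqrt(pcov[0, 0]))     # fitted error on a
    # summarize the n_trials fitted uncertainties for this error level
    print(sigma_y, np.mean(sigma_a_samples), np.std(sigma_a_samples))
[/CODE]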
 
  • #11
A modern regression analysis would only include terms that show some statistically significant improvement of the estimate. That makes the relationship between the errors of the y values and the errors of the parameter estimates very complicated and discontinuous. I doubt that there is a good analytical approach.
 
  • #12
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with an inverse-variance weighting scheme given by the known error (1 kHz in your first example) squared (1e6 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3, I'd bet) or it'll make the Levenberg-Marquardt iteration diverge. I forget which :oldbiggrin: but worth a try!
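As a sketch of that one-shot check with scipy's curve_fit (which uses a Levenberg-Marquardt-style solver for unconstrained problems and returns the covariance matrix); the Lorentzian, its parameter values, and the assumed 1-unit error are all just placeholders, and whether it converges may depend on the starting guess:

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, x0, gamma, A):
    # stand-in for whatever line shape is actually being fit
    return A * gamma**2 / ((x - x0)**2 + gamma**2)

x = np.linspace(-50.0, 50.0, 101)   # illustrative frequency axis
p_true = (0.0, 10.0, 1000.0)        # illustrative x0, gamma, A
sigma_y = 1.0                       # assumed known error on every point

# y values set exactly equal to the model, no added noise
y_exact = lorentzian(x, *p_true)

# absolute_sigma=True matters here: with zero residuals, the default rescaling
# by reduced chi-square would collapse the covariance matrix to ~0
popt, pcov = curve_fit(lorentzian, x, y_exact, p0=p_true,
                       sigma=np.full_like(x, sigma_y), absolute_sigma=True)
print(np.sqrt(np.diag(pcov)))       # one-shot estimate of the parameter errors
[/CODE]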
 
  • #13
Twigg said:
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with an inverse-variance weighting scheme given by the known error (1 kHz in your first example) squared (1e6 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3, I'd bet) or it'll make the Levenberg-Marquardt iteration diverge. I forget which :oldbiggrin: but worth a try!
But if I use the exact values for ##y_i##, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?

Also, what do you mean by just one iteration? You mean not doing that 1000 times for each error value on ##y_i##? Would that work? For some reason I thought that in the Monte Carlo approach (not sure if this counts as MC, though), I need lots of samples and then to average over them. Would just one sample for each error give me the same result as many samples and averaging?
 
  • #14
kelly0303 said:
Would just one sample for each error give me the same result as many samples and averaging?
One sample wouldn't give you the full story, as you say.

kelly0303 said:
But if I use the exact values for ##y_i##, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?
Yes, it will be smaller. That's why I threw out that factor of less than 3ish (just a total guess on my part).

The idea was, if you were in a hurry to get a number, just a rough guess, you could do a single iteration like I was saying. It should be the same order as the result you would get with your procedure, just a little small. You have the right idea.

kelly0303 said:
not sure if this counts as MC
I probably overuse the term! Something along those lines, though.
 

1. What is simulated data?

Simulated data is artificial data that is created using a computer program or mathematical model to mimic real-world data. It is often used in scientific research to test hypotheses or to evaluate the performance of statistical methods.

2. How is error extracted from simulated data?

Error can be extracted from simulated data by comparing it to the known ground truth or expected values. This can be done by calculating the difference between the simulated data and the expected values, or by using statistical methods such as regression analysis or hypothesis testing.

3. Why is it important to extract error from simulated data?

Extracting error from simulated data allows researchers to evaluate the accuracy and reliability of their simulations. It also helps in identifying potential flaws or biases in the simulation model, which can then be addressed to improve the validity of the results.

4. What are some common sources of error in simulated data?

Common sources of error in simulated data include simplifying assumptions made in the simulation model, inaccuracies in the input parameters or data, and limitations in the computational power or software used to run the simulation.

5. How can error be minimized in simulated data?

To minimize error in simulated data, it is important to carefully select and validate the input parameters and data used in the simulation. It is also helpful to run multiple simulations with varying parameters to assess the sensitivity of the results. Additionally, using more advanced simulation techniques and improving the computational resources can also help reduce error in simulated data.
