Extract error from simulated data

  • #1
kelly0303
Hello! Say I have some measurements ##y_i = f(x_i|a_1,a_2,...,a_n)## for different values of ##x_i##. Here the ##a_i##'s are the parameters of the function I want to fit for. For example, for a linear function I would just have ##y_i=ax_i+b=f(x_i|a,b)##. I want to see how the errors on the ##y_i##'s are reflected in the errors on the parameters of the function. What is the best way to simulate this? I was thinking of generating some values of ##y_i## based on the function, and then replacing each ##y_i## by a randomly generated number, ##y_i'##, from a Gaussian distribution with mean ##y_i## and standard deviation ##\sigma_{y_i}##. Then I would do a least-squares fit to ##y_i'=f(x_i)##, including the errors on ##y_i'##, i.e. ##\sigma_{y_i}##, and from there I would get a value for my parameters and an associated error, i.e. ##a_i\pm \sigma_{a_i}##. Then I would change the values of ##\sigma_{y_i}## and repeat all this. In this way I would basically have something of the form ##\sigma_{y_i}## vs ##\sigma_{a_i}##. I am not totally sure, though, if this is the best approach. Can someone tell me how to do it? Thank you!
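Here is roughly what I have in mind in code, as a rough sketch using numpy and scipy; the linear model, the x grid, the "true" parameter values, and the noise level are just placeholders I picked for illustration:

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    # linear model y = a*x + b; any f(x | a_1, ..., a_n) could go here
    return a * x + b

rng = np.random.default_rng(seed=1)

# illustrative "true" parameters, x grid, and measurement error
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 10.0, 25)
sigma_y = 0.5

# noiseless y_i, then replace each by a Gaussian draw y_i' ~ N(y_i, sigma_y)
y_true = f(x, a_true, b_true)
y_prime = rng.normal(y_true, sigma_y)

# weighted least-squares fit; absolute_sigma=True treats sigma_y as an absolute error bar
popt, pcov = curve_fit(f, x, y_prime,
                       sigma=np.full_like(x, sigma_y), absolute_sigma=True)
sigma_a, sigma_b = np.sqrt(np.diag(pcov))
print(popt, sigma_a, sigma_b)
[/CODE]

Repeating this for several values of ##\sigma_y## would build up the ##\sigma_{y_i}## vs ##\sigma_{a_i}## relation I described.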
 
  • #2
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?
 
  • #3
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.
 
  • #4
FactChecker said:
I have a question about how you are describing this. The ##y_i##s are measured values, not the exact result of the function ##f##, unless ##f## is an exact curve fit through those points. Is that the case? If so, are there only two ##y## values in the simple linear example?
I don't have any measured data. I just want to simulate some mock data and fit to it. I can have any number of ##(x,y)## pairs. It depends on the problem I am trying to solve, but definitely more than 2.
 
  • #5
Twigg said:
Another follow up. The data ##y_i##, are they each just a single measurement or are they an average measurement + an error bar on ##y_i##?

I have a feeling your issue can be resolved by doing propagation of error on the closed formula for linear regression. This is how both linear and nonlinear regression algorithms calculate uncertainty on the parameters.
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.
 
  • #6
kelly0303 said:
In practice ##y_i## is a frequency measurement. So basically I have a counting experiment (with Poisson error on each point) and then I fit these points with a Voigt/Lorentz profile and extract the error from the fit.
Gotcha

Is this the procedure you have in mind?
  • choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
  • generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b##, where ##a## and ##b## are values that you choose in the simulation
  • do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means the estimate of ##a##, as opposed to the value you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.
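For instance, in the special case where every point carries the same known error ##\sigma_y## (just an illustrative assumption), propagating it through the ordinary least-squares formulas for ##y = \alpha + \beta x## gives

$$s_{\hat{\beta}} = \frac{\sigma_y}{\sqrt{\sum_i (x_i-\bar{x})^2}}, \qquad s_{\hat{\alpha}} = \sigma_y \sqrt{\frac{\sum_i x_i^2}{N \sum_i (x_i-\bar{x})^2}},$$

so in that case the parameter errors simply scale linearly with ##\sigma_y##.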

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit, which does Levenberg-Marquardt fitting and returns uncertainties based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/
 
  • #7
Also, this thread is somewhat relevant. It specifically pertains to the case where there is error on the x-axis ("measurement error" is the statistician's term for it).
 
  • #8
Twigg said:
Gotcha

Is this the procedure you have in mind?
  • choose some known values of the x-axis: ##x_i## for ##1\leq i \leq N##
  • generate artificial y-axis (frequency) measurements ##y_i## that are each normally distributed with standard deviation ##\sigma_{y_i}## and with mean ##ax_i + b##, where ##a## and ##b## are values that you choose in the simulation
  • do linear regression to find estimates for the uncertainty on the parameters ##\hat{a}## and ##\hat{b}## (the hat on ##\hat{a}## means the estimate of ##a##, as opposed to the value you coded into the simulation)
If so, you don't need to simulate data. The Monte Carlo approach you describe can be replaced with error propagation.

Take a skim through this wikipedia page. You'll find expressions in the "Model-based Properties" sections for ##s_{\hat{\alpha}}## and ##s_{\hat{\beta}}## which correspond to the standard deviations on ##\hat{a}## and ##\hat{b}## here.

If you use a nonlinear model in the future, you can use a numerical routine like Matlab's nlinfit, which does Levenberg-Marquardt fitting and returns uncertainties based on local linear regression. I don't know a good equivalent to it in scipy off the top of my head :/
Thank you! Yes, that's exactly what I had in mind. For the linear fit, are you saying that doing a fit and doing error propagation are equivalent? I assume that doing a fit is faster, given that I can just use scipy (and 3 lines of code), no?

Why can't I use scipy curve fit for a non-linear function? Their program returns the covariance matrix for the parameters. Isn't that what I am looking for?
 
  • #9
kelly0303 said:
For the linear fit, are you saying that doing a fit and doing error propagation are equivalent?
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so that you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing, really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion.

kelly0303 said:
Why can't I use scipy curve fit for a non-linear function? Their program returns the covariance matrix for the parameters. Isn't that what I am looking for?
Scipy curve fit looks perfect. I'm just a beginner at scipy o:)
 
  • #10
Twigg said:
If you did the Monte Carlo approach, you'd need to do a handful of runs of the simulation so that you get a nice sampling over many different possibilities for the ##y_i##'s. Instead, you can solve the problem analytically (to first order, at least) by propagating the errors ##\sigma_{y_i}## through the explicit formulas for linear regression. It's just an efficiency thing, really. In fact, the Monte Carlo approach doesn't make the first-order approximation that error propagation does. It's really up to you, just thought I'd make the suggestion.

Scipy curve fit looks perfect. I'm just a beginner at scipy o:)
Thank you! So for the case of a non-linear fit (or where I can't solve the problem analytically), how should I proceed exactly? For example, for a given assumed error on the measurements, say 1 kHz (assume the same error on all data points), I would do the steps above and obtain some estimates and errors for the parameters. Then I would repeat this procedure (same error on the measurements) a large number of times, say 1000, and get 1000 estimates and errors for the parameters. Then I would use basic statistics to calculate the mean and error from all these 1000 values. Then I would change the error on the measurement (say 100 Hz) and redo the 1000 samples, and so on. Is this what you meant?
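In code, the loop I'm describing might look something like this sketch (again using scipy's curve_fit; the linear model and the specific error levels are only placeholders for whatever model and errors apply):

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    # placeholder model; swap in the actual line shape being fit
    return a * x + b

rng = np.random.default_rng(seed=2)
x = np.linspace(0.0, 10.0, 25)      # illustrative x grid
a_true, b_true = 2.0, 1.0           # illustrative "true" parameters
y_true = f(x, a_true, b_true)

n_trials = 1000
for sigma_y in (1.0, 0.5, 0.1):     # illustrative error levels (stand-ins for e.g. 1 kHz, 100 Hz)
    sigma_a_samples = []
    for _ in range(n_trials):
        y_prime = rng.normal(y_true, sigma_y)           # add noise
        popt, pcov = curve_fit(f, x, y_prime,
                               sigma=np.full_like(x, sigma_y),
                               absolute_sigma=True)
        sigma_a_samples.append(np.sqrt(pcov[0, 0]))     # fitted error on a
    # summarize the n_trials fitted uncertainties for this error level
    print(sigma_y, np.mean(sigma_a_samples), np.std(sigma_a_samples))
[/CODE]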
 
  • #11
A modern regression analysis would only include terms that show some statistically significant improvement of the estimate. That makes the relationship between the errors of the y values and the errors of the parameter estimates very complicated and discontinuous. I doubt that there is a good analytical approach.
 
  • #12
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with an inverse-variance weighting scheme given by the known error (1 kHz in your first example) squared (1e6 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3, I'd bet) or it'll make the Levenberg-Marquardt iteration diverge. I forget which :oldbiggrin: but worth a try!
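As a sketch of that one-shot check with scipy's curve_fit (which uses a Levenberg-Marquardt-style solver for unconstrained problems and returns the covariance matrix); the Lorentzian, its parameter values, and the assumed 1-unit error are all just placeholders, and whether it converges may depend on the starting guess:

[CODE=python]
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, x0, gamma, A):
    # stand-in for whatever line shape is actually being fit
    return A * gamma**2 / ((x - x0)**2 + gamma**2)

x = np.linspace(-50.0, 50.0, 101)   # illustrative frequency axis
p_true = (0.0, 10.0, 1000.0)        # illustrative x0, gamma, A
sigma_y = 1.0                       # assumed known error on every point

# y values set exactly equal to the model, no added noise
y_exact = lorentzian(x, *p_true)

# absolute_sigma=True matters here: with zero residuals, the default rescaling
# by reduced chi-square would collapse the covariance matrix to ~0
popt, pcov = curve_fit(lorentzian, x, y_exact, p0=p_true,
                       sigma=np.full_like(x, sigma_y), absolute_sigma=True)
print(np.sqrt(np.diag(pcov)))       # one-shot estimate of the parameter errors
[/CODE]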
 
  • #13
Twigg said:
I agree that there's likely no good analytical solution in the general non-linear case (or if there is one, it would be highly model-dependent, like polynomial regression or Fourier analysis). @FactChecker I'm curious about the discontinuity you mentioned, but maybe that's a subject best left to a PM so we don't derail the thread.

@kelly0303 That approach sounds good. I believe you could get a great estimate by just doing one iteration of the steps above, where the y-values are exactly equal to the model function ##y_i = f(x_i | a,b,c,...)## but with an inverse-variance weighting scheme given by the known error (1 kHz in your first example) squared (1e6 Hz^2). I can't remember, but I think this will either give you a fit with a smallish covariance matrix (small by a factor of <3, I'd bet) or it'll make the Levenberg-Marquardt iteration diverge. I forget which :oldbiggrin: but worth a try!
But if I use the exact values for ##y_i##, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?

Also, what do you mean by just one iteration? You mean not doing that 1000 times for each error value on ##y_i##? Would that work? For some reason I thought that in the Monte Carlo approach (not sure if this counts as MC, though), I need lots of samples and then to average over them. Would just one sample for each error give me the same result as many samples and averaging?
 
  • #14
kelly0303 said:
Would just one sample for each error give me the same result as many samples and averaging?
One sample wouldn't give you the full story, as you say.

kelly0303 said:
But if I use the exact values for ##y_i##, without adding some extra noise, won't the error on the parameters be smaller, even if I add the inverse weighting (in my idea I would add both the noise and the inverse weighting)? Is that factor of 3 because of that?
Yes, it will be smaller. That's why I threw out that factor of less than 3ish (just a total guess on my part).

The idea was, if you were in a hurry to get a number, just a rough guess, you could do a single iteration like I was saying. It should be the same order as the result you would get with your procedure, just a little small. You have the right idea.

kelly0303 said:
not sure if this counts as MC
I probably overuse the term! Something along those lines, though.
 

1. What is simulated data?

Simulated data is artificial data that is created using a computer program or mathematical model to mimic real-world data. It is often used in scientific research to test hypotheses or to evaluate the performance of statistical methods.

2. How is error extracted from simulated data?

Error can be extracted from simulated data by comparing it to the known ground truth or expected values. This can be done by calculating the difference between the simulated data and the expected values, or by using statistical methods such as regression analysis or hypothesis testing.

3. Why is it important to extract error from simulated data?

Extracting error from simulated data allows researchers to evaluate the accuracy and reliability of their simulations. It also helps in identifying potential flaws or biases in the simulation model, which can then be addressed to improve the validity of the results.

4. What are some common sources of error in simulated data?

Common sources of error in simulated data include simplifying assumptions made in the simulation model, inaccuracies in the input parameters or data, and limitations in the computational power or software used to run the simulation.

5. How can error be minimized in simulated data?

To minimize error in simulated data, it is important to carefully select and validate the input parameters and data used in the simulation. It is also helpful to run multiple simulations with varying parameters to assess the sensitivity of the results. Additionally, using more advanced simulation techniques and improving the computational resources can also help reduce error in simulated data.
