Fitting distribution to histogram with low number of counts

deccard · Feb 24, 2015

I have one dimensional binned data that has a peak to which I need to fit a distribution, such as Gaussian or Lorentzian, that is described with four parameters, height, width, centroid position and the background. The problem is that the counts per bin are low and the peak is only 5-6 bins wide in a quite large amount of background. The bins belonging to the background have usually zero or one counts, and the bin containing the largest amount of counts has about 15 counts.

Counts in each bin are Poisson distributed as the counts are random and independent. The background has the same average rate through out the data.

The real world example could be a spot of normally spread long lived radioactive material on ground that is measure by moving a scintillator detector through the spot one meter at a time keeping the measurement time the same in each measurement.

My first idea was to fit the data using non-linear least square method but This, I guess, will fail because of the low number of counts per bin thus the values are not normally distributed but Poisson distributed? Especially a weighted least square method would not work.

Could I use maximum likelihood estimation method to fit a distribution? I would need to get, let's say, 95% or 68% confidence intervals for the fit parameter in addition to the parameter themselves especially for the background.

In practice I am using Matlab for the fitting. http://se.mathworks.com/help/stats/mle.html

An example data:

0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 1, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 2, 2, 1, 3, 10, 15, 5, 6, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 0, 1, 0

Also, could I use MLE safely for an ash arbitrarily shaped distribution?

Simon Bridge · Feb 25, 2015

You should also have collected a background run ... you know the uncertainties for each bin though.

Your main problem is how to fit a curve with few data points ... start by looking at the data and making some guesses, that will inform your decision-making.
i.e. how big is the background going to be? How wide would you expect to find the FWHM?

With so few actual data (vs background) you can probably do a fit by hand.

What is the context of the data - is this real world data or have you just been handed some numbers?
What is the intended use of the resulting fit? i.e. is it just to be turned in as an assignment?
What level is this to be done at?

deccard · Feb 25, 2015

In principle, if I wanted to get the background only, I could take the counts away from the peak and determine background. Still, there should be a possibility to fit the data so that the background level is included in the fitted profile.

I know the uncertainties for each bin. As the number of counts is low per bin I cannot use just the approximate √N for errors and use weighted least squares fitting. That will fail. For example, the 0 count bins would produce 0 error causing infinite weight to those points. The proper error for each bin is asymmetric making weighted least squares method for fitting inappropriate.

The mean value of the Poisson distribution that produces 0 counts, for example, can be said to have to be between 0 and 3.689 at 95% confidence interval.This can be calculated by using the Chi-squared distribution.

The data that I gave was actually just my simulated data that I produced Matlab using Poisson distribution. The point for me is to learn to fit this data with the most appropriate method from the point of statistics, so that I would get an lower and upper bound, with some confidence interval, for the fitting parameters of the distribution that includes the background.

Simon Bridge · Feb 25, 2015

i.e. how big is the background going to be? How wide would you expect to find the FWHM?

It sounds like you can think of all the ways this can be hard. Try thinking of ways it could be simple.
i.e. you have two distributions summed really - the distribution of the background and the distribution giving rise to the peak.
So how is the background distributed?

You are asked to fit "a distribution" ... presumably not all of them?
Which sort of distribution seems appropriate here?

It strikes me that this is an assignment more about how you make these decisions than it is about your ability to use algorithms.

deccard · Feb 25, 2015

Simon Bridge said:

i.e. how big is the background going to be? How wide would you expect to find the FWHM?

So how is the background distributed?

You are asked to fit "a distribution" ... presumably not all of them?
Which sort of distribution seems appropriate here?
.

Yes, I should have said that the example data was made using Gaussian distribution.

The simulated data is done with Matlab command: poissrnd(ones(1,length(x))*0.4+13*exp(-(x-8.7).^2/(1.89^2))), where the x = 0 to 100 in steps of 1 (x=0:1:100)

That means that each bin has background counts taken from Poisson distribution with the mean of 0.4. That is how big the background is. This is because random counting processes are Poisson processes and it is counting processes, such as taking events from a radiation monitor, that I want to analyze data for.

On top of the background there is a peak generated by the Gaussian function that has the FWHM of [tex]1.89\sqrt{ln(2)}[/tex], or approximately 1.5735.

These I know, of course, because I have generated the data myself. But let's say that this was an actual measured data. Then I would not know these values.

I would perhaps know that the peak should be Gaussian. Then I would want to fit function

$$f(x) = a_{0} + a_{1}e^{-\big(\frac{x-a_3}{a_4}\big)^2}$$

to the data.

By knowing that the counts in the bins are Poisson distributed, I should be able to fit it and then get an upper and lower bounds with certain confidence intervals for the fitting parameters.

The Gaussian was just an example. I don't want to be restricted to only gaussians. I would like to know the way to fit also it if the peak was Lorenzian instaed.

Simon Bridge · Feb 25, 2015

In an actual experiment, you would have two runs. A control run for the background and an experiment run that will have one or more peaks.
You would then subtract the background from the experiment - leaving the peaks ... which would IRL be Lorentzian. The background distribution would normally be thermal. You'd have to propagate any errors from the background subtraction, so the background run is usually done to extreme accuracy.

Some sort of interpolation would be used to work out a FWHM and maxima for the peak sans background.
Note: the shape of the peaks is usually a characteristic of the detector - particularly in radiation experiments.

Since you don't have a background run, you need to find a way to estimate it.
You do this by looking at the data and making an informed guess
... i.e. group the data a long way from the peak, then work out it's statistics (median, interquartile range, whatever you want to use).

Remember it is each count (the height of each bin) that is poisson all through - you are looking for the relationship between the counts.

It you'd plotted the data by hand, you'd have no trouble estimating a best fit curve.

"Inverse problems with sparse data" may be closer to what you are looking for.
Here the background is modeled by a noise function and the proposed theory by a matrix.

deccard · Feb 25, 2015

Simon Bridge said:

In an actual experiment, you would have two runs. A control run for the background and an experiment run that will have one or more peaks.
You would then subtract the background from the experiment - leaving the peaks.

Yes, you are right that when you are making a measurement of a spectra, you should take a background run if possible. This is not always possible, sometimes not even desired, with a real experiment. I've seen, for example, an ion beam mass spectrum that had a peak on every mass visible. I could measure a spectrum with ion source off to get the background but I'd be fooling myself because what if the background is actually caused by a scattered ions from the ion source.

But this is not the question here.

I can do a non-linear fit of the Gaussian function with the background term. See the attached graph. I don't need any background removal before the fitting. And I can get the fit parameter with upper and lower values with a confidence interval.

The fit parameter with 68 % confidence interval:
$$a_0 = 0.49 \pm 0.09 \\
a_1 = 13.4 \pm 0.8 \\
a_2 = 59.82 \pm 0.07 \\
a_3 = 1.4 \pm 0.1 $$

These are of course much more accurate the just by fitting by hand.

Now the problem:

The fit is not done right. The way I do the fit is wrong because I do non-linear least squares fitting to data points that are Poisson distributed. Notice, for example, that the errors in my fit are symmetric when they should not be. They are symmetric because the way the least squares fitting works.

It does not even matter whether I remove the background before the fit, because I would still be wrong to fit using least squares method as the method assumes normally distributed data points on the peak when in reality they are Poisson distributed.

So if the least squares method is wrong. What is the right method?

Stephen Tashi · Feb 28, 2015

Some useful keywords for searching are in the phrase "deconvolution of mixture distributions". You've defined the actual distributionas a "mixture" of two standard forms of probability distribution. From given data, you want to find how much each contributes to the total probability.

What geometry determines the bins? I don't follow the physics yet. Are the bins something non-geometric, like "frequency" or are they something geometric like "detector at (x,y)"?

It isn't clear what your definition of "confidence interval" is. Perhaps you are thinking of a Bayesian "credible interval".

Simon Bridge · Jun 24, 2015

Are you saying that each data point comes with an asymmetric uncertainty?
This does not mean that the parameters you are fitting to the data must be asymmetrically distributed as well, but it may do - and you want to be able to take that into account. So you need to know how to handle this situation.

So you need a refresher on asymmetric errors and their propagaton:
https://www.slac.stanford.edu/econf/C030908/papers/WEMT002.pdf
... you may also want to learn about the Hougaard measure of skewdness for a parameter estimated via non-linear regression.

Fitting distribution to histogram with low number of counts

Attachments

Graduate Expected numbers of cards of a last color remaining

Undergrad The problem of points

Graduate Probability puzzle

Undergrad The countability paradox of computable numbers

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Fitting distribution to histogram with low number of counts

Attachments

Similar threads