# Fitting distribution to histogram with low number of counts

Tags:
1. Feb 24, 2015

### deccard

I have one dimensional binned data that has a peak to which I need to fit a distribution, such as Gaussian or Lorentzian, that is described with four parameters, height, width, centroid position and the background. The problem is that the counts per bin are low and the peak is only 5-6 bins wide in a quite large amount of background. The bins belonging to the background have usually zero or one counts, and the bin containing the largest amount of counts has about 15 counts.

Counts in each bin are Poisson distributed as the counts are random and independent. The background has the same average rate through out the data.

The real world example could be a spot of normally spread long lived radioactive material on ground that is measure by moving a scintillator detector through the spot one meter at a time keeping the measurement time the same in each measurement.

My first idea was to fit the data using non-linear least square method but This, I guess, will fail because of the low number of counts per bin thus the values are not normally distributed but Poisson distributed? Especially a weighted least square method would not work.

Could I use maximum likelihood estimation method to fit a distribution? I would need to get, let's say, 95% or 68% confidence intervals for the fit parameter in addition to the parameter themselves especially for the background.

In practice I am using Matlab for the fitting. http://se.mathworks.com/help/stats/mle.html

An example data:

0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 1, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 2, 2, 1, 3, 10, 15, 5, 6, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 0, 1, 0

Also, could I use MLE safely for an ash arbitrarily shaped distribution?

2. Feb 25, 2015

### Simon Bridge

You should also have collected a background run ... you know the uncertainties for each bin though.

Your main problem is how to fit a curve with few data points ... start by looking at the data and making some guesses, that will inform your decision-making.
i.e. how big is the background going to be? How wide would you expect to find the FWHM?

With so few actual data (vs background) you can probably do a fit by hand.

What is the context of the data - is this real world data or have you just been handed some numbers?
What is the intended use of the resulting fit? i.e. is it just to be turned in as an assignment?
What level is this to be done at?

3. Feb 25, 2015

### deccard

In principle, if I wanted to get the background only, I could take the counts away from the peak and determine background. Still, there should be a possibility to fit the data so that the background level is included in the fitted profile.

I know the uncertainties for each bin. As the number of counts is low per bin I cannot use just the approximate √N for errors and use weighted least squares fitting. That will fail. For example, the 0 count bins would produce 0 error causing infinite weight to those points. The proper error for each bin is asymmetric making weighted least squares method for fitting inappropriate.

The mean value of the Poisson distribution that produces 0 counts, for example, can be said to have to be between 0 and 3.689 at 95% confidence interval.This can be calculated by using the Chi-squared distribution.

The data that I gave was actually just my simulated data that I produced Matlab using Poisson distribution. The point for me is to learn to fit this data with the most appropriate method from the point of statistics, so that I would get an lower and upper bound, with some confidence interval, for the fitting parameters of the distribution that includes the background.

4. Feb 25, 2015

### Simon Bridge

i.e. how big is the background going to be? How wide would you expect to find the FWHM?

It sounds like you can think of all the ways this can be hard. Try thinking of ways it could be simple.
i.e. you have two distributions summed really - the distribution of the background and the distribution giving rise to the peak.
So how is the background distributed?

You are asked to fit "a distribution" ... presumably not all of them?
Which sort of distribution seems appropriate here?

It strikes me that this is an assignment more about how you make these decisions than it is about your ability to use algorithms.

5. Feb 25, 2015

### deccard

Yes, I should have said that the example data was made using Gaussian distribution.

The simulated data is done with Matlab command: poissrnd(ones(1,length(x))*0.4+13*exp(-(x-8.7).^2/(1.89^2))), where the x = 0 to 100 in steps of 1 (x=0:1:100)

That means that each bin has background counts taken from Poisson distribution with the mean of 0.4. That is how big the background is. This is because random counting processes are Poisson processes and it is counting processes, such as taking events from a radiation monitor, that I want to analyze data for.

On top of the background there is a peak generated by the Gaussian function that has the FWHM of $$1.89\sqrt{ln(2)}$$, or approximately 1.5735.

These I know, of course, because I have generated the data myself. But let's say that this was an actual measured data. Then I would not know these values.

I would perhaps know that the peak should be Gaussian. Then I would want to fit function

$$f(x) = a_{0} + a_{1}e^{-\big(\frac{x-a_3}{a_4}\big)^2}$$

to the data.

By knowing that the counts in the bins are Poisson distributed, I should be able to fit it and then get an upper and lower bounds with certain confidence intervals for the fitting parameters.

The Gaussian was just an example. I don't wanna be restricted to only gaussians. I would like to know the way to fit also it if the peak was Lorenzian instaed.

6. Feb 25, 2015

### Simon Bridge

In an actual experiment, you would have two runs. A control run for the background and an experiment run that will have one or more peaks.
You would then subtract the background from the experiment - leaving the peaks ... which would IRL be Lorentzian. The background distribution would normally be thermal. You'd have to propagate any errors from the background subtraction, so the background run is usually done to extreme accuracy.

Some sort of interpolation would be used to work out a FWHM and maxima for the peak sans background.
Note: the shape of the peaks is usually a characteristic of the detector - particularly in radiation experiments.

Since you don't have a background run, you need to find a way to estimate it.
You do this by looking at the data and making an informed guess
... i.e. group the data a long way from the peak, then work out it's statistics (median, interquartile range, whatever you want to use).

Remember it is each count (the height of each bin) that is poisson all through - you are looking for the relationship between the counts.

It you'd plotted the data by hand, you'd have no trouble estimating a best fit curve.

"Inverse problems with sparse data" may be closer to what you are looking for.
Here the background is modelled by a noise function and the proposed theory by a matrix.

Last edited: Feb 25, 2015
7. Feb 25, 2015

### deccard

Yes, you are right that when you are making a measurement of a spectra, you should take a background run if possible. This is not always possible, sometimes not even desired, with a real experiment. I've seen, for example, an ion beam mass spectrum that had a peak on every mass visible. I could measure a spectrum with ion source off to get the background but I'd be fooling myself because what if the background is actually caused by a scattered ions from the ion source.

But this is not the question here.

I can do a non-linear fit of the Gaussian function with the background term. See the attached graph. I don't need any background removal before the fitting. And I can get the fit parameter with upper and lower values with a confidence interval.

The fit parameter with 68 % confidence interval:
$$a_0 = 0.49 \pm 0.09 \\ a_1 = 13.4 \pm 0.8 \\ a_2 = 59.82 \pm 0.07 \\ a_3 = 1.4 \pm 0.1$$

These are of course much more accurate the just by fitting by hand.

Now the problem:

The fit is not done right. The way I do the fit is wrong because I do non-linear least squares fitting to data points that are Poisson distributed. Notice, for example, that the errors in my fit are symmetric when they should not be. They are symmetric because the way the least squares fitting works.

It does not even matter whether I remove the background before the fit, because I would still be wrong to fit using least squares method as the method assumes normally distributed data points on the peak when in reality they are Poisson distributed.

So if the least squares method is wrong. What is the right method?

#### Attached Files:

• ###### fit.png
File size:
68.9 KB
Views:
126
8. Feb 28, 2015

### Stephen Tashi

Some useful keywords for searching are in the phrase "deconvolution of mixture distributions". You've defined the actual distributionas a "mixture" of two standard forms of probability distribution. From given data, you want to find how much each contributes to the total probability.

What geometry determines the bins? I don't follow the physics yet. Are the bins something non-geometric, like "frequency" or are they something geometric like "detector at (x,y)"?

It isn't clear what your definition of "confidence interval" is. Perhaps you are thinking of a Bayesian "credible interval".

Last edited: Mar 2, 2015
9. Jun 24, 2015

### Simon Bridge

Are you saying that each data point comes with an asymmetric uncertainty?
This does not mean that the parameters you are fitting to the data must be asymmetrically distributed as well, but it may do - and you want to be able to take that into account. So you need to know how to handle this situation.

So you need a refresher on asymmetric errors and their propagaton:
https://www.slac.stanford.edu/econf/C030908/papers/WEMT002.pdf
... you may also want to learn about the Hougaard measure of skewdness for a parameter estimated via non-linear regression.