How to Bin Data for Spectrum Fitting with Poisson Errors?

  • Thread starter: kelly0303
  • Tags: Bin Data

Summary
Binning data for spectrum fitting with Poisson errors requires careful handling of measurement times and count rates. The recommended approach is to sum the counts and the measurement times within each bin and divide to obtain a rate, using the mean energy of the bin for the x-axis. Error estimation is the hard part at low counts: the standard square-root-of-counts uncertainty does not work well when a bin contains very few counts, so a larger bin size is suggested, ideally giving at least 10 counts per bin. The goal is a smooth spectrum that can be fitted with a Voigt profile despite the irregular energy sampling.
  • #31
kelly0303 said:
Thank you for your reply. How should I do time normalization? Assuming I have 1 count in 0.01 seconds and 1 count in 0.02 seconds, should I normalize by taking 2 counts in 0.02 seconds and 1 count in 0.02 seconds, getting an average of 1.5 counts in 0.01 seconds? I am not sure if I can just multiply the counts by 2 (or any integer). That 1 count is the result of a Poisson process so I am not sure that having 1 count in 0.01 seconds implies 2 in 0.02 seconds.

This is a problem with low or zero counts. For zero counts, does that mean actually zero counts, or that you didn't count long enough? In calculating a rate from very low counts over a short time you have a large error. One count in 0.01 seconds gives a rate of 100 cps ± 100 cps, as you know, and adding such data only makes the total error worse. That is why I think adding the counts in a bin and dividing by the total time for that bin is better (and more correct). The more points in a bin, the more confidence you have in the numbers. IMO.

In your example of adding the rates in a bin vs. calculating a rate from the total bin counts divided by the total time, you ignored that the latter method produces an average count rate for the bin. To properly compare, you need to average the former by dividing the summed rates by the number of them. Thus in your example you should compare 55 to 40. But, you say, they do not agree. So what is the problem? Well, first you assumed that the two rates were partially concurrent: 100 cps happening at the same time as the 10 cps for 10 sec. In fact this is not true. They are sequential: 100 cps for 10 sec followed by 10 cps for 20 sec, so the whole interval is 30 sec. How do you average this? You take the time-weighted average: 100 cps for 1/3 of the time and 10 cps for 2/3 of the time. Summing these, you get 40 cps. This is the same reasoning you would use for averaging the rate of a pulsed signal.
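A minimal Python sketch of the comparison above (just an illustration with these example numbers, not code from the thread):

```python
import numpy as np

# Example from this post: 100 cps for 10 s followed by 10 cps for 20 s.
rates = np.array([100.0, 10.0])   # counts per second
times = np.array([10.0, 20.0])    # seconds
counts = rates * times            # 1000 and 200 counts

# Total counts divided by total time for the bin.
rate_from_totals = counts.sum() / times.sum()     # 1200 / 30 = 40 cps

# Time-weighted average of the individual rates.
rate_weighted = np.average(rates, weights=times)  # (100*10 + 10*20) / 30 = 40 cps

# A plain (unweighted) average gives the misleading 55 cps.
print(rate_from_totals, rate_weighted, rates.mean())  # 40.0 40.0 55.0
```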
 
  • #32
kelly0303 said:
Sorry for not coming back to that. I tried it, but it seems like I have some data points (I guess random fluctuations) for which the measurement time is very small (around 0.001 seconds) but I have 1 count in that time. This gives a huge rate, making that bin significantly bigger than the others, even if there is no way for that to be the case physically. Should I somehow do some time weighting? Thank you!
I did wonder! Thanks.
If the data you show is representative, you could first just sum ~10 successive measurements without worrying about the error, and report the total time, total counts, and average energy. Then do #12 as described. Any inaccuracy will be negligible (unless the data is very different); see the sketch below.
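A rough sketch of that pre-grouping step in Python (the group size of 10 and the array names are assumptions for illustration):

```python
import numpy as np

def pre_group(energy, counts, time, group_size=10):
    """Sum every `group_size` successive measurements: total counts, total time,
    and the average energy of each group. An incomplete trailing group is dropped."""
    n = (len(counts) // group_size) * group_size
    e = np.asarray(energy[:n]).reshape(-1, group_size)
    c = np.asarray(counts[:n]).reshape(-1, group_size)
    t = np.asarray(time[:n]).reshape(-1, group_size)
    return e.mean(axis=1), c.sum(axis=1), t.sum(axis=1)
```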
 
  • #33
gleem said:
This is a problem with low or zero counts. For zero counts, does that mean actually zero counts, or that you didn't count long enough? In calculating a rate from very low counts over a short time you have a large error. One count in 0.01 seconds gives a rate of 100 cps ± 100 cps, as you know, and adding such data only makes the total error worse. That is why I think adding the counts in a bin and dividing by the total time for that bin is better (and more correct). The more points in a bin, the more confidence you have in the numbers. IMO.

In your example of adding the rates in a bin vs. calculating a rate from the total bin counts divided by the total time, you ignored that the latter method produces an average count rate for the bin. To properly compare, you need to average the former by dividing the summed rates by the number of them. Thus in your example you should compare 55 to 40. But, you say, they do not agree. So what is the problem? Well, first you assumed that the two rates were partially concurrent: 100 cps happening at the same time as the 10 cps for 10 sec. In fact this is not true. They are sequential: 100 cps for 10 sec followed by 10 cps for 20 sec, so the whole interval is 30 sec. How do you average this? You take the time-weighted average: 100 cps for 1/3 of the time and 10 cps for 2/3 of the time. Summing these, you get 40 cps. This is the same reasoning you would use for averaging the rate of a pulsed signal.
Thanks a lot for this! So the approaches are equivalent, as long as I use a time-weighted average, not just a simple average. What is the error on the rate in this case? If I add the counts and divide by the time, I would get ##\sqrt{1200}/30## as the rate error. If I do the time-weighted average, I get ##\sqrt{1000}/10 \cdot 1/3 + \sqrt{200}/20 \cdot 2/3 = (\sqrt{1000}+\sqrt{200})/30##. Which one is the right error? Thank you!
 
  • #34
Your error calcs are incorrect. For rates the uncertainty is ##\sqrt{\text{rate}/\text{time}}##.
EDIT: Sorry, you are correct here.

The error of the weighted average of the rates is ##\sqrt{(1/3)^2\sigma_{r_1}^{2} + (2/3)^2\sigma_{r_2}^{2}}##, using propagation of errors for the function ##r_1/3 + 2r_2/3##, which gives the same result as ##\sqrt{40/30} = 1.15## cps.
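A quick numerical check of that propagation (a NumPy sketch using the example numbers from earlier in the thread, not code from the original posts):

```python
import numpy as np

# 1000 counts in 10 s (100 cps) and 200 counts in 20 s (10 cps).
counts = np.array([1000.0, 200.0])
times = np.array([10.0, 20.0])
sigma_rates = np.sqrt(counts) / times               # sqrt(N)/t per interval

weights = times / times.sum()                       # 1/3 and 2/3
# Propagation of errors for the time-weighted average r1/3 + 2*r2/3:
sigma_weighted = np.sqrt(np.sum((weights * sigma_rates) ** 2))

# The same error computed directly from the pooled counts:
sigma_pooled = np.sqrt(counts.sum()) / times.sum()  # sqrt(1200)/30 = sqrt(40/30)

print(sigma_weighted, sigma_pooled)                 # both ≈ 1.155 cps
```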
 
  • #35
gleem said:
Your error calcs are incorrect. For rates the uncertainty is ##\sqrt{\text{rate}/\text{time}}##.
EDIT: Sorry, you are correct here.

The error of the weighted average of the rates is ##\sqrt{(1/3)^2\sigma_{r_1}^{2} + (2/3)^2\sigma_{r_2}^{2}}##, using propagation of errors for the function ##r_1/3 + 2r_2/3##, which gives the same result as ##\sqrt{40/30} = 1.15## cps.
Sorry, is the first formula correct? Taking the square root of the time would mean that the error on the rate doesn't have the same units as the rate itself.
 
  • #36
Remember the rate has the time in it already, so the units work out: ##\sqrt{\text{rate}/\text{time}} = \sqrt{N/t^2} = \sqrt{N}/t##, which is in cps.
I agree with everything @gleem has said. Just to reiterate: because you do in fact want the "interval weighted" rate values when you bin (I screwed this up because I didn't understand the relative energy widths), the binned rate is just
##N_{total}/t_{total}##.
Similarly the weighted RMS error sum for the rate ends up being
##\sqrt{N_{total}}/t_{total}##,
using the propagation-of-errors formula in #35. All self-consistent... and simple.
 
  • #37
I don't know if it is cheating, but in a similar data analysis I built a spreadsheet that let me choose the bin size and then looked at the results. I was analyzing microarray data, and there, if you look at signal strength vs. the number of occurrences, the data has to be placed in bins. There are very few with a very strong signal and very many with a very weak signal. Each signal is unique, but if you sort them into bins, you can see the difference between how often a signal response of 0-50 occurs, then how often 50-100, then 100-150, etc.

I would choose the bin size based on the data. It might be cheating, but sometimes a change in the bin size changed the "noise" in the data. I had a spreadsheet where I imported the data and changed the bin size in a single cell, which then recalculated and plotted the results.

You should be able to sort your data, then apply a set of columns that do the binning based on that separate cell. I don't think it matters much whether the plot you generate is vs. the bin midpoint or some other reference point, but I would use the midpoint.
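That spreadsheet workflow translates directly into a few lines of Python; here is a rough sketch (the input file name and the bin widths are placeholders, not from the thread):

```python
import numpy as np
import matplotlib.pyplot as plt

def binned_counts(signal, bin_width):
    """Count how many measurements fall into each [k*w, (k+1)*w) bin."""
    edges = np.arange(0, signal.max() + bin_width, bin_width)
    counts, _ = np.histogram(signal, bins=edges)
    centers = edges[:-1] + bin_width / 2        # plot against the bin midpoint
    return centers, counts

signal = np.loadtxt("signal_strengths.txt")     # hypothetical data file
for width in (25, 50, 100):                     # try a few bin sizes and compare
    centers, counts = binned_counts(signal, width)
    plt.step(centers, counts, where="mid", label=f"bin width {width}")
plt.xlabel("signal strength")
plt.ylabel("occurrences")
plt.legend()
plt.show()
```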
 
  • #38
Adding counts and measurement times in a bin is the right approach; it is the best you can do with the given information. Making a bin includes the assumption that the value doesn't change much within the bin (otherwise the bin is too wide), and in that case you can just add counts and times. If you want to be fancy with the x-values, you can take the weighted average of the x-values going into that bin as the x-value (with the measurement times as weights), but with bins as fine as in your example this shouldn't matter. A sketch of this binning is below.
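A rough Python sketch of that procedure (the function and variable names and the fixed bin width are assumptions for illustration):

```python
import numpy as np

def bin_spectrum(wavenumber, counts, times, bin_width):
    """Sum counts and measurement times per wavenumber bin; return the
    time-weighted mean wavenumber, the rate, and its sqrt(N)/t uncertainty."""
    order = np.argsort(wavenumber)                  # sort by increasing wavenumber
    x, n, t = wavenumber[order], counts[order], times[order]

    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)

    N_tot = np.bincount(idx, weights=n)             # total counts per bin
    t_tot = np.bincount(idx, weights=t)             # total time per bin
    x_sum = np.bincount(idx, weights=x * t)         # for the time-weighted mean x

    keep = t_tot > 0                                # skip empty bins
    rate = N_tot[keep] / t_tot[keep]
    rate_err = np.sqrt(N_tot[keep]) / t_tot[keep]
    x_mean = x_sum[keep] / t_tot[keep]
    return x_mean, rate, rate_err
```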
The calculations of the uncertainties in the previous posts are good, too.

Bin the measurements after sorting them by increasing wavenumber, of course.

A direct one-dimensional likelihood fit to 10,000 or even 100,000 data points shouldn't be an issue, by the way, unless your degrees of freedom are really excessive.
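For reference, a minimal sketch of what such an unbinned Poisson likelihood fit could look like in Python, here for a single Voigt peak on a flat background (the parameter names and starting values are placeholders, not from the thread):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import voigt_profile

def neg_log_likelihood(params, x, counts, times):
    """Poisson negative log-likelihood; expected counts per measurement are
    mu_i = (A * Voigt(x_i - x0; sigma, gamma) + background) * t_i."""
    amp, x0, sigma, gamma, bkg = params
    mu = (amp * voigt_profile(x - x0, sigma, gamma) + bkg) * times
    mu = np.clip(mu, 1e-12, None)             # guard against log(0)
    return np.sum(mu - counts * np.log(mu))   # constant log(n!) terms dropped

# x, counts, times: the raw per-measurement data (unbinned).
# start = [amplitude, centre, Gaussian sigma, Lorentzian gamma, background rate]
# result = minimize(neg_log_likelihood, start, args=(x, counts, times),
#                   method="Nelder-Mead")
```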
 
