How to Bin Data for Spectrum Fitting with Poisson Errors?

  • #1
kelly0303
Hi! I have some measurements of the rate of a physical process versus energy. For each energy I have a number of counts and a measurement time associated with it. However, the step (in energy) at which the measurements are made is very small and the measurement time is also small, so simply plotting a rate vs. energy for each measured energy doesn't really work (many energy entries also have zero counts).

I would like to bin the data so as to obtain a reasonable shape of the spectrum for a fit (I also have something on the order of 10,000 measurements, so doing a log-likelihood fit directly on the raw data would not work). Assuming that the error on the x-axis (energy) is negligible and the errors are Poisson, how should I choose the right bin size?

Also, how should I choose the x and y for each bin (to use for the fit)? For the y, I think the right way is to add all the counts in that bin (N) and all the measuring time (T) and give the rate (y value) as N/T with the error ##\sqrt{N}/T##. For the x axis, I could just use the average of the energy values in that bin. However, for some energies the measuring time is longer than for others. Should I use a time-weighted average to get the x of the bin, or should I just use the center of the bin?
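To make that concrete in symbols (this is just my proposal written out, I am not sure it is correct): for a bin containing measurements with counts ##N_i##, measuring times ##T_i## and energies ##E_i##, I would take
##y_{bin} = \frac{\sum_i N_i}{\sum_i T_i}, \qquad \sigma_y = \frac{\sqrt{\sum_i N_i}}{\sum_i T_i},##
and for the bin position either the plain average ##\bar{E} = \frac{1}{n}\sum_i E_i##, the time-weighted average ##\bar{E}_w = \frac{\sum_i T_i E_i}{\sum_i T_i}##, or simply the bin center.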

Thank you!
 
  • #2
I need a bit of clarification. When you make a measurement at an energy you do in fact make the measurement over a small range of energies. So are the measurements contiguous, or are there gaps between the measurements? Are the ranges in energy for a given measurement the same for all energies?
 
  • #3
One way could be to decide on the granularity you want in the graph you are making.

Say for the x-axis you want things to the nearest 0.01 in value; then binning along x would be:

Python:
x = 9.867135467   # some arbitrary x value

new_x = round(x * 100.0) / 100.0

print("new_x=" + str(new_x))

Similarly for y.
 
  • #4
gleem said:
I need a bit of clarification. When you make a measurement at an energy you do in fact make the measurement over a small range of energies. So are the measurements contiguous, or are there gaps between the measurements? Are the ranges in energy for a given measurement the same for all energies?
The energies are not equally spaced. We actually use a laser frequency to measure the energy of a transition, and there is a bit of jitter in the exact laser wavelength. Ideally we want to scan a given range in equal steps, but in practice this is not possible. However, we have a very good measurement of the actual laser value at any given time, i.e. we know exactly what energy we are testing, even if we can't always make it the value we want. Long story short, we scan energies in a given range, but the steps are irregular. Thank you!
 
  • #5
Are the scanning energies disjoint or do they overlap?
 
  • #6
hutchphd said:
Are the scanning energies disjoint or do they overlap?
Here is an example of what I am trying to do. Assume I have the following measurements, showing energy (in some arbitrary units), number of counts at that energy and time of measurement.
##10, 1, 0.01##
##22, 2, 0.02##
##29, 0, 0.02##
##35, 2, 0.01##
In my case the energies are very close to each other (the energy increment is a few MHz; we are using a tunable laser), but here I assume they are 10, 22, 29, 35 in some arbitrary units for simplicity. However, the numbers in the time and counts columns are similar to what I actually have. These are exact values, not the result of some binning. I want to bin this data. The data is pretty smooth, so a reasonable binning shouldn't remove interesting features, but it would make the fit easier. Another reason is just visualization, as right now I can't really see any peak. My question is: how do I create a bin from these 4 points (in my case it will be maybe 50 or 100 points per bin, but the idea should be the same)? How would I calculate the x and y of a bin containing these 4 data points?

The rates are 100, 100, 0, 200, so the rate in the new bin containing all these data points would be the sum, which is 400. But I assume I would have to divide by the number of points, which is 4, so I would get 100 counts per second. That seems reasonable, but I am not sure if it's right. Also, would the center of the bin be the average of these 4 energies? So on my new plot (rate vs. energy) I would put a point at (24, 100)? Lastly, what would be the error on the y axis after I put everything in one bin?

Please let me know if you need further details and thank you for help!
 
  • #7
To express a rate with so few counts would seem to me to create huge errors in the data, even if you binned them substantially. Not knowing what your spectrum is supposed to look like, it is hard to recommend a procedure. Are the times of measurement a pulse width, and can they be increased?

It would be helpful to know the experimental procedure.
 
  • #8
gleem said:
To express a rate with so few counts would seem to me to create huge errors in the data, even if you binned them substantially. Not knowing what your spectrum is supposed to look like, it is hard to recommend a procedure. Are the times of measurement a pulse width, and can they be increased?

It would be helpful to know the experimental procedure.
Thank you for your reply! The experiment is done and we are at the analysis stage, so we have to work with the data we have for now. However, here is the experimental procedure in short. We do resonant ionization on some atoms: we shoot a laser at a tunable frequency at the atoms, then we shoot another laser that is supposed to ionize only the atoms that were excited in the first step. We then count the number of ionized particles as a function of the tunable frequency, and we are supposed to fit the result with a Voigt peak. The second laser is fixed, but the first one has an adjustable frequency. However, that frequency is not very stable, so even if we would like to measure for, say, 1 second at a fixed frequency, the real frequency might change on a scale of milliseconds. We can measure the value of this frequency very well, so even if it varies and we can't do much about it, we know its value at any moment. I attach below a small sample of the actual data that I need to work with (we have hundreds of thousands of points per scan; here are just 2000). The columns are measurement time in seconds, wavenumber in ##cm^{-1}## and counts. Please let me know if you need further details and thank you for the help!
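In case it is useful, here is a minimal way I read a file in that format (assuming plain whitespace-separated columns, which is how the sample looks; pass delimiter="," to loadtxt instead if it is comma-separated):
Python:
import numpy as np

# assumed format: whitespace-separated columns
#   measurement time [s]   wavenumber [cm^-1]   counts
time, wavenumber, counts = np.loadtxt("sample_data.txt", unpack=True)

# raw per-measurement rate in counts per second
rate = counts / time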
 

Attachments

  • sample_data.txt (146.5 KB)
  • #9
So the first column is the time, the second is the energy, and the third is the counts?
 
  • #10
Is the energy expressed in terms of frequency? Is the second column expressed in GHz?
 
  • #11
gleem said:
Is the energy expressed in terms of frequency? Is the second column expressed in GHz?
The energy is expressed in ##cm^{-1}##. And yes, it is time, energy, counts. Thank you!
 
  • #12
I believe you need to do the following:
  1. Divide the counts by measurement time to get rates (and the uncertainty in that rate number)
  2. Choose a convenient energy bin size (you can always "bin the bins" if you want a bigger bite, so make it fairly fine). Use the mean bin energy as the label. Sum the rates and RMS-sum the rate uncertainties for each bin.
Does that do it? I think so.
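A rough sketch of what I mean (my own illustration; the toy numbers and bin edges are arbitrary, and the data is assumed to already be in NumPy arrays):
Python:
import numpy as np

# toy per-measurement data: time [s], energy [arb. units], counts
time = np.array([0.01, 0.02, 0.02, 0.01])
energy = np.array([10.0, 22.0, 29.0, 35.0])
counts = np.array([1, 2, 0, 2])

# 1. rates and their Poisson uncertainties for each measurement
rate = counts / time
rate_err = np.sqrt(counts) / time

# 2. choose energy bin edges, label each bin by its mean energy,
#    sum the rates and RMS-sum the rate uncertainties
edges = np.array([0.0, 20.0, 40.0])
which_bin = np.digitize(energy, edges) - 1

for b in range(len(edges) - 1):
    sel = which_bin == b
    if not sel.any():
        continue
    e_label = energy[sel].mean()
    r_sum = rate[sel].sum()
    r_err = np.sqrt((rate_err[sel] ** 2).sum())
    print(f"bin {b}: E = {e_label:.1f}, rate = {r_sum:.0f} +/- {r_err:.0f} cps")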
 
  • #13
hutchphd said:
I believe you need to do the following:
  1. Divide the counts by measurement time to get rates (and the uncertainty in that rate number)
  2. Choose a convenient energy bin size (you can always "bin the bins" if you want a bigger bite so make it fairly fine). Use the mean bin energy as the label. Sum the rates and RMS sum the rate uncertainties for each bin.
Does that do it?? I think so.
Thank you for your reply! This looks promising. How should I define the error on the individual rates? Assuming I have 2 counts in 0.01 seconds, I have 200 c/s, but what is the error? I can't really use ##\sqrt{2}/0.01##, as the number of counts is too low to assume that the error is the square root of the number of counts.
 
  • #14
kelly0303 said:
I have 200 c/s, but what is the error? I can't really use ##\sqrt{2}/0.01##, as the number of counts is too low to assume that the error is the square root of the number of counts.
That is the way to do it. When you bin 100 such measurements the rate will be ×100 but the RMS error will be ×10. If the errors are large then you must realize the counts are what they are! I'm open to suggestions, but I don't see any way out.
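Spelled out (my own illustration of the scaling): if each of the 100 measurements has a rate ##r_i \approx r## with uncertainty ##\sigma_i \approx \sigma##, then
##\sum_{i=1}^{100} r_i \approx 100\,r, \qquad \sqrt{\sum_{i=1}^{100} \sigma_i^2} \approx 10\,\sigma,##
so the summed rate grows by a factor of 100 while its RMS error grows only by a factor of 10, i.e. the relative error shrinks tenfold.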
 
  • #15
hutchphd said:
That is the way to do it. When you bin 100 such measurements the rate will be ×100 but the RMS error will be ×10. If the errors are large then you must realize the counts are what they are! I'm open to suggestions, but I don't see any way out.
Actually, what I was saying is that I don't think the error on 2 counts is ##\sqrt{2}##. I remember we can use this formula when a Poisson distribution comes close enough to a Gaussian distribution, which happens for about 10 counts or more. I am not sure what the right error would be in this case, though.
 
  • #16
How many counts will be in your "binned" data point? It will have to be something like 10 to give a curve. Absent more information I don't know what you can do. What is the result you really want to ascertain? The errors are important estimated guidelines with uncertainty attached. If you have a generic curve to fit, the errors can be mitigated, but not for a single datum.
 
  • #17
hutchphd said:
How many counts will be in your "binned" data point? It will have to be something like 10 to give a curve. Absent more information I don't know what you can do. What is the result you really want to ascertain? The errors are important estimated guidelines with uncertainty attached. If you have a generic curve to fit, the errors can be mitigated, but not for a single datum.
Oh, sorry for the misunderstanding. The bin will have a lot more than that (at least for the bin size I would need), probably around 100. But you said: "Divide the counts by measurement time to get rates (and the uncertainty in that rate number)." I assumed you meant to get the uncertainty for each data point, before binning (which is 1-2 counts). This is the value I don't know how to calculate. I need to fit a Voigt profile to the data after binning.
 
  • #18
I understand your reticence but a single count is what it is. The error estimate for a single count is 100%. For the binned data (100 such points) the RMS error would then be 10%. This is true for counts and rates...
 
  • #19
hutchphd said:
I understand your reticence but a single count is what it is. The error estimate for a single count is 100%. For the binned data (100 such points) the RMS error would then be 10%. This is true for counts and rates...
Thank you for your reply. I don't think that the error on 1 count is right (I agree with the one on 100). If the error on 1 were 1, it would mean there is a pretty big chance (2 sigma) of getting -1 counts, which doesn't make sense. This is the reason why you can approximate the error as the square root of the number of counts for 10 or more counts, as in that case it is very unlikely to get negative counts. In the case of 100, a negative count would be 10 sigma away, so that is a very good approximation. But for 1 count it doesn't make sense to me. What do you think?
 
  • #20
The uncertainty for a Poisson distribution is ##\sqrt{n}## for any n. Since the distribution is asymmetric, for small n you cannot interpret the -2 sigma limit in the same way.

A question about the time intervals: what is their accuracy, since their variation is quite broad?
 
  • #21
You don't know, a priori, what the noise is. It may indeed correspond to "negative counts" and give you zero instead of one. Your point about the one-sided distribution is not incorrect, but the central limit theorem will take care of it when you bin.
Suppose you had somehow used a broader energy slice to perform the experiments: the binning would be preordained, and the procedure using ##\sqrt{N}## would be exactly consistent with the RMS sum suggested in #12.
Good?
 
  • #22
In the sample data I counted about 90 events out of 2000 samples. I hope some of the data has more events. The increment in energy is exceedingly small and does not always change by the same amount, or even in the same direction. What width do you expect your peaks to have (in unbinned data points): 10, 100, 1000?
 
  • #23
gleem said:
In the sample data I counted about 90 events out of 2000 samples. I hope some of the data has more events. The increment in energy is exceedingly small and does not always change by the same amount, or even in the same direction. What width do you expect your peaks to have (in unbinned data points): 10, 100, 1000?
@hutchphd @gleem thank you a lot for your help with this. The frequency of the laser (the x axis) is supposed to increase by a fixed amount over a certain range. However, the laser frequency is not very stable, so even though it is overall going up, it has lots of variation over short periods of time. This is the main reason why we do measurements at such small time intervals: to know exactly what frequency we are testing. Of course, if the laser were stable for, say, 1 second, we could measure the counts every second and would get much better statistics from one measurement. But because of this laser instability, this is the data I have. However, you can assume that the values of time, counts and wavenumber are right. The fraction of measurements with events is more or less what you found: many measurements give zero counts, but of course we have more than 2000 events per scan range.

I could share one scan file, i.e. an actual measurement from beginning to end of a scan, but the .txt file is too large and the .h5 format is not allowed here. Let me know if there is a way to share it (the .txt is about 30 MB). I attached below the plot obtained from that scan (which is also the scan I shared the 2000 data points from). The binning is not really right: for each bin I calculated the total number of events and divided by the total measurement time in that bin. This is obviously wrong as the time needs to be normalized, but it is just for visual purposes (after all, the main issue here is that I don't know how to bin properly). In this case the bin size is ##0.0006671 cm^{-1}##, which corresponds to 20 MHz (which is more or less the bin size we want). As you can see, there is a small peak and then a bigger one between 0.3 and 0.4, which is where we expect them to be from theory. Of course, we have multiple scans of this region, so we will combine them and the peaks will become clearer, but the binning procedure (which I need help with) should be the same no matter how much data I have.
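For reference, the conversion between the bin width in wavenumber and in frequency is just (my own arithmetic):
##\Delta\nu = c\,\Delta\tilde{\nu} \approx (3\times 10^{10}\ \mathrm{cm/s})\times(0.0006671\ \mathrm{cm^{-1}}) \approx 2\times 10^{7}\ \mathrm{Hz} = 20\ \mathrm{MHz}.##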

So long story short, I need to make a proper binning of the data I show below (this binning is wrong, but might help you understand the data), then I need to fit this with 2 Voigt profiles (or something more complicated, depending on some theoretical parameters) and my main question is how to do the binning.

Thank you again for help, and please let me know if you need more details.
plot.png
 
  • #24
Data looks good to me. Your choice of fit should probably be driven by the number you are most interested in (e.g. do you want the splitting or the absolute number, etc.). Also, the way you fit the data will likely not be very sensitive to your binning choice (within reason). You should check that in the end. Enjoy.
 
  • #25
My impression of the spectrum you have shown.

How many measurements were binned?

What are the widths of the peaks (in ##cm^{-1}##) that you are looking for?

Very noisy. The error bars may not reflect (i.e. may underestimate) the random fluctuations in the data, unless this is structured noise. You say you expect to fit two or more peaks. The data is pretty crowded, so it is hard to distinguish peak-like structures, but there may be five or six if your error bars are reasonable (which I am skeptical of).

Calculating a rate from the summed events in the bin divided by the time interval of the bin is IMO acceptable.
Sorry about all the questions, but I am not used to this type of data and I need to understand what is going on. If you let the laser run for 0.1 second or 1 second, how much would the wavenumber change during this time? This instability is just averaging your energy during the data acquisition and is in fact doing the binning for you.
 
  • #26
gleem said:
My impression of the spectrum you have shown.

How many measurements were binned?

What are the widths of the peaks (in ##cm^{-1}##) that you are looking for?

Very noisy. The error bars may not reflect (i.e. may underestimate) the random fluctuations in the data, unless this is structured noise. You say you expect to fit two or more peaks. The data is pretty crowded, so it is hard to distinguish peak-like structures, but there may be five or six if your error bars are reasonable (which I am skeptical of).

Calculating a rate from the summed events in the bin divided by the time interval of the bin is IMO acceptable.
Sorry about all the questions, but I am not used to this type of data and I need to understand what is going on. If you let the laser run for 0.1 second or 1 second, how much would the wavenumber change during this time? This instability is just averaging your energy during the data acquisition and is in fact doing the binning for you.
@gleem @hutchphd Thank you for your reply! The plot might look good, but the binning is wrong. As I said, for each bin I divided the total number of counts by the total measuring time in that bin. Also, the error on each point is the square root of the number of counts divided by the time (the time is not under the square root). But this is wrong. This works only if the measuring time is the same for each point, which it is not. For example (a bit exaggerated, but it illustrates the point), if we have 1000 counts in 10 seconds and 200 counts in 20 seconds, adding the rates I get ##100+10=110##, which is right, i.e. if I were to count for 1 second I would expect 110 counts. But the way I did it gives ##(1000+200)/(10+20)=40##, which is nowhere close to the right value. So my approach was just for visual purposes, but it is very wrong. I also found this post, somewhat related to my question: https://www.physicsforums.com/threads/bin-center-location.978600/ Everyone there seems to agree that this approach of adding the counts and dividing by total time, for each bin, is fundamentally wrong. So my initial question is still open: what is the right way to combine the data, and what is the error on each of these new data points?

@gleem For this plot I used 372562 data points. Most of the scans have more data than this. Also, yes, as I said, I am not sure at all about the error on the points; this is also part of my original question. The fitting procedure is a bit more complicated. You are right that there are more than 2 peaks, but they should become visible once multiple scans are combined. The fit will be done using a theoretical template, from which we know how many peaks to expect; we also know that some peaks (from rotational excitations, for example) should be equally spaced, which constrains a lot of the parameters. But this comes later; for now I still don't know how to properly bin the data. The width of the peaks, I would say, is between 0.01 and 0.1 ##cm^{-1}##, depending on the type of transition.

About the laser: the original idea was to increase the frequency by ##0.06 cm^{-1}## per second. But it turns out that changing the laser frequency is not as smooth as we expected, so going from one step to the next is not smooth at all and we get a lot of jitter. It's a bit difficult to predict how the laser will change, given that we force it to change but don't have great control over the actual tuning mechanism. So in the end we decided to just measure the frequency very often, to at least know what is going on. We might rerun the experiment later with a much more stable laser, but for now this is the only data available.

Thank you and let me know if I can provide further information.
 
  • #27
kelly0303 said:
Everyone there seems to agree that this approach of adding the counts and dividing by total time, for each bin, is fundamentally wrong. So my initial question is still open, what is the right way to combine the data and what is the error on each of these new data points?

I believe this was asked previously and answered (correctly) in #12. Why are you doing it differently?
 
  • #28
kelly0303 said:
The width of the peaks, I would say, is between 0.01 and 0.1 ##cm^{-1}##, depending on the type of transition.

This should tell you how many data points to bin. You want enough points in the peak to define it well enough to fit it with reasonably accurate parameters. So if your energy increment is 0.0006 ##cm^{-1}## and the narrowest peak width is 0.01 ##cm^{-1}##, then, assuming that 6 binned points (but not fewer than 4) within the full width at half maximum are adequate to define the peak, you will need 6 bins (but not fewer than 4) to span the peak, with about 3 raw data points (not more than 4) in each bin.

Maybe this is not important, but since your energy increments are not the same and sometimes increase and sometimes decrease, you should first go through the data and sort it so the energies are always increasing.

Assign a bin energy equal to the average of the energies within it. And of course you need to normalize for the time intervals, since they vary too much.
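As a rough check of those numbers (my own arithmetic from the figures quoted above):
##\frac{0.01\ \mathrm{cm^{-1}}}{0.0006\ \mathrm{cm^{-1}}} \approx 17\ \text{raw points per FWHM}, \qquad \frac{17\ \text{points}}{6\ \text{bins}} \approx 3\ \text{points per bin}.##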
 
  • #29
hutchphd said:
I believe this was asked previously and answered (correctly) in #12. Why are you doing it differently?
Sorry for not coming back to that. I tried it, but it seems I have some data points (random fluctuations, I guess) for which the measurement time is very small (around 0.001 seconds) but there is 1 count in that time. This gives a huge rate, making that bin significantly bigger than the others, even though there is no way for that to be the case physically. Should I somehow do some time weighting? Thank you!
 
  • #30
gleem said:
This should tell you how many data points to bin. You want enough points in the peak to define it well enough to fit it with reasonably accurate parameters. So if your energy increment is 0.0006 ##cm^{-1}## and the narrowest peak width is 0.01 ##cm^{-1}##, then, assuming that 6 binned points (but not fewer than 4) within the full width at half maximum are adequate to define the peak, you will need 6 bins (but not fewer than 4) to span the peak, with about 3 raw data points (not more than 4) in each bin.

Maybe this is not important, but since your energy increments are not the same and sometimes increase and sometimes decrease, you should first go through the data and sort it so the energies are always increasing.

Assign a bin energy equal to the average of the energies within it. And of course you need to normalize for the time intervals, since they vary too much.
Thank you for your reply. How should I do the time normalization? Assuming I have 1 count in 0.01 seconds and 1 count in 0.02 seconds, should I normalize by taking 2 counts in 0.02 seconds and 1 count in 0.02 seconds, getting an average of 1.5 counts in 0.02 seconds? I am not sure I can just multiply the counts by 2 (or any integer). That 1 count is the result of a Poisson process, so I am not sure that having 1 count in 0.01 seconds implies 2 in 0.02 seconds.
 
  • #31
kelly0303 said:
Thank you for your reply. How should I do the time normalization? Assuming I have 1 count in 0.01 seconds and 1 count in 0.02 seconds, should I normalize by taking 2 counts in 0.02 seconds and 1 count in 0.02 seconds, getting an average of 1.5 counts in 0.02 seconds? I am not sure I can just multiply the counts by 2 (or any integer). That 1 count is the result of a Poisson process, so I am not sure that having 1 count in 0.01 seconds implies 2 in 0.02 seconds.

This is a problem with low or zero counts. For zero counts, does that mean actually zero counts, or that you didn't count long enough? In calculating a rate from very low counts over a short time you have a large error. One count in 0.01 seconds gives a rate of 100 cps ± 100 cps, as you know, and adding such data only makes the total error worse. That is why I think adding the counts in a bin and dividing by the total time for that bin is better (and more correct). The more points in a bin, the more confidence you have in the numbers. IMO.

In your example of adding rates in a bin vs. calculating a rate from the total bin counts divided by the total time, you ignored that the latter method produces an average count rate for the bin. To compare properly, you need to average the former by dividing the summed rates by the number of them. Thus in your example you should compare 55 to 40. But you see they still do not agree. So what is the problem? Well, first you assumed that the two rates were partially concurrent: 100 cps happening at the same time as 10 cps for 10 sec. In fact this is not true. They are sequential: 100 cps for 10 sec followed by 10 cps for 20 sec, so the whole interval is 30 sec. How do you average this? You take the time-weighted average: 100 cps for 1/3 of the time and 10 cps for 2/3 of the time. Summing these you get 40 cps. This is the same reasoning you would use for averaging the rate of a pulsed signal.
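Written out with the numbers from that example (my own arithmetic):
##r_{weighted} = \frac{10}{30}\times 100\ \mathrm{cps} + \frac{20}{30}\times 10\ \mathrm{cps} = 40\ \mathrm{cps} = \frac{1000 + 200}{10 + 20}\ \mathrm{cps},##
so the time-weighted average of the rates is exactly the total counts divided by the total time.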
 
  • #32
kelly0303 said:
Sorry for not coming back to that. I tried it, but it seems I have some data points (random fluctuations, I guess) for which the measurement time is very small (around 0.001 seconds) but there is 1 count in that time. This gives a huge rate, making that bin significantly bigger than the others, even though there is no way for that to be the case physically. Should I somehow do some time weighting? Thank you!
I did wonder! Thanks.
If the data you show is representative, you could first just sum ~10 successive measurements, without worrying about the error, and report the total time, total counts and average energy. Then do #12 as described. Any inaccuracy will be negligible (unless the data is very different).
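A minimal sketch of that pre-summing step (my own illustration; the group size of 10 and the names are arbitrary, and the raw data is assumed to be in NumPy arrays):
Python:
import numpy as np

def presum(time, energy, counts, group=10):
    """Merge every `group` successive raw measurements into one coarser point.

    Returns the total time, average energy and total counts per merged point.
    """
    n = (len(time) // group) * group        # drop the incomplete tail
    t = time[:n].reshape(-1, group).sum(axis=1)
    e = energy[:n].reshape(-1, group).mean(axis=1)
    c = counts[:n].reshape(-1, group).sum(axis=1)
    return t, e, c

# The merged points can then be binned as in #12,
# with rate = c / t and rate_err = np.sqrt(c) / t per merged point.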
 
  • #33
gleem said:
This is a problem with low or zero counts. For zero counts, does that mean actually zero counts, or that you didn't count long enough? In calculating a rate from very low counts over a short time you have a large error. One count in 0.01 seconds gives a rate of 100 cps ± 100 cps, as you know, and adding such data only makes the total error worse. That is why I think adding the counts in a bin and dividing by the total time for that bin is better (and more correct). The more points in a bin, the more confidence you have in the numbers. IMO.

In your example of adding rates in a bin vs. calculating a rate from the total bin counts divided by the total time, you ignored that the latter method produces an average count rate for the bin. To compare properly, you need to average the former by dividing the summed rates by the number of them. Thus in your example you should compare 55 to 40. But you see they still do not agree. So what is the problem? Well, first you assumed that the two rates were partially concurrent: 100 cps happening at the same time as 10 cps for 10 sec. In fact this is not true. They are sequential: 100 cps for 10 sec followed by 10 cps for 20 sec, so the whole interval is 30 sec. How do you average this? You take the time-weighted average: 100 cps for 1/3 of the time and 10 cps for 2/3 of the time. Summing these you get 40 cps. This is the same reasoning you would use for averaging the rate of a pulsed signal.
Thanks a lot for this! So the approaches are equivalent, as long as I use a time-weighted average and not just a simple average. What is the error on the rate in this case? If I add the counts and divide by the time, I would get ##\sqrt{1200}/30## as the rate error. If I do the time-weighted average, I get ##\sqrt{1000}/10 \cdot 1/3 + \sqrt{200}/20 \cdot 2/3 = (\sqrt{1000}+\sqrt{200})/30##. Which one is the right error? Thank you!
 
  • #34
Your error calculations are incorrect. For rates the uncertainty is ##\sqrt{rate/time}##.
EDIT: sorry, you are correct here.

The error of the weighted average of the rates is ##\sqrt{(1/3)^2\sigma_{r_1}^{2} + (2/3)^2\sigma_{r_2}^2}##, using propagation of errors for the function ##r_1/3 + 2r_2/3##, which gives the same result as ##\sqrt{40/30} = 1.15## cps.
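Plugging the numbers from the example into this as a check (my own arithmetic), with ##\sigma_{r_1} = \sqrt{1000}/10## and ##\sigma_{r_2} = \sqrt{200}/20##:
##\sqrt{\left(\tfrac{1}{3}\right)^2\frac{1000}{100} + \left(\tfrac{2}{3}\right)^2\frac{200}{400}} = \sqrt{\tfrac{10}{9} + \tfrac{2}{9}} = \sqrt{\tfrac{4}{3}} \approx 1.15\ \mathrm{cps} = \frac{\sqrt{1200}}{30},##
which is the same as the error obtained from the summed counts divided by the total time for the whole interval.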
 
  • #35
gleem said:
Your error calculations are incorrect. For rates the uncertainty is ##\sqrt{rate/time}##.
EDIT: sorry, you are correct here.

The error of the weighted average of the rates is ##\sqrt{(1/3)^2\sigma_{r_1}^{2} + (2/3)^2\sigma_{r_2}^2}##, using propagation of errors for the function ##r_1/3 + 2r_2/3##, which gives the same result as ##\sqrt{40/30} = 1.15## cps.
Sorry, is the first formula correct? Taking the square root of the time would mean that the error on the rate doesn't have the same units as the rate itself.
 
