How should I bin my data?

• I

Main Question or Discussion Point

Hi! I have some measurements of the rate of a physical process versus energy. For each energy I have a number of counts and a measurement time associated to it. However, the step (in energy) at which the measurements are made is very small and also the measurement time is small, hence just plotting a rate vs energy for each measured energy doesn't really work (many energy entries also have zero counts).

I would like to bin the data such that to obtain a reasonable shape of the spectrum for a fit (I also have something of the order 10,000 measurements, so doing directly a log-likelihood would not work). Assuming that the error on the x axis (energy) is negligible and the errors are Poisson, how should I choose the right bin size?

Also, how should I choose the x and y for each bin (to use for the fit)? For the y, I think the right way to do is to add all the counts in that bin (N) and all the measuring time (T) and give the rate (y value) as N/T with the error $\sqrt{N}/T$. For x axis, I could just use the average of the energy values in that bin. However, for some energies the measuring time is higher than for others. Should I use a time weighted average in order to get the x of the bin? Should I just use the center of the bin?

Thank you!

Related Set Theory, Logic, Probability, Statistics News on Phys.org
gleem
I need a bit of a clarification. When you make a measurement at an energy you do in fact make the measurement over a small range of energies. So are the measurements contiguous or are their gaps between the measurements? Are the ranges in energies for a given energy the same for all energies?

jedishrfu
Mentor
One way, could be to decide on the granularity in the graph you want to make.

Say for the x-axis you want things to the nearest .01 in value so binning along x would be:

Python:
x =9.867135467;   // some arbitrary x value

new_x = round(x*100.0f) / 100.0f;

print("new_x="+str(new_x))
Similarly for y.

I need a bit of a clarification. When you make a measurement at an energy you do in fact make the measurement over a small range of energies. So are the measurements contiguous or are their gaps between the measurements? Are the ranges in energies for a given energy the same for all energies?
The energies are not equally spaced. We actually use a laser frequency to measure the energy of a transition and there is a bit of a jitter in the exact laser wavelength. Ideally we want to scan within a given range in equal steps, but in practice this is not possible. However we have very good control over the actual value of the laser at any given time i.e. we know exactly what energy we are testing, even if we can't always make it be the value we want. Long story short, we scan energies in a given range, but the steps are irregular. Thank you!

Are the scanning energies disjoint or do they overlap?

Are the scanning energies disjoint or do they overlap?
Here is an example of what I am trying to do. Assume I have the following measurements, showing energy (in some arbitrary units), number of counts at that energy and time of measurement.
$10, 1, 0.01$
$22, 2, 0.02$
$29, 0, 0.02$
$35, 2, 0.01$
In my case, the energies are very close to each other (the energy increment is a few MHz, we are using a tunable laser), but here I assume they are 10, 22, 29, 35 in some arbitrary units for simplicity. However, the numbers in the time and counts columns are actually similar to what I have. These are exact values, not the result of some binning. I want to bin this data. The data is pretty smooth so doing a reasonable binning shouldn't remove interesting features, but it would make the fit easier. Also another reason is just for visualization purposes, as right now I can't really see any peak. My questions is, how can I create a bin from these 4 points (in my case it will be maybe 50 or 100 points per bin, but the idea should be the same). How would I calculate the x and y of a bin containing these 4 data points?

The rates are 100, 100, 0, 200 so the rate in the new bin, containing all these data points would be the sum which is 400. But I assume I would have to divide by the number of points, which is 4, so I would get 100 counts per second, which seems reasonable, but I am not sure if it's right. Also, would the center of the bin be the average of these 4 energies? So on my new plot (rate vs energy) i would put a point at (24,100)? Lastly, what would be the error on the y axis, after I put everything in one bin.

Please let me know if you need further details and thank you for help!

gleem
To express a rate with so few counts would seem to me to create huge errors in the data even if you binned them substantially. Not knowing what your spectrum is suppose to look like it is hard to recommend a procedure. Are the times of measurement a pulse width and can they be increased?

It would be helpful to know the experimental procedure.

To express a rate with so few counts would seem to me to create huge errors in the data even if you binned them substantially. Not knowing what your spectrum is suppose to look like it is hard to recommend a procedure. Are the times of measurement a pulse width and can they be increased?

It would be helpful to know the experimental procedure.
Thank you for your reply! Th experiment is done, we are at the analysis part, so we have to work with the data we have for now. However, here is the experimental procedure on short. We do resonant ionization on some atoms. So we shoot a laser at a tunable frequency to the atoms, then we shoot another laser that is supposed to ionize only the atoms who got excited in the first step (i.e. who were ionized). We then count the number of ionized particles as a function of the tunable frequency and we are supposed to fit the result with a Voigt peak. The second laser is fixed, but the first one has an adjustable frequency. However that frequency is not super stable, so even if we would like to measure for, say, 1 second at a fixed frequency, the real frequency might change on a scale of miliseconds. We can measure very well the value of this frequency, so even if it varies and we can't do much about it, we know its value at any moment. I attach below a small sample of the actual data that I need to work with (we have hundred of thousands of points per scan here are just 2000). The columns are measurement time in seconds, wavenumber in $cm^{-1}$ and counts. Please let me know if you need further details and thank you for help!

Attachments

• 146.5 KB Views: 71
gleem
So the first column is the time, the second is the energy, and the third is the counts?

gleem
Is the energy expressed in terms of frequency? is the second column expressed in GHz?

Is the energy expressed in terms of frequency? is the second column expressed in GHz?
The energy is expressed in $cm^{-1}$. And yes, it is time, energy, counts. Thank you!

I believe you need to do the following:
1. Divide the counts by measurement time to get rates (and the uncertainty in that rate number)
2. Choose a convenient energy bin size (you can always "bin the bins" if you want a bigger bite so make it fairly fine). Use the mean bin energy as the label. Sum the rates and RMS sum the rate uncertainties for each bin.
Does that do it?? I think so.

I believe you need to do the following:
1. Divide the counts by measurement time to get rates (and the uncertainty in that rate number)
2. Choose a convenient energy bin size (you can always "bin the bins" if you want a bigger bite so make it fairly fine). Use the mean bin energy as the label. Sum the rates and RMS sum the rate uncertainties for each bin.
Does that do it?? I think so.
Thank you for your reply! This looks promising. How should I define the error on individual rates. Assuming I have 2 counts in 0.01 seconds, I have 200 c/s, but what is the error? I can't really use $\sqrt{2}/0.01$, as the number of counts is too low to assume that the error is square root of number of counts.

I have 200 c/s, but what is the error? I can't really use √2/0.012/0.01\sqrt{2}/0.01, as the number of counts is too low to assume that the error is square root of number of counts.
That is the way to do it. When you bin 100 such measurements the rate will be x100 but the RMS error will be x10. If the errors are large then you must realize the counts are what they are! I'm open to suggestion but I don't see it any way out.

That is the way to do it. When you bin 100 such measurements the rate will be x100 but the RMS error will be x10. If the errors are large then you must realize the counts are what they are! I'm open to suggestion but I don't see it any way out.
Actually what I was saying was that I don't think that the error on 2 counts is $\sqrt{2}$. I remember we can use this formula when a Poisson distribution comes close enough to a Gaussian distribution, which is for about 10 counts or more. I am not sure what would be the right error in this case, tho.

How many counts will be in your "binned" data point? It will have to be something like 10 to give a curve. Absent more information I don't know what you can do. What is the result you really want to ascertain? The errors are important estimated guidelines with uncertainty attached. If you have a generic curve to fit, the errors can be mitigated, but not for a single datum.

How many counts will be in your "binned" data point? It will have to be something like 10 to give a curve. Absent more information I don't know what you can do. What is the result you really want to ascertain? The errors are important estimated guidelines with uncertainty attached. If you have a generic curve to fit, the errors can be mitigated, but not for a single datum.
Oh sorry for the missunderstanding. The bin will have a lot more than that (at least the bin size I would need), probably around 100. But you said: Divide the counts by measurement time to get rates (and the uncertainty in that rate number) I assumed you meant to get the uncertainty for each data point, before binning (which is 1-2 counts). This is the value I don't know how to calculate. I need to fit a Voigt profile to the data, after binning.

I understand your reticence but a single count is what it is. The error estimate for a single count is 100%. For the binned data (100 such points) the RMS error would then be 10%. This is true for counts and rates.....

I understand your reticence but a single count is what it is. The error estimate for a single count is 100%. For the binned data (100 such points) the RMS error would then be 10%. This is true for counts and rates.....
Thank you for your reply. I don't think that the error on 1 is right (I agree with the one on 100). If the error on 1 would be 1, it means there is a pretty big chance (2 sigma) to get -1 counts, which doesn't make sense. This is the reason why you can approximate the error as square root of the number of counts for 10 or more counts, as in that case it is very unlikely to get negative counts. In the case of 100, a negative count would be 10 sigma away, so that is a very good approximation. But for 1 counts it doesn't make sense to me. What do you think?

gleem
The uncertainty for a Poisson distribution is √n for any n. Since it is asymmetric you for small n you do cannot interpret the -2 sigma the same way.

A question about the time intervals what is their accuracy since the variations is quite broad.

You don't know, a priori, what the noise is. It may indeed correspond to "negative counts" and give you zero instead of one. Your point about the one sided distribution is not incorrect but the central limit theorem will take care of it when you bin.
Suppose you had somehow used a broader energy slice to perform the experiments....the binning would be preordained and the procedure using √N would be exactly consistent with the RMS sum suggested in #12.
Good?

gleem
In the sample data I counted about 90 events out of 2000 samples. I hope some of the data has more events. The increment in energy is exceedingly small and do not always change the same way or amount. What width do you expect your peaks to have ( in unbinned data points)10, 100,1000?

In the sample data I counted about 90 events out of 2000 samples. I hope some of the data has more events. The increment in energy is exceedingly small and do not always change the same way or amount. What width do you expect your peaks to have ( in unbinned data points)10, 100,1000?
@hutchphd @gleem thank you a lot for your help with this. So the frequency of the laser (the x axis) is supposed to increase in a fixed amount over a certain range. However the frequency of the laser is not very stable, so even if it is overall going up, it has lots of variations over small periods of time. This is the main reason why we do measurements at such small time intervals, to know exactly what frequency we are testing. Of course it the laser was stable for, say 1 second, we could measure the counts every one second, and would get much better statistics from one measurement. But because of this laser instability, this is the data I have. However, you can assume that the values of time, counts and wavenumber are right. The percentage of number of events per measured data is more or less what you got, many measurements give zero counts, but of course we have more than 2000 events per one scan range. I could share one scan file i.e. an actual measurement from beginning to the end of a scan, but the .txt file is too large and the .h5 compression is not allowed here. Let me know if there is a way to share it here (the .txt has 30 MB). I attached below the plot obtained from that scan (which is also the scan I shared the 2000 data points from). The binning is not really right: for each bin I calculated the total number of events and divided by the total measurement time in that bin. This is obviously wrong as the time needs to be normalized, but it is just for visual purposes (after all the main issue here is that I don't know how to bin properly). In this case the bin size is $0.0006671 cm^{-1}$, which corresponds to 20 MHz (which is more or less the bin size we want). As you can see there is a small peak then a bigger one between 0.3 and 0.4, which is where we expect them to be from theory. Of course, we have multiple scans of this region, so we will combine multiple scans and the peaks will become more clear, but the binning procedure (which I need help for) should be the same no matter how much data I have.

So long story short, I need to make a proper binning of the data I show below (this binning is wrong, but might help you understand the data), then I need to fit this with 2 Voigt profiles (or something more complicated, depending on some theoretical parameters) and my main question is how to do the binning.

Thank you again for help, and please let me know if you need more details.

Data looks good to me. Your choice of fit should probably be driven by the number you are most interested in.....e.g. do you want the splitting or the absolute number etc etc). Also the way you fit the data will likely not be very sensitive to your binning choice (within reason). You should check that in the end. Enjoy.

gleem
My impression of the spectrum you have shown.

How many measurements were binned?

what are the widths of the peaks in cm-1 than you are looking for?

Very noisy. Error bars may not reflect (underestimate) the random fluctuations in the data unless this is a structured noise. You say you expect to fit two or more peaks. The data is pretty crowded so it is hard to distinguish peak like structures but their may be five of six if you error bar are reasonable (which I am skeptical of).

Calculating a rate from the summed events in the bin divided by the time interval of the bin is IMO acceptable.

Sorry about all the questions but I am not use to this type of data and I need to understand what is going on.

If you let the laser run for 0.1 second or 1 second how much would the wave number change during this time. This instability is just averaging your energy during the data acquisition and is in fact doing the binning for you.