How accurate are the peak values from different binning sizes?

  • Thread starter kelly0303
  • Tags: Bin Data
  • #1
kelly0303
Hello! I am working on a spectroscopy project in which we adjust the wavelength of a laser and record counts on a detector from laser-atom interactions. The data have the form ##(\lambda, dt, dN)##, where ##dt## is a time interval, ##\lambda## is the laser wavelength used in that interval, and ##dN## is the number of events recorded in it. I need to make a plot of the event rate (##\frac{dN}{dt}##) vs wavelength and fit it with a Voigt profile. The wavelength is scanned over a long range, but each individual wavelength is scanned only for a short time, i.e. ##dt## is small, and the step between two consecutive wavelengths is small too. For example one entry could be ##(10000\ cm^{-1}, 0.01\ s, 2)## and the next one could be ##(10000.1\ cm^{-1}, 0.01\ s, 3)##.

I need a bit of help with how to do the fit properly and get a meaningful number for the peak of the Voigt profile. Given these numbers, it seems that I need to re-bin the data in frequency space (I might use frequency, wavelength or wavenumber interchangeably; what I mean is the x-axis, which in my case has units of ##cm^{-1}##, sorry for that). Is this a good thing to do? And how should I do the re-binning, given that I get slightly different results for each re-binning? Right now I have the value of the peak for several (15) different binnings, which are quite close, yet a bit different, for example ##11001.5 \pm 0.2## and ##11001.4 \pm 0.3##, where the error is given by the fitting program (I guess it is the standard deviation associated with the best estimate of the parameters, but I can check in more detail if needed; I use lmfit in python).

I was thinking of using the mean of these as the reported value, but I am not sure what to use for the error. These numbers are clearly not independent (i.e. the value of the peak when I double the bin size is not independent of the value before that, right?), so I can't just use ##\sigma/\sqrt{N}## for the error on the mean. Also, how should I take into account the error on each measurement (the ##0.2## and ##0.3## in my examples above)? Or should I try a totally different approach? Any suggestion would be greatly appreciated. Thank you!
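
To make the setup concrete, here is a minimal sketch in Python (with toy, randomly generated numbers standing in for the real scan) of one way to merge the raw ##(\lambda, dt, dN)## steps into wider wavenumber bins, convert them to rates with Poisson errors, and fit a single Voigt peak plus a flat background with lmfit. The bin width, the toy data, and the rebin helper are illustrative placeholders, not the actual analysis.

```python
import numpy as np
from lmfit.models import VoigtModel, ConstantModel

# Toy stand-in for the real scan: wavenumber steps, dwell times, counts per step
rng = np.random.default_rng(0)
wn = np.arange(10990.0, 11010.0, 0.1)                                # cm^-1
dt = np.full_like(wn, 0.01)                                          # s per step
true_rate = 50 + 400 * np.exp(-0.5 * ((wn - 11001.5) / 0.8) ** 2)    # counts/s (toy)
dN = rng.poisson(true_rate * dt)

def rebin(wn, dt, dN, bin_width):
    """Merge raw scan steps into wavenumber bins of the given width."""
    edges = np.arange(wn.min(), wn.max() + bin_width, bin_width)
    counts, _ = np.histogram(wn, bins=edges, weights=dN)
    times, _ = np.histogram(wn, bins=edges, weights=dt)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = times > 0
    rate = counts[keep] / times[keep]
    rate_err = np.sqrt(np.maximum(counts[keep], 1.0)) / times[keep]  # sqrt(N), propagated to the rate
    return centers[keep], rate, rate_err

x, y, yerr = rebin(wn, dt, dN, bin_width=0.5)        # 0.5 cm^-1 is just one choice

model = VoigtModel(prefix="v_") + ConstantModel(prefix="bkg_")
params = model.make_params()
params["v_center"].set(value=x[np.argmax(y)])
params["v_sigma"].set(value=0.5, min=0)
params["v_amplitude"].set(value=(y.max() - y.min()), min=0)          # rough area guess
params["bkg_c"].set(value=y.min(), min=0)
# Note: lmfit's VoigtModel ties gamma to sigma by default; it can be freed with
# params["v_gamma"].set(vary=True, expr="") if an independent Lorentzian width is needed.

result = model.fit(y, params, x=x, weights=1.0 / yerr)
print(result.params["v_center"].value, "+/-", result.params["v_center"].stderr)
```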
 
  • #2
I don't think there is any value in finding the mean of several peak values, since they are all derived from the same input data. I suggest that you just pick one. I suspect that the error estimate has a minimum at some intermediate bin size. If so, pick that one.
 
  • #3
tnich said:
I don't think there is any value in finding the mean of several peak values, since they are all derived from the same input data. I suggest that you just pick one. I suspect that the error estimate has a minimum at some intermediate bin size. If so, pick that one.
Thank you for your reply! I was thinking of just picking one value, too (using the one with the smallest error sounds like a good choice), but is that right? Binning means that you lose some information, which is reflected in the fact that you get slightly different values for different binnings. Would just one binning give a trustworthy result? Should I somehow add the bin width to the error on the value I get if I use just one binning?
 
  • #4
“Binning” (I assume that you are aggregating counts within a range of wavelengths) is equivalent to applying a low-pass filter to your data and then decimating. This filter isn’t very good for decimation (it has high sidelobes) so, unless you can be sure that the underlying signal is truly varying slowly compared to your bin intervals, you can be altering your data in subtle ways. If I were doing this, I’d fit to the raw data itself. You say there’s a lot of data, but that’s what computers are for.

Binning is fine for display purposes, however. In Matlab you’d use the hist command (histogram in the newest release) to both bin and plot a histogram of the data, then hold and overlay your best-fit Voigt curve (whatever that is...) in a different color.
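
In Python, which the original poster is using, the analogous display step might look roughly like this; it assumes binned arrays x, y, yerr and an lmfit result object like the ones in the sketch above (all names are placeholders).

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumes x, y, yerr (binned rates) and `result` (an lmfit ModelResult) already exist,
# e.g. from the rebinning sketch earlier in the thread.
plt.errorbar(x, y, yerr=yerr, fmt=".", label="binned data")
xfine = np.linspace(x.min(), x.max(), 1000)
plt.plot(xfine, result.eval(x=xfine), "r-", label="best-fit Voigt")
plt.xlabel(r"wavenumber (cm$^{-1}$)")
plt.ylabel("event rate (counts/s)")
plt.legend()
plt.show()
```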
 
  • #5
marcusl said:
“Binning” (I assume that you are aggregating counts within a range of wavelengths) is equivalent to applying a low-pass filter to your data and then decimating. This filter isn’t very good for decimation (it has high sidelobes) so, unless you can be sure that the underlying signal is truly varying slowly compared to your bin intervals, you can be altering your data in subtle ways. If I were doing this, I’d fit to the raw data itself. You say there’s a lot of data, but that’s what computers are for.

Binning is fine for display purposes, however. In Matlab you’d use the hist command (histogram in the newest release) to both bin and plot a histogram of the data, then hold and overlay your best-fit Voigt curve (whatever that is...) in a different color.
Thank you for your reply! I didn't mean that there is too much data for the computer to handle (it is actually not that much). I meant that the data is taken at very small intervals in frequency space, for short periods of time, so if I use the raw data, the bins (i.e. the interval between two increments of the frequency in the experiment) are too small to contain a significant number of events for a proper Voigt profile (the variation in the number of events from one bin to the next would be quite big). The reason why I wanted to bin the data is to have more events per bin and hence a smaller error. For example, in the current binning I might have, in 3 consecutive bins, 2, 7, 5 events. If I make the bins big enough this would go to something more like, say, 500, 502, 499. In both cases the expectation is to have a (more or less) equal number of events, but in the first case the variation is huge and I don't think a Voigt profile would work directly. What do you think? Thank you!
 
  • #6
A good fit will be able to handle small bins. There is a large variation from bin to bin but it isn't significant and there are many bins. If the fit has trouble with many bins this is an indication that the fit method itself has an issue.
 
  • #7
There are alternatives to fitting a histogram to determine the center of a distribution. For example, one could construct a cumulative distribution function for the data and fit the eCDF, which eliminates issues with binning. Alternatively, based on the expected shape of the distribution, you could build a model to perform maximum likelihood estimation of the parameters of your distribution.
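
As a rough sketch of the eCDF idea (toy data and names are placeholders; a real analysis would also have to deal with the background and the varying dwell time), the empirical CDF of the counts and a numerically integrated Voigt CDF could be built like this:

```python
import numpy as np
from scipy.special import voigt_profile
from scipy.integrate import cumulative_trapezoid

# Toy binned data: sorted bin centers `x` and counts per bin (single peak, no background)
x = np.arange(10990.0, 11010.0, 0.1)
counts = np.random.default_rng(0).poisson(40 * np.exp(-0.5 * ((x - 11001.5) / 0.8) ** 2))

ecdf = np.cumsum(counts) / counts.sum()      # fraction of all events up to each x

def voigt_cdf(x, center, sigma, gamma):
    """Model CDF: numerically integrated, normalized Voigt profile (sketch)."""
    pdf = voigt_profile(x - center, sigma, gamma)
    cdf = cumulative_trapezoid(pdf, x, initial=0.0)
    return cdf / cdf[-1]

# One could then fit voigt_cdf(x, center, sigma, gamma) to (x, ecdf),
# e.g. with lmfit.Model(voigt_cdf) or scipy.optimize.curve_fit.
```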
 
  • #8
Ygggdrasil said:
There are alternatives to fitting a histogram to determine the center of a distribution. For example, one could construct a cumulative distribution function for the data and fit the eCDF, which eliminates issues with binning. Alternatively, based on the expected shape of the distribution, you could build a model to perform maximum likelihood estimation of the parameters of your distribution.
Thank you so much for this! Using a cumulative distribution function is a great idea. Just to make sure I get it right: instead of fitting ##f(x)## to ##(x_1,y_1),...,(x_n,y_n)##, I should fit ##F(x)=\int_0^x{f(y)dy}## to ##(x_1,y_1), (x_2,y_1+y_2),...,(x_n,y_1+y_2+...+y_n)##. Is this right?

One thing I am not totally clear about is how to handle the errors on the individual data points. In my case the error on each point is a Poisson error (what I mean is that if the number of events associated with that point is ##N##, i.e. ##y_1 = N##, the error is ##\delta y_1 = \sqrt{N}##). Initially, when I was fitting ##f(x)## to different bin sizes, I was weighting each point by the inverse of the error, i.e. for the case mentioned before the weight in the fit would be ##w_1 = 1/\sqrt{N}##. How should I handle the error on each individual point in this cumulative case? Should I use error propagation, such that for the i-th point the error would be ##\sqrt{\delta y_1^2 + \delta y_2^2 + ... + \delta y_i^2}## (this would give huge error bars for the points towards the end)? Should I use the Poisson error on the newly created data points, i.e. for the i-th point use ##\sqrt{N_1+N_2+...+N_i}##? Or something else? It seems that in both of these cases the errors get bigger and bigger, so the weight associated with them in the fit will be smaller and smaller. Is this right? Intuitively I would put more weight on the points closer to the end (as they contain more of the data, hence they are closer to the real distribution), but my reasoning above says the opposite. What is the best way to deal with the errors? Thank you so much, again, for this idea!
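
One small observation on the two error options described above: for raw Poisson counts they are numerically identical, since ##\sqrt{\delta y_1^2 + ... + \delta y_i^2} = \sqrt{N_1 + ... + N_i}## when ##\delta y_j = \sqrt{N_j}##. A quick check with toy counts:

```python
import numpy as np

N = np.array([2, 7, 5, 9, 4])                          # toy counts per bin
err_propagated = np.sqrt(np.cumsum(np.sqrt(N) ** 2))   # sqrt(dy_1^2 + ... + dy_i^2)
err_poisson = np.sqrt(np.cumsum(N))                    # sqrt(N_1 + ... + N_i)
print(np.allclose(err_propagated, err_poisson))        # True
```

Either way, the cumulative points are strongly correlated with one another, which is part of what the next reply points out.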
 
  • #9
There are many things that go wrong when you fit the CDF. So many that I advise against it.

* Your bins become correlated
* Your symmetric background regions (well outside the peak) suddenly get treated in an asymmetric way
* You have no way to tell the fit your relative certainty about the endpoint of the CDF.

----

Your fit method will probably make a maximum likelihood estimate already.
 
  • #10
mfb said:
A good fit will be able to handle small bins. There is a large variation from bin to bin but it isn't significant and there are many bins. If the fit has trouble with many bins this is an indication that the fit method itself has an issue.
The thing I am not sure about when doing this is that the relative error associated with each bin is pretty big. For example if I have a bin with 4 events, the error associated with it would be ##\sqrt{4} = 2##, so 50% relative uncertainty. If I group together multiple bins, that relative error would be significantly reduced. So I am worried that using the data the way it is now won't give a good fit just because it has such a big range of freedom in each bin. Just to clarify the situation a bit, I need to fit several Voigt profiles to my data, not just one (I said one for the sake of simplicity), so I have around 20 parameters in the end. Would this approach still work with such big relative errors? Thank you!
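
For the multi-peak case, a composite lmfit model is one way to keep the roughly 20 parameters organized. A minimal sketch (the number of peaks, prefixes, and starting values are placeholders):

```python
from lmfit.models import VoigtModel, ConstantModel

guess_centers = [11001.5, 11003.2]          # toy peak positions in cm^-1

model = ConstantModel(prefix="bkg_")        # flat background
for i in range(len(guess_centers)):
    model = model + VoigtModel(prefix=f"p{i}_")

params = model.make_params()
params["bkg_c"].set(value=50.0, min=0)
for i, c in enumerate(guess_centers):
    params[f"p{i}_center"].set(value=c)
    params[f"p{i}_sigma"].set(value=0.3, min=0)
    params[f"p{i}_amplitude"].set(value=10.0, min=0)

# result = model.fit(y, params, x=x, weights=1.0 / yerr)   # x, y, yerr: binned rates
```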
 
  • #11
mfb said:
Your fit method will probably make a maximum likelihood estimate already.

Are you sure about this? Most curve-fitting programs perform least-squares fitting, and while least-squares fitting is equivalent to maximum likelihood estimation for Gaussian errors (as in ordinary linear regression), I don't think this holds in general, for example for Poisson-distributed counts.
 
  • #12
Ygggdrasil said:
Are you sure about this? Most curve-fitting programs perform least-squares fitting, and while least-squares fitting is equivalent to maximum likelihood estimation for Gaussian errors (as in ordinary linear regression), I don't think this holds in general, for example for Poisson-distributed counts.
This is the package that I am using: https://lmfit.github.io/lmfit-py/. It seems that it uses least squares.
 
  • #13
kelly0303 said:
If I group together multiple bins, that relative error would be significantly reduced.
You do not gain information from grouping together multiple bins. If the fit method is good* it can't become better from rebinning. A large relative uncertainty in bins is fine.

*least squares is not. Among other problems least squares sees no problem with predicting negative counts and will happily predict 0.00001 when there are 2 counts. It will probably lead to a reasonable estimate if the fit function is good, but if you can get a better fit method that would help. If you need to use least squares then some rebinning might be interesting.
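
As one illustration of a "better fit method" in this sense, here is a minimal sketch of a binned Poisson-likelihood fit that can run directly on the raw, low-count steps. The model shape, toy data, and starting values are placeholders, not the thread's actual analysis.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import voigt_profile, gammaln

# Toy raw data: one Voigt-like peak on a flat background, few counts per step
rng = np.random.default_rng(1)
x = np.arange(10990.0, 11010.0, 0.1)                    # cm^-1
exposure = np.full_like(x, 0.01)                        # dwell time per step (s)
counts = rng.poisson((50 + 400 * np.exp(-0.5 * ((x - 11001.5) / 0.8) ** 2)) * exposure)

def neg_log_likelihood(params, x, counts, exposure):
    center, sigma, gamma, area, background = params
    rate = background + area * voigt_profile(x - center, abs(sigma), abs(gamma))
    mu = np.clip(rate, 1e-12, None) * exposure           # expected counts, kept positive
    # Poisson NLL: sum(mu - k*log(mu) + log(k!)); the last term is constant in the parameters
    return np.sum(mu - counts * np.log(mu) + gammaln(counts + 1.0))

p0 = [x[np.argmax(counts)], 0.5, 0.1, 500.0, 50.0]       # rough starting values
res = minimize(neg_log_likelihood, p0, args=(x, counts, exposure), method="Nelder-Mead")
print(res.x)                                             # center, sigma, gamma, area, background
```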
 
  • #14
mfb said:
You do not gain information from grouping together multiple bins. If the fit method is good* it can't become better from rebinning. A large relative uncertainty in bins is fine.

*least squares is not. Among other problems least squares sees no problem with predicting negative counts and will happily predict 0.00001 when there are 2 counts. It will probably lead to a reasonable estimate if the fit function is good, but if you can get a better fit method that would help. If you need to use least squares then some rebinning might be interesting.
Thanks for this! So, if I get it right: if I use a likelihood method, the rebinning wouldn't help, so I should use the data as it is (even with big relative errors), while if I use least squares, rebinning might help. Is this right? I will try to see how well a likelihood method works, but assuming I stick to least squares, do you have any advice about how to proceed (going back to my first post of this thread)? Which binning should I use? Should I combine multiple binnings? And if so, how?
 
  • #15
kelly0303 said:
Thanks for this! So, if I get it right: if I use a likelihood method, the rebinning wouldn't help, so I should use the data as it is (even with big relative errors), while if I use least squares, rebinning might help. Is this right?
Yes.

With least squares: Keep bins smaller than your feature sizes, but not too much smaller. 1/10 the FWHM of peaks? If the amplitude is massively different between peak and background and your peak is narrow compared to the overall range a different bin size might be interesting (but probably not worth the effort).

Pick one and use that. Ideally you have some simulation and pick the best binning based on the simulation. Picking the bin size based on the fit result to data is bad, that can bias your result.
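
A rough application of that rule of thumb (all numbers here are placeholders):

```python
# Bin width of roughly FWHM/10, expressed as a number of raw scan steps to merge
fwhm_estimate = 1.0      # guessed FWHM of the narrowest peak, in cm^-1
raw_step = 0.1           # spacing between consecutive scan wavenumbers, in cm^-1
merge = max(1, round((fwhm_estimate / 10.0) / raw_step))
print(f"merge ~{merge} raw steps per bin -> bins of ~{merge * raw_step:.2f} cm^-1")
```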
 
  • #16
mfb said:
Yes.

With least squares: Keep bins smaller than your feature sizes, but not too much smaller. 1/10 the FWHM of peaks? If the amplitude is massively different between peak and background and your peak is narrow compared to the overall range a different bin size might be interesting (but probably not worth the effort).

Pick one and use that. Ideally you have some simulation and pick the best binning based on the simulation. Picking the bin size based on the fit result to data is bad, that can bias your result.
Unrelated (directly) to my question: could you please explain to me (or point me to some readings) why the least-squares method depends on the bin size while the likelihood one doesn't? Thank you!
 
  • #17
They both depend on the bin size. Put all your data in three bins and see how the fit fails. If you throw away too much information both fits will get worse.

If you have a bin with 1 hit, what is a plausible range for the true distribution in this bin? It cannot be zero or negative, and it cannot be too close to zero, otherwise you won't find a hit, and it can't be too far above 1, otherwise finding 1 is very implausible. You can find a confidence interval, e.g. 0.025-5.572 for 95% CL. It is asymmetric and it can't extend to zero. Your likelihood method will take that into account, no problem here. Your least square fit will probably just say "it is x=1, we assign sqrt(x)=1 uncertainty to it". That fit is perfectly fine with a fit function that goes to zero, or even below. This problem generally arises with small event numbers in bins - something you can avoid by rebinning.
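
For reference, that interval matches the exact ("Garwood") central confidence interval for a Poisson mean, which could be computed for example with scipy:

```python
from scipy.stats import chi2

def poisson_interval(n, cl=0.95):
    """Exact (Garwood) central confidence interval for a Poisson mean, given n observed counts."""
    alpha = 1.0 - cl
    lower = 0.0 if n == 0 else 0.5 * chi2.ppf(alpha / 2.0, 2 * n)
    upper = 0.5 * chi2.ppf(1.0 - alpha / 2.0, 2 * (n + 1))
    return lower, upper

print(poisson_interval(1))   # ~(0.025, 5.572), the interval quoted above
```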

There are also unbinned fits - necessarily using the likelihood method. They are even better if you just have a few events in some regions.
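
A minimal sketch of such an unbinned likelihood fit, assuming for simplicity a single peak, no background, uniform dwell time, and toy event positions (none of this is the thread's actual data):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import voigt_profile

# Toy per-event positions; real data would use the per-event wavenumbers
# (or each scan step's wavenumber repeated dN times, e.g. np.repeat(wn, dN)).
rng = np.random.default_rng(2)
events = rng.normal(11001.5, 0.8, size=2000)

def nll(params, events):
    center, sigma, gamma = params
    pdf = voigt_profile(events - center, abs(sigma), abs(gamma))
    return -np.sum(np.log(np.clip(pdf, 1e-300, None)))

res = minimize(nll, x0=[events.mean(), 0.5, 0.1], args=(events,), method="Nelder-Mead")
print(res.x)   # fitted center, sigma, gamma
```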
 
  • #18
mfb said:
They both depend on the bin size. Put all your data in three bins and see how the fit fails. If you throw away too much information both fits will get worse.

If you have a bin with 1 hit, what is a plausible range for the true distribution in this bin? It cannot be zero or negative, and it cannot be too close to zero, otherwise you won't find a hit, and it can't be too far above 1, otherwise finding 1 is very implausible. You can find a confidence interval, e.g. 0.025-5.572 for 95% CL. It is asymmetric and it can't extend to zero. Your likelihood method will take that into account, no problem here. Your least square fit will probably just say "it is x=1, we assign sqrt(x)=1 uncertainty to it". That fit is perfectly fine with a fit function that goes to zero, or even below. This problem generally arises with small event numbers in bins - something you can avoid by rebinning.

There are also unbinned fits - necessarily using the likelihood method. They are even better if you just have a few events in some regions.
Thank you so much for this! How did you get that confidence interval? I assume that for very low statistics the Student's t-distribution is a good way to get the confidence interval, but I think you need at least 2 points there and it is also symmetric.

One more question (if it's ok): if I have a histogram and I want to fit a function to it, say a Gaussian (but anything should work, I think), can I just get the ##\mu## and ##\sigma## from my data, i.e. for example ##\mu = \frac{x_1 + x_2 + ... + x_n}{n}##, or do I need to actually use a fitting procedure to get the best parameters? Do they give the same result? And what confuses me a bit is how I should use the errors associated with each bin (which are ##\sqrt{N}##). If I get the parameters of the Gaussian from the data, I don't use these errors at all in my computation. However, I feel like I should take the errors on my data points into account when I am doing a fit.
 
  • #19
kelly0303 said:
How did you get that confidence interval?
Calculator

If a Gaussian is a good approximation to your shape and you cover the tails well (let's say at least up to 4 sigma), then the ##\mu## and ##\sigma## of the data will be reasonable approximations for your fit result. If you don't have that much of the tails in your distribution, they will not be; if your shape is somehow different, they will not be; and just taking these parameters from your overall distribution won't give you uncertainties either.
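
A toy comparison of the two approaches (sample data, bin count, and names are placeholders): the moment estimates and the fitted parameters come out similar when the Gaussian assumption holds and the tails are covered, but only the fit reports parameter uncertainties.

```python
import numpy as np
from lmfit.models import GaussianModel

rng = np.random.default_rng(3)
samples = rng.normal(11001.5, 0.8, size=5000)                    # toy event positions

mu_moment, sigma_moment = samples.mean(), samples.std(ddof=1)    # directly from the data

counts, edges = np.histogram(samples, bins=60)
centers = 0.5 * (edges[:-1] + edges[1:])
model = GaussianModel()
result = model.fit(counts, model.guess(counts, x=centers), x=centers,
                   weights=1.0 / np.sqrt(np.maximum(counts, 1)))

print(mu_moment, sigma_moment)
print(result.params["center"].value, result.params["sigma"].value)
print(result.params["center"].stderr)                            # the fit also gives an uncertainty
```
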
kelly0303 said:
which are ##\sqrt N##
Well, sort of...
 

1. What is binning in data analysis?

Binning is the process of dividing a continuous variable into smaller, discrete categories or groups. This can help simplify the data and make it easier to interpret.

2. Why is binning important in data analysis?

Binning can help identify patterns and trends in the data that may not be apparent when looking at the continuous variable. It can also make the data more manageable and easier to analyze.

3. How do you determine the number of bins to use?

The number of bins used in binning should be based on the data itself and the desired level of detail. Too few bins can oversimplify the data, while too many bins can make it difficult to see patterns.

4. What are some common methods for binning data?

Some common methods for binning data include equal width binning, equal frequency binning, and quantile binning. Each method has its own advantages and disadvantages, and the best approach may vary depending on the data and the desired outcome.

5. How can binning affect the results of data analysis?

Binning can affect the results of data analysis by changing the distribution of the data and potentially altering the conclusions drawn from the analysis. It is important to carefully consider the binning method used and how it may impact the results.
