Bin Width for Overlay Plots of Data + Estimation

ChrisVer · Sep 6, 2016

*If this belongs in the statistics forum, please move it.*
I've been carrying this question around for a while now.
When you make an overlay plot of data + estimation, what is the appropriate bin width?
The two extremes are:
a single bin for the whole variable range --> you only get the overall normalization
too many bins in the variable range --> you get nothing but a flat or even broken (here and there) distribution, with 0s or 1s for the data.

My question is then: is there a rule of thumb for deciding the number of bins (e.g. to have reasonable errors)? This becomes a little more complicated for variable binning. For example, I was advised to use larger bin widths in the region where my data is rare, but I don't understand why. And I've seen plots where people don't do that [e.g. the plot here, where they show the mT distribution for a W' search: it looks like it has a fixed bin width, but the MC/Data ratio seems to have a weirdly varying one].

jedishrfu · Sep 6, 2016

If you have a Gaussian distribution or something similar, it would seem to make sense to have larger bins near the tails of the curve and smaller bins near the center. In a sense, you could set up the bins to correspond to the sigma conventions.

For a flatter distribution, i.e. one where all values are equally likely, it makes sense to have bins of the same size. Even there, though, you could use larger bins where the outliers lie.

I found this article about binning:

https://statistics.laerd.com/statistical-guides/understanding-histograms.php

mfb · Sep 6, 2016

It depends on what you want to do.

Look for peaks in a spectrum? Use bins finer than the experimental resolution, so a peak can be visible as multiple bins showing some excess.
Perform a cross section measurement? Use bins larger than the experimental resolution (otherwise unfolding is a mess), and make the bins large enough to have a reasonable number of expected events - a measurement of "1 +- 1" is not really useful.
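
One way to act on the "reasonable number of expected events" point is to merge neighbouring bins until the expected yield passes a threshold. The sketch below is only an illustration: the function name `merge_bins_by_expectation`, the threshold `min_expected`, and the example yields are all made up. It decides the edges from the expected (estimated) yields, not from the observed data.

```python
import numpy as np

def merge_bins_by_expectation(edges, expected, min_expected=5.0):
    """Merge adjacent bins, left to right, until every merged bin holds
    at least `min_expected` expected events. Returns the new bin edges.

    `expected` are the estimated yields per original bin (hypothetical
    input, e.g. from simulation), one entry per bin.
    """
    edges = np.asarray(edges, dtype=float)
    expected = np.asarray(expected, dtype=float)
    new_edges = [edges[0]]
    acc = 0.0
    for i, n in enumerate(expected):
        acc += n
        if acc >= min_expected:
            # Enough expected events collected: close the merged bin here.
            new_edges.append(edges[i + 1])
            acc = 0.0
    if acc > 0.0:
        # Leftover low-yield bins at the end are folded into the last
        # merged bin (or form the only bin if nothing was closed yet).
        if len(new_edges) > 1:
            new_edges[-1] = edges[-1]
        else:
            new_edges.append(edges[-1])
    return np.array(new_edges)

# Made-up falling spectrum: fine bins with steeply dropping expectation.
edges = np.array([0, 100, 200, 300, 400, 500, 600, 700, 800])
expected = np.array([120.0, 60.0, 25.0, 9.0, 3.0, 1.5, 0.4, 0.1])

merged = merge_bins_by_expectation(edges, expected, min_expected=4.0)
print(merged)
```

With these numbers the four sparse bins above 400 end up as one wide bin, while the well-populated bins keep their original width.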

chiro · Sep 8, 2016

Hey ChrisVer.

You might want to consider bin sizes based on probability intervals if you have highly "irregular" distributions.

Basically, an interval with a higher probability gets a smaller bin size (interval width) than one with a lower probability.

This means you can allocate different-sized intervals that keep the probability per bin roughly constant, and that is a good indicator of where the information is concentrated.
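
A minimal sketch of such equal-probability binning, assuming you have a sample (or a model you can sample from) that represents the expected distribution. The exponential spectrum, the seed, and the choice of 10 bins are placeholders, not anything from the plot under discussion.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical steeply falling spectrum standing in for the estimation.
sample = rng.exponential(scale=100.0, size=10_000)

n_bins = 10
# Bin edges at equally spaced quantiles: each bin holds ~1/n_bins of the
# probability, so bins are narrow where the density is high and wide in
# the tail.
edges = np.quantile(sample, np.linspace(0.0, 1.0, n_bins + 1))

counts, _ = np.histogram(sample, bins=edges)
widths = np.diff(edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:8.2f}, {hi:8.2f})  width={hi - lo:8.2f}  count={c}")
```

The counts per bin come out nearly equal, while the widths grow towards the tail.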

FactChecker · Sep 17, 2016

If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin, not its height, matches its probability.
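
As a small illustration of the area-versus-height point, the bar height for a variable-width bin can be taken as the count divided by the bin width (and by the total, if a probability density is wanted). The edges and counts below are invented numbers.

```python
import numpy as np

# Invented variable-width binning and counts.
edges = np.array([0.0, 100.0, 200.0, 400.0, 800.0])
counts = np.array([50, 30, 16, 4])

widths = np.diff(edges)

# Height of each bar = probability per unit of the x variable.
density = counts / (counts.sum() * widths)

# Area of each bar = height * width = fraction of events in that bin.
areas = density * widths

print(density)
print(areas)
```

Plotting `density` instead of `counts` keeps a wide tail bin from looking artificially tall next to its narrower neighbours.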

ChrisVer · Sep 17, 2016

FactChecker said:

If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin, not its height, matches its probability.

Yup, for example the last "Data" bin (in the Data-MC comparison plot) covers the range [1000, ~1150] GeV and is compared to the "integral" of the estimated background in that range...
However, it still looks weird (why allow a blank range [900, 1000] GeV?)... I would have made a single bin from 900 to the end (although almost everything above 1150 is negligible (log scale), except for the very last bin at 1600)...
It's this kind of question that I find "tricky", or rather "confusing", and I'm trying to work out how to make a good choice. It looks like the last bin was thrown out because it would make the "agreement" between data and MC worse, and this is very subjective... (of course the limits and so on are obtained by different methods, looking at the CDF distribution's bins; I'm only talking about "showing" such a plot and what a reader is "looking for/expecting to see" in it)

mfb · Sep 17, 2016

ChrisVer said:

However, it still looks weird (why allow a blank range [900, 1000] GeV?)

Where?
If one bin starts at 1000, then there should be another bin ending at 1000. In a few cases it is necessary to remove bins in the middle, e.g. because some unwanted process completely dominates the event yields there (typically some particle at a given mass), but that doesn't happen at 1000 GeV.

ChrisVer · Sep 17, 2016

mfb said:

Where?

In the Data vs MC plot (bottom pad).
Well, the bin exists but there is no observation in it... so why not combine it with the previous or the next bin? As I said, make a single bin from 900 GeV to 1600 GeV.
Well, "removal" of events happens with selections, as you say... in that particular bin I think it's a probability thing (the highest-mT event is rare, so with an expected number of 0.XYZ events the probability of observing even 1 event there is low, and nothing was observed)

mfb · Sep 17, 2016

Ah, there was the graph. The bins exist, they just did not have any events in them.

Changing your bins depending on the observed data is very bad style. You can mess around with the conclusions that way. And in this case you would lose sensitivity to new effects.
