Bin Width for Overlay Plots of Data + Estimation

  • Context: Graduate 
  • Thread starter: ChrisVer
  • Tags: Variable

Discussion Overview

The discussion revolves around the appropriate bin width for overlay plots of data and estimation, particularly in the context of statistical analysis and visualization. Participants explore various strategies for binning data, considering factors such as distribution shape, experimental resolution, and the impact of bin size on the representation of data.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions the existence of a rule of thumb for determining bin numbers, noting the trade-off between using a single bin for overall normalization and too many bins leading to a flat distribution.
  • Another participant suggests that for Gaussian-like distributions, larger bins should be used near the tails and smaller bins near the center, while also considering the use of larger bins for outliers in more uniform distributions.
  • A different viewpoint emphasizes the need for finer bins than the experimental resolution when looking for peaks in a spectrum, while larger bins are recommended for cross-section measurements to avoid complications in unfolding data.
  • One participant proposes using probability intervals to determine bin sizes, suggesting that higher probability intervals should have smaller bins to reflect the distribution of information accurately.
  • Concerns are raised about ensuring that, when bins of varying width are presented graphically, the area of each bin rather than its height corresponds to its probability.
  • Another participant expresses confusion over the presence of a blank range in a data plot, questioning the rationale behind not combining bins in certain ranges and suggesting that subjective choices may affect the perceived agreement between data and Monte Carlo simulations.
  • There is a caution against changing bin sizes based on observed data, as this could lead to misleading conclusions and loss of sensitivity to new effects.

Areas of Agreement / Disagreement

Participants express a range of views on binning strategies, with no consensus reached on a single approach. The discussion highlights differing opinions on how to handle bin sizes based on distribution characteristics and experimental goals.

Contextual Notes

Participants note that the choice of bin sizes can depend heavily on the specific context of the data being analyzed, including the nature of the distribution and the experimental setup. There are also mentions of subjective decisions impacting the representation of data.

ChrisVer
*If this belongs in the statistics forum, please move it.*
I've been putting this question off for a while now, but...
When you make an overlay plot of data + estimation, what is the appropriate bin width?
The two extremes are:
a single bin for the whole variable range --> you only get the overall normalization
too many bins in the variable range --> you get nothing but a flat or even broken (here and there) distribution, with 0s or 1s for the data.

My question is then: is there a rule of thumb one can use to decide the number of bins (e.g. to have reasonable errors)? This becomes a little more complicated for variable binning. For example, I was advised to use larger bin widths in the region where my data are rare, but I don't understand the reason why. And I've seen plots where people don't do that [e.g. the plot here, which shows the mT distribution for a W' search... it looks like it has a fixed bin width, but the MC-Data ratio seems to have a strangely varying one].
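To make the two extremes concrete, here is a small sketch using a hypothetical toy sample (an exponential spectrum with made-up parameters, not the W' data). It counts how many bins end up with 0 or 1 entries and the median Poisson relative error, 1/sqrt(N), of the filled bins. The last line uses NumPy's generic Freedman-Diaconis estimator, which is only one possible rule of thumb and is not something recommended in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy sample: 1000 events from a falling spectrum (arbitrary units).
data = rng.exponential(scale=200.0, size=1000)

def summarize(n_bins):
    """Fixed-width binning over [0, 2000] with n_bins bins."""
    counts, _ = np.histogram(data, bins=n_bins, range=(0.0, 2000.0))
    filled = counts[counts > 0]
    rel_err = 1.0 / np.sqrt(filled)  # Poisson relative error of each filled bin
    return {
        "n_bins": n_bins,
        "frac_0_or_1": float(np.mean(counts <= 1)),
        "median_rel_err": float(np.median(rel_err)),
    }

# One bin, a moderate number of bins, and far too many bins.
summaries = [summarize(n) for n in (1, 20, 2000)]
for s in summaries:
    print(s)

# A generic estimator shipped with NumPy (Freedman-Diaconis rule).
fd_edges = np.histogram_bin_edges(data, bins="fd")
print("Freedman-Diaconis suggests", len(fd_edges) - 1, "bins")
```

With a single bin the relative error is tiny but all shape information is lost; with 2000 bins most bins hold 0 or 1 entries.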
 
It would seem that if you had a Gaussian distribution or something similar, it makes sense to have larger bins near the tails of the curve and smaller bins near the center. In a sense, you could set up the bins to correspond to the sigma conventions.

For a more level distribution, i.e. one where all choices are equally likely, it makes sense to have bins of the same size. Although here again you could use larger bins where the outliers lie.

I found this article about binning:

https://statistics.laerd.com/statistical-guides/understanding-histograms.php
 
It depends on what you want to do.

Look for peaks in a spectrum? Use bins finer than the experimental resolution, so a peak can be visible as multiple bins showing some excess.
Perform a cross section measurement? Use bins larger than the experimental resolution (otherwise unfolding is a mess), and make the bins large enough to have a reasonable number of expected events - a measurement of "1 +- 1" is not really useful.
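One way to get "a reasonable number of expected events" per bin is to merge neighbouring bins in the tail. The sketch below is only an illustration: the function name `merge_low_expectation_bins`, the threshold of 5 events, and the yields are all hypothetical. It works on the expected (estimated) yields, not on the observed data.

```python
import numpy as np

def merge_low_expectation_bins(edges, expected, min_expected=5.0):
    """Merge adjacent bins, scanning from the high end, until every merged
    bin holds at least `min_expected` expected events (hypothetical helper)."""
    edges = np.asarray(edges, dtype=float)
    expected = np.asarray(expected, dtype=float)
    new_edges = [edges[-1]]
    new_expected = []
    acc = 0.0
    for i in range(len(expected) - 1, -1, -1):
        acc += expected[i]
        if acc >= min_expected:
            new_edges.append(edges[i])  # close the merged bin at its lower edge
            new_expected.append(acc)
            acc = 0.0
    if new_edges[-1] != edges[0]:
        # Leftover bins at the low end did not reach the threshold.
        if new_expected:
            new_expected[-1] += acc     # fold them into the lowest merged bin
            new_edges[-1] = edges[0]
        else:
            new_edges.append(edges[0])  # everything ends up in one bin
            new_expected.append(acc)
    return np.array(new_edges[::-1]), np.array(new_expected[::-1])

# Hypothetical expected yields in ten 100-unit-wide bins of a falling spectrum.
edges = np.arange(0.0, 1100.0, 100.0)
expected = np.array([400, 250, 120, 60, 30, 12, 6, 2.5, 1.0, 0.4])

merged_edges, merged_expected = merge_low_expectation_bins(edges, expected)
print(merged_edges)
print(merged_expected)
```

Here the four highest bins are combined into one bin from 600 to 1000, while the well-populated bins are left alone.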
 
Hey ChrisVer.

You might want to consider bin sizes based on probability intervals if you have highly "irregular" distributions.

Basically, an interval with a higher probability density gets a smaller bin (a narrower interval) than one with a lower probability density.

This means you can choose intervals of different sizes that keep the probability per bin roughly constant, which is a good indicator of where the information is concentrated.
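A minimal sketch of such equal-probability bins, assuming the edges are taken from quantiles of a reference sample. The Gaussian sample here is hypothetical and stands in for whatever estimated distribution is available.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical reference sample standing in for the estimated distribution.
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)

n_bins = 10
# Edges at equally spaced quantiles: each bin holds about 1/n_bins of the probability.
edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
counts, _ = np.histogram(reference, bins=edges)
widths = np.diff(edges)

print("counts per bin:", counts)
print("bin widths:    ", np.round(widths, 3))
```

The counts come out nearly equal, while the bins are narrow near the center and wide in the tails.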
 
If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin, not its height, matches its probability.
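As a small illustration with made-up numbers, the height to draw for each bin is its probability divided by its width, so that height times width gives back the probability. The edges and counts below are hypothetical.

```python
import numpy as np

# Hypothetical variable-width bins and raw counts.
edges = np.array([0.0, 100.0, 200.0, 400.0, 1000.0])
counts = np.array([50, 30, 15, 5])

widths = np.diff(edges)
prob = counts / counts.sum()   # probability content of each bin
density = prob / widths        # bar height to draw
area = density * widths        # area of each drawn bar

for lo, hi, p, d in zip(edges[:-1], edges[1:], prob, density):
    print(f"[{lo:6.0f}, {hi:6.0f}): probability={p:.2f}, height={d:.5f}")
```

Plotting the raw counts as heights instead would make the wide last bin look far more populated per unit of the variable than it is.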
 
FactChecker said:
If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin, not its height, matches its probability.
Yup, for example the last "Data" bin (in the Data-MC comparison plot) covers the range [1000, ~1150] GeV and is compared to the "integral" of the estimated background in that range...
However, it still looks weird (why allow a blank range [900, 1000] GeV?)... I would have made a single bin from 900 to the end (although almost everything above 1150 is negligible (log scale), except for the very last bin at 1600)...
It's these kinds of questions that I find "tricky", or rather "confusing", and I'm trying to work out how to make a good choice. It looks like the last bin was thrown out because it would make the "agreement" between data and MC worse, and this is very subjective... (of course the limits and so on are obtained by different methods, looking at the CDF distribution's bins; I'm only talking about "showing" such a plot and what a reader is "looking for/expecting to see" in it)
 
ChrisVer said:
However, it still looks weird (why allow a blank range [900, 1000] GeV?)
Where?
If one bin starts at 1000, then there should be another bin ending at 1000. In a few cases it is necessary to remove bins in the middle, e.g. because some unwanted process completely dominates the event yields there (typically some particle at a given mass), but that doesn't happen at 1000 GeV.
 
mfb said:
Where?
In the Data vs MC plot (bottom pad).
Well, the bin exists but there is no observation in it... so why not combine it with the previous or the next bin? As I said, make a single bin from 900 GeV to 1600 GeV.
Well, "removal" of events happens through selections, as you say... for that particular bin I think it's a matter of probability (the highest-mT event is rare, so with an expected number of 0.XYZ events the probability of getting even 1 event is low, and none was observed there)
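The probability argument can be made quantitative with the Poisson distribution: for an expected yield mu, the chance of seeing no event is exp(-mu). The mu values below are hypothetical placeholders, since the actual expected yield in that bin is not given here.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when mu are expected."""
    return math.exp(-mu) * mu**k / math.factorial(k)

# Hypothetical expected yields below one event.
for mu in (0.1, 0.3, 0.5):
    p0 = poisson_pmf(0, mu)
    print(f"mu={mu}: P(0 events)={p0:.3f}, P(at least 1)={1.0 - p0:.3f}")
```

For any expectation well below one event, an empty bin is the most likely outcome.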
 
Ah, there was the graph. The bins exist, they just did not have any events in them.

Changing your bins depending on the observed data is very bad style. You can manipulate the conclusions that way. And in this case you would lose sensitivity to new effects.
 
