Bin Width for Overlay Plots of Data + Estimation

  • Thread starter ChrisVer
In summary, the appropriate bin width for an overlay plot of data + estimation depends on the purpose of the plot. To look for peaks in a spectrum, use bins finer than the experimental resolution. For cross-section measurements, use bins larger than the resolution, with a reasonable number of expected events per bin. For "irregular" distributions, consider variable-width bins based on probability intervals. With variable-width bins, make sure that the area of each bin, not its height, matches its probability. Changing the binning based on the observed data can distort the conclusions and should be avoided.
  • #1
ChrisVer
*If this needs to be moved to Statistics, please do so.*
I've been carrying this question around for a while now, but...
When you make an overlay plot of data + estimation, what is the appropriate bin width?
The two extremes are:
a single bin for the whole variable range --> you only get the overall normalization
too many bins in the variable range --> you get nothing but a flat or even broken (here and there) distribution, with 0s or 1s for the data.

My question is then: is there a rule of thumb for choosing the number of bins (e.g. to have reasonable errors)? This becomes a little more complicated for variable binning. For example, I was advised to use larger bin widths in the region where my data is rare, but I don't understand the reason why. And I've seen plots where people don't do that [e.g. the plot here, where they show the mT distribution for a W' search... it looks like it has a fixed bin width, but the data/MC ratio seems to have a weirdly varying one].
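To make the two extremes concrete, here is a minimal sketch assuming pure Poisson counting errors (sqrt(N) per bin) and a made-up toy spectrum. The function name `relative_errors`, the exponential shape, and the range 0 to 1000 are illustrative assumptions, not anything taken from the plot discussed in this thread.

```python
import numpy as np

def relative_errors(sample, n_bins, value_range):
    """Return 1/sqrt(N) for each non-empty bin and the number of empty bins."""
    counts, _ = np.histogram(sample, bins=n_bins, range=value_range)
    filled = counts[counts > 0]
    return 1.0 / np.sqrt(filled), int(np.sum(counts == 0))

rng = np.random.default_rng(seed=1)
# Hypothetical toy spectrum: 1000 events from a falling exponential.
sample = rng.exponential(scale=100.0, size=1000)

for n_bins in (1, 20, 2000):
    rel, n_empty = relative_errors(sample, n_bins, (0.0, 1000.0))
    print(f"{n_bins:5d} bins: median relative error {np.median(rel):.2f}, "
          f"empty bins: {n_empty}")
```

With one bin the relative error is small but all shape information is gone; with 2000 bins most bins are empty and the filled ones hold only a handful of events each, so each has a large relative error.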
 
  • #2
It would seem that if you had a Gaussian distribution or something similar, it makes sense to have larger bins near the tails of the curve and smaller bins near the center. In a sense, you could set up the bins to correspond to the sigma conventions.

For a flatter distribution, i.e. one where all values are equally likely, it makes sense to have bins of the same size. Although here again you could use larger bins where the outliers lie.

I found this article about binning:

https://statistics.laerd.com/statistical-guides/understanding-histograms.php
 
  • #3
It depends on what you want to do.

Look for peaks in a spectrum? Use bins finer than the experimental resolution, so a peak can be visible as multiple bins showing some excess.
Perform a cross section measurement? Use bins larger than the experimental resolution (otherwise unfolding is a mess), and make the bins large enough to have a reasonable number of expected events: a measurement of "1 ± 1" is not really useful.
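One way to get "a reasonable number of expected events" per bin is to merge neighbouring bins using the expectation only. The sketch below is an illustration under stated assumptions: the helper `merge_low_stat_bins`, the threshold of 5 expected events, and the toy yields are all hypothetical. It scans from the high end, where falling spectra usually run out of events.

```python
import numpy as np

def merge_low_stat_bins(edges, expected, min_expected=5.0):
    """Merge adjacent bins, scanning from the high end, until every merged bin
    holds at least `min_expected` expected events.

    Only the expectation (e.g. MC) is used, never the observed data."""
    edges = np.asarray(edges, dtype=float)
    expected = np.asarray(expected, dtype=float)
    new_edges = [edges[-1]]   # collected from high to low, reversed at the end
    new_expected = []
    running = 0.0
    for i in range(len(expected) - 1, -1, -1):
        running += expected[i]
        if running >= min_expected:
            new_edges.append(edges[i])   # close the merged bin at this lower edge
            new_expected.append(running)
            running = 0.0
    if not new_expected:
        # Total expectation below threshold: one bin over the full range.
        new_edges.append(edges[0])
        new_expected.append(running)
    elif new_edges[-1] != edges[0]:
        # Leftover bins at the low end: fold them into the lowest merged bin.
        new_edges[-1] = edges[0]
        new_expected[-1] += running
    return np.array(new_edges[::-1]), np.array(new_expected[::-1])

# Hypothetical expected yields in five 100-unit-wide bins.
edges = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0]
expected = [50.0, 20.0, 6.0, 2.0, 0.5]
merged_edges, merged_expected = merge_low_stat_bins(edges, expected)
print(merged_edges)      # edges of the merged binning
print(merged_expected)   # expected yield per merged bin
```

Because the merging looks only at the expected yields, the binning can be fixed before the data are examined.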
 
  • #4
Hey ChrisVer.

You might want to consider bin sizes based on probability intervals if you have highly "irregular" distributions.

Basically, an interval with a higher probability density gets a narrower bin than one with a lower probability density.

This way you can allocate different-sized intervals that keep the probability per bin roughly constant, which is a good indicator of where the information is concentrated.
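A minimal sketch of such equal-probability bins, assuming the edges are taken as quantiles of a reference (estimation) sample rather than of the observed data. The helper name `equal_probability_edges` and the Gaussian toy sample are hypothetical.

```python
import numpy as np

def equal_probability_edges(reference_sample, n_bins):
    """Bin edges such that each bin holds ~1/n_bins of the reference sample."""
    probs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(reference_sample, probs)

rng = np.random.default_rng(seed=2)
# Hypothetical estimation sample: a unit Gaussian.
mc = rng.normal(loc=0.0, scale=1.0, size=10_000)

edges = equal_probability_edges(mc, n_bins=10)
counts, _ = np.histogram(mc, bins=edges)
widths = np.diff(edges)
print(counts)   # roughly equal counts per bin
print(widths)   # narrow bins near the peak, wide bins in the tails
```

The resulting bins are narrow where the density is high and wide in the tails, which matches the advice to use wider bins where events are rare.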
 
  • #5
If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin matches its probability, not the height.
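In practice this means dividing each bin's content by its width before drawing it. Here is a small sketch with made-up counts and edges; `density_heights` is a hypothetical helper name.

```python
import numpy as np

def density_heights(counts, edges):
    """Bar heights such that each bar's area equals the bin's share of events."""
    counts = np.asarray(counts, dtype=float)
    widths = np.diff(np.asarray(edges, dtype=float))
    return counts / (counts.sum() * widths)

# Hypothetical variable-width binning with wider bins in the tail.
edges = np.array([0.0, 100.0, 200.0, 400.0, 1000.0])
counts = np.array([40, 30, 20, 10])

heights = density_heights(counts, edges)
areas = heights * np.diff(edges)
print(heights)  # what to draw on the y-axis
print(areas)    # each bin's fraction of the total; sums to 1
```

Dividing by the width alone, without the normalization to unit area, gives the "events per unit of x" convention. `np.histogram(..., density=True)` applies the same width-and-total normalization when histogramming raw values.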
 
  • #6
FactChecker said:
If you have bins of varying size and present them on a graph like your example chart, you should make sure that the area of each bin matches its probability, not the height.
Yup, for example the last "Data" bin (in the data-MC comparison plot) covers the range [1000, ~1150] GeV and is compared to the "integral" of the estimated background in that range...
However, it still looks weird (why allow a blank range [900, 1000] GeV?)... I would have made a single bin from 900 to the end (although almost everything above 1150 is negligible (log scale), except for the very last bin at 1600)...
It's these kinds of questions that I find "tricky", or rather "confusing", and I'm trying to work out how to make a good choice. It looks like the last bin was thrown out because it would make the "agreement" between data and MC worse, and this is very subjective... (Of course the limits and so on are obtained by different methods, looking at the bins of the CDF distribution. I'm only talking about "showing" such a plot and what a reader is looking for or expecting to see in it.)
 
  • #7
ChrisVer said:
However, it still looks weird (why allow a blank range [900, 1000] GeV?)
Where?
If one bin starts at 1000, then there should be another bin ending at 1000. In a few cases it is necessary to remove bins in the middle, e.g. because some unwanted process completely dominates the event yields there (typically some particle at a given mass), but that doesn't happen at 1000 GeV.
 
  • #8
mfb said:
Where?
In the data vs. MC plot (bottom pad).
Well, the bin exists but there is no observation in that bin... so why not combine it with the previous or the next bin? As I said, make a single bin from 900 GeV to 1600 GeV.
Well, "removal" of events happens through selections, as you say... In that particular bin I think it's a matter of probability: the highest-mT events are rare, so with an expected number of 0.XYZ events it is unlikely to observe even 1 event there.
 
  • #9
Ah, there was the graph. The bins exist, they just did not have any events in them.

Changing your bins depending on the observed data is very bad style. You can manipulate the conclusions that way. And in this case you would lose sensitivity to new effects.
 

1. What is the purpose of binning in overlay plots of data + estimation?

Binning groups the data into intervals so that the distribution can be visualized and compared with the estimate bin by bin. It simplifies the data and makes them easier to interpret.

2. How do I determine the appropriate bin width for my data?

The appropriate bin width depends on the range and nature of your data and on the purpose of the plot. One approach is to start with a small bin width and gradually increase it until the distribution is represented clearly. Ideally, do this tuning on the expected distribution rather than on the observed data.

3. Can I use the same bin width for all my data sets?

Not necessarily. The bin width should be chosen for each data set, since the range and distribution of the data may vary, and a binning that suits one data set may misrepresent another. Within a single overlay plot, however, the data and the estimate must share the same bin edges so that they can be compared bin by bin.

4. What happens if my bin width is too large?

If the bin width is too large, it may result in a loss of important details and features in the data. The distribution of the data may appear too smooth and it may be difficult to identify any patterns or outliers.

5. Is there a standard or optimal bin width to use?

There is no standard or optimal bin width to use for all data sets. It is recommended to experiment with different bin widths and choose the one that best represents the data and is easy to interpret.
