Boosted Decision Trees algorithm

SUMMARY

The Boosted Decision Trees (BDT) algorithm is used in particle physics for signal and background discrimination, for example in identifying neutral pions within the ATLAS experiment. Training begins with a one-dimensional cut on the most discriminating variable and iteratively refines the model by applying cuts on further variables, until a subsample falls below a minimum number of events. Misclassified candidates are then weighted more heavily and the procedure is repeated, ultimately producing a likelihood estimator for signal-versus-background classification.

PREREQUISITES
  • Understanding of Boosted Decision Trees (BDT) methodology
  • Familiarity with particle physics concepts, particularly neutral pions
  • Knowledge of statistical significance in event classification
  • Experience with machine learning algorithms and their training processes
NEXT STEPS
  • Study the implementation of Boosted Decision Trees in machine learning libraries such as scikit-learn
  • Explore the application of BDT in particle physics research papers, focusing on ATLAS data analysis
  • Learn about statistical methods for determining minimum event thresholds in classification tasks
  • Investigate advanced techniques in feature selection and variable importance in BDTs
USEFUL FOR

Researchers in particle physics, data scientists working with classification algorithms, and machine learning practitioners interested in advanced decision tree methodologies.

ChrisVer
Science Advisor
I am not sure whether it should be here, or in statistical mathematics or in computers thread...feel free to move it. I am using it here because I am trying to understand the algorithm when it's used in particle physics (e.g. identification of neutral pions in ATLAS).

As I read it:

In general we have some (possibly correlated) discriminating variables and we want to combine them into a single, more powerful discriminant. For that we use the Boosted Decision Tree (BDT) method.
The method is trained on a signal sample (the cluster closest to each \pi^0, with a selection on the cluster-pion distance \Delta R < 0.1) and a background sample (the rest).
Training starts by applying a one-dimensional cut on the variable that best discriminates between the signal and background samples.
This is then repeated in both the passing and failing sub-samples, each time using the next most powerful variable, until the number of events in a subsample reaches a predefined minimum.
Objects are then classified as signal or background depending on whether they end up in a signal-like or background-like subsample. The result defines a tree.
The process is repeated with wrongly classified candidates weighted more heavily (boosted), and it stops when a pre-defined number of trees has been reached. The output is a likelihood estimator of whether the object is signal or background.

My questions:
Starting with the best discriminating variable, the algorithm checks whether some cut on it is satisfied, leading to a YES or NO branch.
Then, in each of the two resulting boxes, another check is made with some other variable, and so on... What I don't understand is how this can stop somewhere rather than continuing until all the variables have been checked (i.e. I don't understand the phrase "until the number of events in a certain subsample has reached a minimum number of objects").
A picture of what I am trying to explain is shown below (the weird names are just the names of the variables; S: signal, B: background):
[Attached image: planck_cmb.jpg, a sketch of the decision tree]
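To make the stopping criterion concrete, here is a minimal scikit-learn sketch (toy data, nothing here comes from the ATLAS analysis): with a minimum leaf size imposed, a single tree can only make a limited number of cuts, so it generally stops before it has used every variable.

```python
# One tree with a minimum leaf size: splitting stops once a subsample
# is too small, so the tree need not cut on every variable.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 500
# Five variables, but only the first two actually discriminate.
X_sig = rng.normal(0.0, 1.0, size=(n, 5))
X_sig[:, :2] += 1.5
X_bkg = rng.normal(0.0, 1.0, size=(n, 5))
X = np.vstack([X_sig, X_bkg])
y = np.hstack([np.ones(n), np.zeros(n)])

# With 1000 events and min_samples_leaf=100, the tree can have at most
# 10 leaves, i.e. at most 9 cuts in total.
tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=0).fit(X, y)
used = sorted(set(tree.tree_.feature[tree.tree_.feature >= 0]))
print("variables actually cut on:", used)
print("number of leaves:", tree.get_n_leaves())
```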
 
If you only have 100 events (an arbitrary number) left in a category, the statistical fluctuations are too large to draw reasonable conclusions. You would just be defining your cuts based on random fluctuations, and the resulting S/(S+B) values would look much better than they really are.
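A quick way to quantify that point: for a Poisson-distributed event count the relative fluctuation scales as \sqrt{N}/N = 1/\sqrt{N}, so at N = 100 it is already about 10% (the numbers below are just illustrative):

```python
# Relative statistical fluctuation ~ 1/sqrt(N): about 10% at N = 100,
# so cuts tuned on such a small subsample mostly fit noise.
import numpy as np

for n in (100, 10_000, 1_000_000):
    print(f"N = {n}: relative fluctuation ~ {np.sqrt(n) / n:.1%}")
```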
 

Similar threads

  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 2 ·
Replies
2
Views
5K
  • · Replies 15 ·
Replies
15
Views
3K
  • · Replies 9 ·
Replies
9
Views
2K
  • · Replies 16 ·
Replies
16
Views
3K
  • · Replies 1 ·
Replies
1
Views
6K
  • · Replies 13 ·
Replies
13
Views
4K
  • · Replies 2 ·
Replies
2
Views
3K
Replies
17
Views
4K