Node purity calculation/definition

I’m having some trouble understanding the NodePurityLimit and SeparationType hyperparameters used to train a BDT.

Purity refers to how many events of one type end up in a leaf, but page 117 of the TMVA user guide says about NodePurityLimit:
“In boosting/pruning, nodes with purity>NodePurityLimit are signal; background otherwise.”

I don’t really see what purity has to do with being signal or background. If a leaf contains only background events, its purity is 1 as well. I’ve used CrossEntropy as SeparationType (which, if I understand correctly, is how the purity is calculated), and the definition of that also doesn’t distinguish between event classes.


You are confusing two different concepts.

The node purity is the signal purity, S/(S+B), where S is the number of signal events and B is the number of background events that fall in a node. A leaf containing only background events therefore has purity 0, not 1. The node purity is used to decide whether the node is considered a signal node or a background node (possibly this only applies when UseYesNoLeaf=YES). This is done after training completes.

NodePurityLimit then sets the threshold for the signal/background distinction. Additionally, the node purity is only used for the AdaBoost algorithm. It does not apply to e.g. GradBoost.
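
To make the purity definition and the threshold concrete, here is a minimal Python sketch. It is not TMVA code: the function names are hypothetical, and the limit of 0.5 is just an illustrative value.

```python
def node_purity(n_signal, n_background):
    """Signal purity S / (S + B) of a node (counts may be weighted sums)."""
    total = n_signal + n_background
    if total <= 0:
        raise ValueError("node contains no events")
    return n_signal / total


def leaf_label(n_signal, n_background, node_purity_limit=0.5):
    """Call a leaf 'signal' if its purity exceeds the limit, else 'background'.

    node_purity_limit plays the role of NodePurityLimit; 0.5 is an
    illustrative value chosen for this sketch.
    """
    if node_purity(n_signal, n_background) > node_purity_limit:
        return "signal"
    return "background"
```

With this definition a leaf holding only background events has purity 0, so it lands on the background side of any positive limit, while a leaf with 30 signal and 10 background events has purity 0.75 and is labelled signal.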

SeparationType selects which metric to use for node splitting. In each node, the training algorithm chooses the split that maximises the given separation metric, so that the resulting child nodes are maximally separated.
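
As a rough illustration of what such a metric does, the sketch below implements the Gini index p(1-p) and the cross entropy -p ln(p) - (1-p) ln(1-p) as functions of the purity p, plus a separation gain computed as the parent's index minus the children's indices, each weighted by its event count. The function names are hypothetical and this is an assumption-laden sketch, not TMVA's implementation.

```python
import math


def gini(p):
    """Gini index of a node with signal purity p."""
    return p * (1.0 - p)


def cross_entropy(p):
    """Cross entropy of a node with signal purity p (0 for a pure node)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def separation_gain(index, parent, left, right):
    """Gain of a split; each node is given as a (signal, background) pair.

    Every node's index is weighted by its number of events, so the gain is
    N_parent * I(parent) - N_left * I(left) - N_right * I(right).
    """
    def weighted(node):
        s, b = node
        n = s + b
        return n * index(s / n) if n > 0 else 0.0

    return weighted(parent) - weighted(left) - weighted(right)
```

Both indices are symmetric under p -> 1-p, which is why the metric itself does not care which class is called signal: `separation_gain(gini, (50, 50), (50, 0), (0, 50))` gives the maximal gain of 25.0 for a perfect split, and a split that leaves both children at purity 0.5 gives 0.0.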


Thanks for your reply!

So in case I use GradBoost, I should only look at SeparationType?
If not, what relevance does the node purity have in GradBoost if its limit is only relevant in AdaBoost?


Gradient boosting is based on regression trees (approximating a cross-entropy loss function) so the concept of classifying a single node as signal or background does not apply.
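
To illustrate why a yes/no leaf label has no role here, below is a generic sketch of one gradient-boosting ingredient for a cross-entropy (log) loss with labels y in {0, 1}. It is not TMVA's exact loss or code, and all names are hypothetical: each tree is fitted to the pseudo-residuals y - sigmoid(F), and a leaf returns a real number rather than a class.

```python
import math


def sigmoid(score):
    """Map a real-valued boosted score F to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def pseudo_residuals(labels, scores):
    """Negative gradient of the log loss with respect to the score.

    For y in {0, 1} and p = sigmoid(F) this is simply y - p.
    """
    return [y - sigmoid(f) for y, f in zip(labels, scores)]


def leaf_response(residuals_in_leaf):
    """Least-squares leaf value: the mean residual of the events in the leaf."""
    return sum(residuals_in_leaf) / len(residuals_in_leaf)
```

For example, `pseudo_residuals([1, 0], [0.0, 0.0])` gives `[0.5, -0.5]`: the targets are continuous, so the trees are regression trees and their leaves carry real-valued responses that are summed over the forest.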

So, as you say, SeparationType still applies and is something you can investigate (even though the TMVA user’s guide seems to suggest there is scant difference between the choices).