Dear TMVA experts,
I would like to ask you some questions about how the BDTG method works for multiclass applications, since I didn’t find much about it in the Users’ Guide and I am a novice in this field:
- Are leaf nodes classified according to the class with the highest purity? What is the right interpretation of the BDTG response?
- Is the Gini Index defined in the same way as for a binary problem? How is it used? I was wondering if the Gini Index was calculated for each class (treating the others as background) and then the splitting was carried out using the Index with the best increase in separation power.
- How exactly does the Gradient Boosting algorithm work for a multiclass problem?
Thank you for your help and I hope these questions make some sense,
Gradient boosted decision trees are actually based on regression trees minimising an arbitrary loss function (which can be, e.g., cross-entropy for classification). As such, reasoning about the purity of a node or the Gini index does not apply.
The output of a leaf is rather an indication of correlation with a class, but the final class decision requires considering the outputs for all classes.
In the multiclass setting, BDTG builds one forest for each class, giving one (unbounded) scalar per class. The softmax function is then applied to this output vector to convert it into a bounded vector (summing to 1), which can be interpreted as a categorical probability distribution over the classes.
(The binary classification process can be considered a special case where only one forest is required.)
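As a small illustration (a minimal sketch in plain Python, not TMVA code; the raw scores below are made up), the softmax step turns the per-class forest outputs into class probabilities:

```python
import math

def softmax(scores):
    """Convert unbounded per-class forest outputs into probabilities.

    Subtracting the maximum is a standard numerical-stability trick;
    it does not change the result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of three per-class forests for one event:
raw = [2.0, 0.5, -1.0]
probs = softmax(raw)
print(probs)        # each entry in (0, 1)
print(sum(probs))   # sums to 1
```

The predicted class is then simply the index with the largest probability; with two classes this reduces to the familiar sigmoid of the score difference.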
I hope that brings some clarity!
Thank you for your prompt reply, it was very inspiring.
One last question: if the Gini index is not well defined for a multiclass problem, what is the default separation criterion adopted by the BDTG method for node splitting inside a single decision tree? I am referring to the “SeparationType” option available for the BDTG method.
Thank you again,
Happy to help! Just to be clear, as far as I understand, the Gini index is only well defined for classification trees, not for the regression trees upon which gradient boosting is based. As such, SeparationType does not apply in the BDTG case.
For information on how the regression trees are constructed you can check out this reference, in particular sections 4.3 and 4.5.
In short, the training process for classification defines a loss function and calculates the (functional) gradient at the input points. It then uses regression trees to approximate this (functional) gradient and takes a small step in the negative gradient direction, by adding the tree to the forest with a small multiplicative factor (the learning rate, or shrinkage as it is called in the paper).
This process is then repeated until some stopping criterion is reached.
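To make the loop concrete, here is a toy sketch in plain Python (my own illustration, not TMVA’s implementation). It uses squared-error loss, for which the negative functional gradient is simply the residual, and approximates that gradient at each step with a single regression stump:

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def gradient_boost(x, y, n_trees=200, shrinkage=0.1):
    """Repeatedly fit a stump to the negative gradient (here: the residual)
    and add it to the forest scaled by the learning rate."""
    forest = []
    pred = [0.0] * len(x)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        forest.append(stump)
        pred = [pi + shrinkage * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(shrinkage * s(xi) for s in forest)

# Fit a step function: the boosted model should approach 0 for x < 5, 1 above.
x = list(range(10))
y = [0.0] * 5 + [1.0] * 5
model = gradient_boost(x, y)
print(round(model(2), 3), round(model(8), 3))
```

In TMVA the loss and gradient differ (e.g. cross-entropy for classification, with one such forest per class in the multiclass case), but the structure of the loop is the same.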
Thank you for your fast reply, I’ll definitely look into that reference.