How is the BDTG classifier response obtained in TMVA?

Dear all TMVA experts,

Does anyone know how the BDTG response (the probability of being signal-like or background-like?) is calculated in TMVA?

Is it obtained from the percentage of signal-like votes among all the trees trained in the BDTG (and then scaled to the (-1, 1) range of the BDTG variable)?

Kind regards,
Hai
[Attachment: bdtg_plot.pdf (23.8 KB)]

@moneta can you help?

Hi,

Sorry for the delayed response!

The BDTG algorithm uses regression trees internally: through gradient steps it builds up an approximation to the function that minimises a loss.

As such, the mental model one can use for classification trees does not carry over so well.

In the classification case the output is a probability in [0, 1], which is then scaled to [-1, 1] to match the output range of the other BDT methods in TMVA.

In some sense the final value is calculated by evaluating a function (one that minimises the cross-entropy loss). The BDTs as a collection approximate this unknown function. As such, the value of an individual leaf does not have an easily interpretable meaning (to my knowledge).
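Schematically, and only as a sketch of my understanding rather than the actual TMVA source, you can picture the evaluation like this: the outputs of all the regression trees are summed, the sum is squashed into a [0, 1] probability, and that probability is then rescaled linearly to [-1, 1].

```cpp
#include <cmath>
#include <functional>
#include <vector>

// A "tree" is abstracted here as any function mapping the event's input
// variables to the value of the leaf that the event falls into.
using Tree = std::function<double(const std::vector<double>&)>;

double BDTGResponse(const std::vector<Tree>& forest, const std::vector<double>& event)
{
    // Sum the leaf values of all trees: this is the raw forest output F(x).
    double F = 0.0;
    for (const auto& tree : forest) F += tree(event);

    // Squash to a probability in [0, 1] with a logistic function (my assumption
    // about the squashing used), then rescale linearly to the [-1, 1] BDT range.
    const double pSignal = 1.0 / (1.0 + std::exp(-2.0 * F));
    return 2.0 * pSignal - 1.0;
}
```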

Does this answer your question?

Cheers,
Kim

Dear Kialbert,

Thanks for your detailed explanation! I have also looked at the paper you mentioned in the gradient boosting question I raised in the forum and tried to understand how the whole model works.
So can I understand it this way:
(1) each gradient descent step is one tree, and all the trees together are all the steps down towards the minimum of the cross-entropy (a local or global optimum),
(2) each tree is assigned its “gain” in cross-entropy reduction (something like the effective step length towards the minimum) as its weight,
(3) and each tree gives us a signal-like probability for the current event, so the “BDTG response” is then calculated from all those weighted per-tree signal-like probabilities (like drawing a histogram of the signal-like probabilities of all the weighted trees)?
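To check whether I have the picture right, would the training roughly follow the loop sketched below? (This is only my own pseudo-code: FitRegressionTree is a hypothetical helper that fits a small regression tree to the current pseudo-residuals, and I have left out details such as re-optimising the leaf values.)

```cpp
#include <cmath>
#include <vector>

// Hypothetical types for illustration only.
struct Event { std::vector<double> x; int label; };   // label: 1 = signal, 0 = background
struct RegressionTree { double Eval(const std::vector<double>& x) const; };
RegressionTree FitRegressionTree(const std::vector<Event>& events,
                                 const std::vector<double>& residuals);

// Rough gradient-boosting loop for the binary cross-entropy loss: each
// iteration fits one regression tree to the negative gradient of the loss
// (the pseudo-residuals) and adds it, scaled by a learning rate, to F(x).
std::vector<RegressionTree> TrainGBDT(const std::vector<Event>& events,
                                      int nTrees, double learningRate)
{
    std::vector<RegressionTree> forest;
    std::vector<double> F(events.size(), 0.0);         // current model value per event

    for (int m = 0; m < nTrees; ++m) {
        std::vector<double> residuals(events.size());
        for (std::size_t i = 0; i < events.size(); ++i) {
            const double p = 1.0 / (1.0 + std::exp(-2.0 * F[i]));  // current signal probability
            residuals[i] = 2.0 * (events[i].label - p);            // negative gradient of the loss
        }
        const RegressionTree tree = FitRegressionTree(events, residuals);
        for (std::size_t i = 0; i < events.size(); ++i)
            F[i] += learningRate * tree.Eval(events[i].x);         // one gradient step
        forest.push_back(tree);
    }
    return forest;
}
```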

If that picture is right, I am curious how the signal-like probability is calculated in the tree. Is it something like logistic regression, i.e. mapping the binary 0-or-1 classification through the continuous logistic function to get an actual probability?

Sorry for these very technical questions… I need to explain the BDTG response in my defense, and there are machine learning professors on my committee who are curious how we got this BDTG response, because it seems the common ML application is just to make predictions rather than constructing a BDTG response to do background filtering as we do in HEP.

Thanks again!
Hai

Hi,

First of all, there was a mistake in my previous statement. For binary classification the TMVA implementation mirrors that of Section 4.5 in the paper.

Yes, like logistic regression. In fact, if you check Section 4.5 you’ll see that the function used to calculate the output probability is the logistic function.
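Concretely, if F(x) denotes the summed output of the tree forest, then (assuming we are talking about Friedman’s gradient boosting paper) Section 4.5 gives the signal probability as

```latex
p_{\text{signal}}(x) = \frac{1}{1 + e^{-2F(x)}}
```

which is just the logistic function applied to 2F(x).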

I think your understanding is correct up until step (3). I would describe the output of the collection of trees as an “unconstrained probability”, i.e. a high value (in the binary classification case) corresponds to a high probability of the event being signal.

In the final step these “unconstrained probabilities” are converted to a probability (in the multiclass case), or can be converted to a probability (in the binary case). The output function is designed to force the output into [0, 1] (when outputting probabilities).
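For the multiclass case, one example of such an output function is the softmax: each class gets its own unconstrained score, and the scores are converted to probabilities that lie in [0, 1] and sum to one. (The sketch below is only an illustration of that kind of function, not a copy of the TMVA code.)

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax: convert a vector of unconstrained per-class scores into
// probabilities in [0, 1] that sum to one. Subtracting the maximum
// score first keeps the exponentials numerically stable.
std::vector<double> Softmax(const std::vector<double>& scores)
{
    const double maxScore = *std::max_element(scores.begin(), scores.end());

    std::vector<double> probs(scores.size());
    double norm = 0.0;
    for (std::size_t k = 0; k < scores.size(); ++k) {
        probs[k] = std::exp(scores[k] - maxScore);
        norm += probs[k];
    }
    for (double& p : probs) p /= norm;
    return probs;
}
```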

Furthermore, I think you can reason like this: in the case of neural networks trained with backpropagation you don’t necessarily care about each individual gradient update to the parameters; what matters is the properties of the final function. I think there is an analogy to be made here with the GBDT process.

Cheers,
Kim

Dear Kialbert,

Thank you so much for the detailed explanation! I now have a much clearer picture of the BDTG model.
Could you tell me more about how the “unconstrained probability” (the “high value”) is defined or calculated for each tree? Or is that also explained in the paper?

Really appreciate your help!
Hai

If I understand your question correctly, I think one can make an analogy directly to logistic regression.

In “simple” logistic regression, you use a linear model whose output you pass through the logistic function. In the binary classification case that is described in Section 4.5 and implemented in TMVA, the linear model is replaced by a forest of regression trees.

The forest is trained using gradient boosting, but for evaluation/inference this doesn’t really matter as long as we have some guarantee of it being a minimiser.

It is the output before the application of the logistic function that I refer to as the unconstrained probability.
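As a rough sketch of the analogy (hypothetical code, not the TMVA implementation): the only difference between the two cases is what produces the raw score that goes into the logistic function.

```cpp
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Logistic function mapping a raw score to a probability in [0, 1].
double Logistic(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// "Simple" logistic regression: the raw score is a linear model w.x + b.
double LinearScore(const std::vector<double>& w, double b, const std::vector<double>& x)
{
    return std::inner_product(w.begin(), w.end(), x.begin(), b);
}

// Gradient-boosted version: the raw score is the sum of the regression
// trees' outputs. A "tree" is abstracted as a function from the event to
// the value of the leaf it lands in.
using Tree = std::function<double(const std::vector<double>&)>;
double ForestScore(const std::vector<Tree>& forest, const std::vector<double>& x)
{
    double F = 0.0;
    for (const auto& tree : forest) F += tree(x);
    return F;
}

// In both cases the signal probability is Logistic(score) (up to a possible
// factor of 2 in the exponent); the score itself, before the logistic is
// applied, is what I call the "unconstrained probability".
```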

Cheers,
Kim