The rank of variables

Hai_Jiang1 · March 16, 2018, 3:42am

Dear all TMVA experts,

I am using TMVA to find the best selections for my analysis, and when I tried BDTG, I got two kinds of rankings:

                     : Ranking input variables (method unspecific)...

IdTransformation : Ranking result (top variable is best ranked)
: -----------------------------------------------
: Rank : Variable : Separation
: -----------------------------------------------
: 1 : Jets_leading_pT : 5.237e-01
: 2 : MET_met : 5.094e-01
: 3 : dphi_emu : 3.716e-01

And the other is:

                     : Ranking input variables (method specific)...

BDTG : Ranking result (top variable is best ranked)
: --------------------------------------------------------
: Rank : Variable : Variable Importance
: --------------------------------------------------------
: 1 : dphi_emu : 1.725e-01
: 2 : Z_eta : 1.646e-01
: 3 : MET_met : 1.426e-01

Does anyone know the difference between them? They give different ranking results.

Best,
Hai

kialbert · March 16, 2018, 11:01am

Hi,

For the first method, please see ch. 3.1.10 in the TMVA User’s guide.

For the second, see ch. 8.13.4 also in the TMVA User’s guide. It states:

8.13.4 Variable ranking
A ranking of the BDT input variables is derived by counting how often the variables are used to split decision tree nodes, and by weighting each split occurrence by the separation gain-squared it has achieved and by the number of events in the node [32]. This measure of the variable importance can be used for a single decision tree as well as for a forest.
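
To make the quoted definition concrete, here is a minimal sketch in plain Python (not TMVA code). It assumes each node split is recorded as a hypothetical tuple of (variable, separation gain, events in node), sums gain squared times events per variable, and normalises the result so the importances add up to 1. The split values are made up for illustration, the normalisation is an assumption, and any per-tree boost weights are ignored.

```python
from collections import defaultdict

def variable_importance(splits):
    """Sum gain^2 * n_events per variable, then normalise to a total of 1.

    `splits` is a hypothetical record of node splits:
    (variable name, separation gain, number of events in the node).
    """
    raw = defaultdict(float)
    for var, gain, n_events in splits:
        raw[var] += gain ** 2 * n_events
    total = sum(raw.values())
    return {var: (value / total if total > 0 else 0.0) for var, value in raw.items()}

# Made-up splits from an imaginary forest, for illustration only.
splits = [
    ("dphi_emu", 0.30, 1000),
    ("MET_met", 0.20, 600),
    ("dphi_emu", 0.10, 400),
    ("Jets_leading_pT", 0.05, 400),
]

importance = variable_importance(splits)
ranking = sorted(importance, key=importance.get, reverse=True)
for rank, var in enumerate(ranking, start=1):
    print(rank, var, f"{importance[var]:.3e}")
```

A variable only collects importance when the trees actually split on it, so a variable with good stand-alone separation can still rank low if another variable is picked first.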

Cheers,
Kim

Hai_Jiang1 · March 16, 2018, 3:45pm

Dear Kim,

Thank you so much for the explanation!
I have read the chapters you recommended, but I am still a little confused by the comparison of the two results.
According to section 3.1.10, the “variable separation” from the default “method unspecific” ranking represents the difference between signal and background for each variable (and indeed it does; the shape differences are visible in the variable plots). My intuition is that the “variable importance” order obtained from the BDT training should then be the same as the “variable separation” order, because both represent the same thing: the power of a variable to discriminate signal from background. But in my BDT result the two orders are different. Do you know why?

Kind regards,
Hai

kialbert · March 16, 2018, 5:03pm

I can speculate that the BDT result depends on the cuts used in the final forest.

The unspecific variable separation is the separation power considering only that variable, while for the BDT we calculate the statistic from the cuts actually made and the separation improvement each one achieved. You could probably get identical results if you forced the BDT to consider only one variable for splitting, and then repeated this for each variable.

An argument for why they can be different: Consider a single DT with a single cut and two identical variables. The two variables have exactly the same unspecific separation, while the BDT ranking depends entirely on which variable was picked for the cut.
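
The argument above can be tried out with a toy example in plain Python (again, not TMVA code). Two assumptions are made here: the separation is estimated from histograms as half the sum over bins of (s - b)^2 / (s + b), which is how I read the definition in the User's Guide, and the single cut is chosen by Gini gain with ties going to the first variable tried. The Gaussian toy data and all function names are hypothetical.

```python
import random

def separation(sig, bkg, bins=20, lo=-5.0, hi=5.0):
    """Histogram estimate of 0.5 * sum_i (s_i - b_i)^2 / (s_i + b_i)."""
    def hist(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / (hi - lo) * bins)
            counts[min(bins - 1, max(0, i))] += 1  # clip outliers into edge bins
        return [c / len(values) for c in counts]
    s, b = hist(sig), hist(bkg)
    return 0.5 * sum((si - bi) ** 2 / (si + bi) for si, bi in zip(s, b) if si + bi > 0)

def gini(n_sig, n_bkg):
    n = n_sig + n_bkg
    return 0.0 if n == 0 else (n_sig / n) * (n_bkg / n)

def best_single_cut(events, n_vars):
    """Return (variable index, cut value, gain); ties keep the first variable tried."""
    n_sig = sum(1 for _, is_sig in events if is_sig)
    n_bkg = len(events) - n_sig
    parent = len(events) * gini(n_sig, n_bkg)
    best = (None, None, 0.0)
    for var in range(n_vars):
        ordered = sorted(events, key=lambda e: e[0][var])
        left_sig = left_bkg = 0
        for k in range(len(ordered) - 1):
            feats, is_sig = ordered[k]
            if is_sig:
                left_sig += 1
            else:
                left_bkg += 1
            nxt = ordered[k + 1][0][var]
            if nxt == feats[var]:
                continue  # cannot cut between equal values
            n_left = k + 1
            n_right = len(ordered) - n_left
            gain = (parent - n_left * gini(left_sig, left_bkg)
                    - n_right * gini(n_sig - left_sig, n_bkg - left_bkg))
            if gain > best[2]:  # strict '>' so an identical variable never replaces the first
                best = (var, 0.5 * (feats[var] + nxt), gain)
    return best

rng = random.Random(0)
signal = [rng.gauss(1.0, 1.0) for _ in range(200)]
background = [rng.gauss(-1.0, 1.0) for _ in range(200)]
# Two identical input variables: the second is a copy of the first.
events = [((x, x), True) for x in signal] + [((x, x), False) for x in background]

sep = [separation([f[v] for f, s in events if s],
                  [f[v] for f, s in events if not s]) for v in range(2)]
chosen_var, cut, gain = best_single_cut(events, n_vars=2)
# With one split, the normalised importance is 1 for the chosen variable and 0 otherwise.
importance = [1.0 if v == chosen_var else 0.0 for v in range(2)]
print("separation per variable:", sep)
print("importance per variable:", importance)
```

Both variables get the same separation, but only the one that was picked for the cut gets any importance.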

The user’s guide references the original CART paper in section 8.13.4 if you are interested in understanding the ranking better.

Cheers,
Kim

Hai_Jiang1 · March 16, 2018, 5:38pm

Dear Kim,

Very clear explanation!
Now I get it: the unspecific method estimates the separation of each variable independently, on its own, while the BDT ranking of each variable is based on how all the variables are used together in the trees.

Thanks!
Hai