Tree growing in BDT method

Einsiedler · January 10, 2019, 11:23pm

Hello everyone,

I have a quick question regarding the growth of multiple decision trees. Assuming that the ‘UseRandomisedTrees’ option is not activated, how is it possible that the root nodes of different trees, splitting on the same variable, use different cut values?
If this option is not activated, shouldn’t the same full training sample be used in every root node, so that all cut values on the same variable in these root nodes are the same? I am a bit confused by that aspect.

Thank you in advance for any answer you can provide.

kialbert · January 11, 2019, 9:40am

Hi,

BDTs are a stagewise approximation. That is, each tree reduces the error between the BDT output and the target function, using the output of all the previous trees. The input to the optimisation process is therefore different at each tree iteration, and so are the cuts.

Also, please note that UseRandomisedTrees subsamples the variables, not the events: if a variable is selected for use, all data samples are used for that variable. For convergence properties it therefore does not matter whether it is enabled or not, although it does increase the randomness of the search, which can lead to better generalisation.

Einsiedler · January 11, 2019, 4:25pm

Thank you very much for the clarification. I had indeed forgotten about the very essence of boosting, namely that misclassified events get more weight in the following trees and that the cuts may therefore vary.
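
To make the reweighting concrete, here is a minimal sketch in plain Python. It is not TMVA code: it assumes AdaBoost-style event weights, one-split trees (so the root node is the whole tree), a single input variable, and a made-up toy sample. All function names are hypothetical. Every tree sees the same eight events, yet the best root cut moves once the weights change.

```python
import math

def stump_predict(x, cut, polarity):
    # A one-split "tree": +polarity above the cut, -polarity at or below it.
    return polarity if x > cut else -polarity

def best_root_cut(xs, ys, ws):
    """Scan all candidate cuts and return (cut, polarity, weighted_error)."""
    values = sorted(set(xs))
    # Candidate cuts are the midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for cut in candidates:
        for polarity in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if stump_predict(x, cut, polarity) != y)
            if best is None or err < best[2]:
                best = (cut, polarity, err)
    return best

def adaboost_root_cuts(xs, ys, n_trees):
    """Return the root cut chosen at each boosting iteration."""
    n = len(xs)
    ws = [1.0 / n] * n  # every event starts with the same weight
    cuts = []
    for _ in range(n_trees):
        cut, polarity, err = best_root_cut(xs, ys, ws)
        cuts.append(cut)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Misclassified events (y * prediction == -1) are weighted up,
        # correctly classified ones are weighted down.
        ws = [w * math.exp(-alpha * y * stump_predict(x, cut, polarity))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]  # renormalise to unit sum
    return cuts

# Toy sample (assumption): one variable, labels +1 = signal, -1 = background.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [-1, -1, +1, -1, +1, +1, -1, +1]
print(adaboost_root_cuts(xs, ys, 2))
```

With this toy sample the first tree cuts at 2.5 and the second at 7.5, although both scan exactly the same events. Only the event weights differ between the two iterations.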

Regarding your comment about UseRandomisedTrees, does it always take a different sub-sample of the training events for each tree?

kialbert · January 12, 2019, 8:19am

Hi,

No worries, we’re here to help!

When using randomised trees, all events are used. (Variables are selected randomly for each tree, though.)

Cheers,
Kim

Einsiedler · January 13, 2019, 8:49pm

Thanks again. But I am a bit confused by your statement “When using randomised trees all events are used.” In fact, page 120 of the User Guide https://root.cern.ch/download/doc/tmva/TMVAUsersGuide.pdf says that for randomized trees, only a (resampled) subset of the original training events is used.

Cheers,
Kevin

kialbert · January 14, 2019, 2:40pm

The manual is a bit unclear on this point, and also outdated (the option UseNTrainEvents has been renamed to BaggedSampleFraction and requires the option UseBaggedBoost to be enabled).

Thanks for pointing this out to us!

Subsampling of variables and subsampling of events are orthogonal and are controlled by different options. The manual will be updated to reflect this.
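
As a rough illustration of what "orthogonal" means here, the sketch below treats the two kinds of subsampling as independent knobs. It is plain Python, not the TMVA implementation. The function names and the `n_use` and `fraction` parameters are hypothetical. Drawing the variable subset once per call and resampling events with replacement are assumptions of the sketch.

```python
import random

def pick_variables(variables, n_use, rng):
    """Variable subsampling: keep a random subset of the input variables."""
    return sorted(rng.sample(variables, n_use))

def pick_events(n_events, fraction, rng):
    """Event subsampling: draw a fraction of the events (with replacement,
    which is an assumption of this sketch)."""
    n_draw = max(1, round(fraction * n_events))
    return [rng.randrange(n_events) for _ in range(n_draw)]

def training_view(variables, n_events, rng, n_use=None, fraction=None):
    """Return (variables, event indices) available to one tree.

    n_use    -> variable subsampling (the role UseRandomisedTrees plays above)
    fraction -> event subsampling (the role UseBaggedBoost together with
                BaggedSampleFraction plays above)
    Leaving either one as None switches that subsampling off.
    """
    if n_use is None:
        used_vars = list(variables)
    else:
        used_vars = pick_variables(variables, n_use, rng)
    if fraction is None:
        used_events = list(range(n_events))
    else:
        used_events = pick_events(n_events, fraction, rng)
    return used_vars, used_events

rng = random.Random(42)
variables = ["var1", "var2", "var3", "var4"]  # hypothetical variable names

# The four combinations: neither, variables only, events only, both.
for n_use, fraction in [(None, None), (2, None), (None, 0.5), (2, 0.5)]:
    used_vars, used_events = training_view(variables, 10, rng, n_use, fraction)
    print(n_use, fraction, len(used_vars), len(used_events))
```

Switching on only the variable knob leaves all ten events in place, and switching on only the event knob leaves all four variables in place. This is the distinction the two sets of options draw.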

Cheers,
Kim