About BDT option UseBaggedBoost and BaggedSampleFraction

JungChang · January 4, 2019, 3:59am

Dear Sir,

I have a question about the two options UseBaggedBoost and BaggedSampleFraction when used with the BDT AdaBoost setting.

I found the definitions of these two options:
UseBaggedBoost: Use only a random subsample of all events for growing the trees in each boost iteration.
BaggedSampleFraction: Relative size of bagged event sample to original size of the data sample.

And the example code for the BDT classification option is given as:

   if (Use["BDT"])  // Adaptive Boost
      factory->BookMethod( dataloader, TMVA::Types::kBDT, "BDT",
                           "!H:!V:NTrees=850:MinNodeSize=2.5%:MaxDepth=3:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20" );

Does this mean that only a randomly chosen 50% of the total events is used for growing each daughter node,
i.e. that each daughter node uses a different randomly chosen 50% of the events?

Thanks.

Best,
Jung

Axel · January 5, 2019, 8:23pm

@moneta or @kialbert might be able to have a look when back at work next week!

kialbert · January 7, 2019, 6:45pm

Hi and a happy new year!

Not quite: the sample is redrawn for each boost iteration, not for each node.

I.e. using BaggedSampleFraction=0.5 will, for each new tree, redraw 50% of the original training sample to use as the bagged training sample.

Cheers,
Kim

JungChang · January 8, 2019, 1:45am

Dear Kim,

Thanks for the reply.

Happy new year.

May I ask, do you mean that in each tree iteration it will use 50% of the original data, and the other 50% is re-weighted data (if I use AdaBoost)?

Thanks!

Best,
Jung

kialbert · January 9, 2019, 3:05pm

Hi,

To my understanding of the code, the TMVA AdaBoost draws a subsample from the training set and sends this subsample to the boosting process. The reweighting is done on the subsample.

A new subsample is then drawn from the full training data and the process is repeated.

Cheers,
Kim

jwruss · January 17, 2019, 12:13am

Hi Kim,

I haven’t found the UseBaggedBoost feature in the TMVA User’s Guide, so I hope it’s okay to ask some further clarifying questions about it and BaggedSampleFraction.

1. If I were to set BaggedSampleFraction = 0.6, does that mean 60% of the subsample randomly selected from the original training sample for boosting will still be used as the bagged training sample, and the other 40% will be resampled randomly from the original training sample, or vice versa?
2. Is the same bagged subsample used over all boost iterations, or is a fraction equal to BaggedSampleFraction kept from the training sample of the previous boost iteration? So if I were to set BaggedSampleFraction = 0.6, then will the second boost iteration’s training sample be comprised of 60% of the first iteration’s training sample, and then the third iteration’s training sample be comprised of 60% of the second iteration’s training sample?

kialbert · January 17, 2019, 11:13am

Hi,

The UseBaggedBoost option is a generalisation of UseBaggedGrad (the latter used to only apply to gradient boosting, now they are equivalent but the former is preferred).

1. If you use BaggedSampleFraction=0.6 then, for each tree, 60% of the original training sample will be used as the training sample for that iteration. The training sample will be redrawn (from the original training sample) before the next tree is trained. (For completeness, the redraw is done with replacement.)
2. I think this is covered in the answer to (1).

N.B: When discussing randomised trees it can be instructive to think of the TMVA options like so:

UseNVars subsamples the features
BaggedSampleFraction subsamples the data(/events/samples)

These two components provide different forms of stochasticity for classifier training, and both are leveraged in what Breiman and Cutler call “Random Forests”.

To use the Breiman/Cutler “Random Forest” method in TMVA one can use UseRandomisedTrees:UseBaggedBoost, where the former option enables feature subsampling and the latter data subsampling.

Cheers,
Kim

jwruss · January 29, 2019, 1:04am

Hi Kim,

This answer is certainly informative, but I think I might need a little more information to best comprehend it. Does your answer in part (1) regarding BaggedSampleFraction=0.6 mean that the same 60% of the original training sample is used to train each tree, with the other 40% randomly redrawn? If this is the case, is there an option where 60% of the previous iteration’s training sample is kept, while 40% is resampled?

kialbert · February 1, 2019, 12:35pm

Hi,

No. With BaggedSampleFraction=0.6, 60% of the original training set is used for training in that iteration and 40% is left out. Over the course of all iterations, all events from the original training set are used (or at least have a high probability of being used).
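
As a rough check of the "high probability" statement, here is a back-of-the-envelope estimate. It assumes that each tree's sample consists of fN independent, uniform draws with replacement from the N original events, where f stands for BaggedSampleFraction and T for NTrees (these symbols are introduced here for illustration only). The actual TMVA implementation may differ in detail.

```latex
\begin{align*}
P(\text{event absent from one tree's sample})
  &= \left(1 - \tfrac{1}{N}\right)^{fN} \approx e^{-f}, \\
P(\text{event absent from all } T \text{ samples})
  &= \left(1 - \tfrac{1}{N}\right)^{fNT} \approx e^{-fT}.
\end{align*}
```

With f = 0.6 and T = 850 (the NTrees value from the first post), e^{-fT} = e^{-510}, which is negligible. Under the same assumption e^{-0.6} is about 0.55: because the draw is with replacement, some events appear more than once, so the number of distinct events seen by a single tree is somewhat below 60% of N.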

The procedure can be written like this:

When starting the BDT training we define a set of events, originalTrainingSet.
For each tree:
- Draw N*BaggedSampleFraction events with replacement from originalTrainingSet. Call this set training_set_i. N is the number of events in originalTrainingSet.
- Train a new tree using training_set_i.
- Add the tree to the forest and repeat until all trees/iterations are done.
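
The steps above can be sketched in standalone C++. This is only an illustration of the per-tree redraw, not the TMVA source: drawBaggedSample and countEventsEverUsed are hypothetical helper names, events are represented by their indices, and the tree training itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not a TMVA function): draw round(fraction * nEvents)
// event indices uniformly, with replacement, from [0, nEvents).
std::vector<std::size_t> drawBaggedSample(std::size_t nEvents, double fraction,
                                          std::mt19937 &rng)
{
   assert(nEvents > 0 && fraction > 0.0);
   const auto nDraw = static_cast<std::size_t>(fraction * nEvents + 0.5);
   std::uniform_int_distribution<std::size_t> pick(0, nEvents - 1);
   std::vector<std::size_t> sample(nDraw);
   for (auto &idx : sample) idx = pick(rng);   // the same index may repeat
   return sample;
}

// Hypothetical driver: redraw a bagged sample for each of nTrees trees and
// count how many distinct events were drawn at least once overall.
std::size_t countEventsEverUsed(std::size_t nEvents, double fraction,
                                int nTrees, std::mt19937 &rng)
{
   std::vector<bool> everUsed(nEvents, false);
   for (int tree = 0; tree < nTrees; ++tree) {
      // A fresh sample is drawn from the full original set for every tree.
      const auto trainingSet = drawBaggedSample(nEvents, fraction, rng);
      for (auto idx : trainingSet) everUsed[idx] = true;
      // A real implementation would train tree number `tree` on trainingSet here.
   }
   std::size_t nUsed = 0;
   for (bool used : everUsed) nUsed += used ? 1 : 0;
   return nUsed;
}
```

For example, calling countEventsEverUsed(1000, 0.6, 50, rng) with a seeded std::mt19937 should show that essentially every event ends up in at least one tree's sample, even though each individual sample holds only 600 entries.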

Hope that clears things up a little!

Cheers,
Kim

jwruss · February 2, 2019, 12:25am

Thank you, Kim, this really does clear up my misunderstanding!