
Cannot evaluate multiclass BDT response with bagging

If I do a multiclass analysis with bagging enabled, the BDT response histogram has all entries assigned to the overflow bin. Effectively, this means I cannot evaluate my BDT response in a multiclass analysis with bagging. I have no such problem with a standard classification analysis.

Reproducer:


import ROOT as r

r.RDataFrame(1000).Define("v", "gRandom->Gaus(5, 5)").Define("u", "gRandom->Landau(5, 5)").Define("e", "rdfentry_").Snapshot("atree", "sig.root")
r.RDataFrame(1000).Define("v", "gRandom->Gaus(1, 3)").Define("u", "gRandom->Landau(1, 3)").Define("e", "rdfentry_").Snapshot("atree", "bkg.root")
r.RDataFrame(1000).Define("v", "gRandom->Gaus(-1, 10)").Define("u", "gRandom->Landau(-1, 10)").Define("e", "rdfentry_").Snapshot("atree", "oth.root")
fsig = r.TFile.Open("sig.root")
tsig = fsig.atree
fbkg = r.TFile.Open("bkg.root")
tbkg = fbkg.atree
foth = r.TFile.Open("oth.root")
toth = foth.atree
fout = r.TFile.Open("out.root", "recreate")

dl = r.TMVA.DataLoader("dataset")
dl.AddVariable("v", "Gaussian distribution", "", "F")
dl.AddVariable("u", "Landau distribution", "", "F")
dl.AddSpectator("e", "entry number", "")
dl.AddTree(tsig, "sig")
dl.AddTree(tbkg, "bkg")
dl.AddTree(toth, "oth")

dl.PrepareTrainingAndTestTree("", r"nTest_sig=0:nTest_bkg=0:nTest_oth=0:NormMode=NumEvents:!V:SplitSeed=100:SplitMode=Random")
fact = r.TMVA.Factory("TMVAClassification", fout, r"!V:!Silent:AnalysisType=Multiclass")
fact.BookMethod(
    dl,
    r.TMVA.Types.kBDT,
    "BDT",
    r"!H:!V:nTrees=500:BoostType=Grad"
    r":UseBaggedGrad"  # problem
)
fact.TrainAllMethods()
fact.TestAllMethods()
fact.EvaluateAllMethods()

fout.Close()

f = r.TFile.Open("out.root")
h = f.dataset.Method_BDT.BDT.MVA_BDT_Test_sig_prob_for_sig
overflowbin = h.GetNbinsX() + 1
if h.GetEntries() == h.GetBinContent(overflowbin):
    print("BDT response assigned to overflow")
else:
    print("BDT response assigned properly")

The BDT response is assigned properly after commenting out or removing the line marked # problem.

@swunsch @moneta can you please take a look when you have time? Thank you!

@moneta ping

Hi,
The problem is that you are training with a large number of trees and few events, which does not work well with bagging. If I increase, for example, the number of input events per category to 10000 and use 100 trees, it works fine.

Note that the option UseBaggedGrad is deprecated; you should use UseBaggedBoost instead.
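As a rough sketch of the two changes suggested here (more events per category, fewer trees, and UseBaggedBoost in place of UseBaggedGrad), the booking options could be assembled as below. The helper `bdt_options` and the constant `N_EVENTS` are hypothetical names used only for illustration; they are not part of ROOT or TMVA, and the values are just the ones quoted above, not tuned recommendations.

```python
def bdt_options(n_trees=100, use_bagging=True):
    """Build a TMVA BDT option string (hypothetical helper, not a ROOT API)."""
    opts = ["!H", "!V", f"nTrees={n_trees}", "BoostType=Grad"]
    if use_bagging:
        # UseBaggedBoost is the replacement for the deprecated UseBaggedGrad
        opts.append("UseBaggedBoost")
    return ":".join(opts)


# Events per category: 10000 instead of the 1000 used in the reproducer
N_EVENTS = 10000

options = bdt_options()
print(options)
```

In the reproducer, this would presumably mean creating each input with r.RDataFrame(N_EVENTS) and passing options as the last argument of fact.BookMethod(dl, r.TMVA.Types.kBDT, "BDT", options).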

Lorenzo

Ah, I see. Thank you for the help!