TMVA with std::vector as input variable

hcohen · December 23, 2015, 1:20pm

Hi,

I’ve got a classic classification problem in my analysis, and I’m using a BDT trained on the MC.
My events have a large number of properties (jet1pt, angular distributions, el_pt…)
But in my TTree I also have a std::vector named jet_pt. This vector contains the p_T of the jets in the event, and for my analysis its size is larger than 5.

Accidentally I trained my BDT with this variable. TMVA won’t yell at you, since TTree::Draw("jet_pt") yields a histogram, so the training went on perfectly. The problem arose when I wanted to apply the BDT to my data: I realized that no simple variable can describe jet_pt, since TMVA requires a float/int.

BUT, the interesting part is that once this variable was removed, my separation got a lot worse, so I started investigating a bit. It appears that using “jet_pt” multiplies my number of events (by at least 5, since this is my cut), for both signal and background. So my initial guess was that for every entry inside the vector it produces another event, which basically multiplies the number of events. This multiplication is not straightforward, though: if I simply multiply the number of events I’ve got (just duplicate the ntuples), I get the same separation, just with lower errors. What is more interesting is that this variable is ranked the lowest after the BDT finished its training.

So my question is this: how come adding this std::vector to my training improved my classification by a significant factor? Can this actually be used?

hcohen · December 23, 2015, 4:07pm

overtrain_BDT0_inclusive_H300_SR64.eps (28.8 KB)
overtrain_BDT8_ttbb_jp_H300_SR64.eps (21.4 KB)

iw273 · October 11, 2018, 11:05am

Hi, did you ever get an answer as to how exactly this works and why this gives an improvement?

kialbert · October 11, 2018, 2:47pm

Hi,

When adding a vector-type variable to TMVA it’s considered to be a collection of data within an event. E.g. the jet_pt in this case would represent each individual jet. To handle this in a generic way, TMVA adds a TMVA::Event for each entry in the vector(s) while keeping the non-vector data constant for that set of events.

E.g. two data points with variables num_jets and jet_pt, with values num_jets = [1, 2] and jet_pt = [[30], [45, 10]], would be added to TMVA as 3 TMVA::Events, namely:

1) [1, 30]
2) [2, 45]
3) [2, 10]
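
The expansion above can be mimicked in plain C++ without ROOT. This is only a sketch of the behaviour described here, not TMVA's actual code: TreeEntry, FlatEvent and expandEntries are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one TTree entry (not a TMVA or ROOT type).
struct TreeEntry {
    float num_jets;             // scalar branch
    std::vector<float> jet_pt;  // vector branch, one value per jet
};

// Hypothetical stand-in for what ends up as one TMVA::Event.
struct FlatEvent {
    float num_jets;
    float jet_pt;
};

// Emit one flat event per element of jet_pt, copying the scalar
// variable unchanged into every copy.
std::vector<FlatEvent> expandEntries(const std::vector<TreeEntry>& entries) {
    std::vector<FlatEvent> events;
    for (const TreeEntry& entry : entries) {
        assert(!entry.jet_pt.empty());  // assumption: every entry has at least one jet
        for (float pt : entry.jet_pt) {
            events.push_back({entry.num_jets, pt});
        }
    }
    return events;
}
```

Calling expandEntries on the two entries {1, [30]} and {2, [45, 10]} gives three flat events, matching the list above.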

This can lead to problems if the weight normalisation is not done properly. E.g. if your selection replicates signal events more on average than background events and you do not normalise the class weights, the final performance is skewed towards the signal class.
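
To make the skew concrete, here is a small plain C++ sketch (no ROOT). It assumes every replicated copy keeps the full weight of its original entry. The helper names and the jet multiplicities used in the note below are invented for illustration, not taken from the analysis in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: total weight of a class after each entry is
// replicated once per jet and every copy keeps the original weight.
double replicatedWeightSum(const std::vector<int>& jetsPerEntry,
                           double weightPerEntry) {
    double sum = 0.0;
    for (int nJets : jetsPerEntry) {
        assert(nJets >= 0);
        sum += nJets * weightPerEntry;
    }
    return sum;
}

// Fraction of the total weight carried by the signal class.
double signalFraction(double signalSum, double backgroundSum) {
    assert(signalSum + backgroundSum > 0.0);
    return signalSum / (signalSum + backgroundSum);
}
```

With two signal entries of 8 jets each and two background entries of 5 jets each, all with weight 1, the sums become 16 and 10. The signal fraction moves from 0.5 before replication to 16/26 after it, although the number of original entries per class is the same.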

Even if you do per-class normalisation you can still have problems, since TMVA currently does not automatically normalise the event weight for replicated events. It is assumed that you either provide a vector variable for your weights or have already premultiplied your weights before entering them into TMVA. This can be done quite easily by e.g.

dataloader->SetSignalWeightExpression("weight/num_jets");
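
The effect of dividing by num_jets can be checked with a few lines of plain C++ (no ROOT). This is a sketch under the assumption that each of the num_jets copies of an entry receives weight/num_jets; perCopyWeights and totalWeight are hypothetical helper names. Presumably the background sample needs the same treatment through the corresponding SetBackgroundWeightExpression call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: weights carried by the replicated copies of one
// entry when the weight expression is "weight/num_jets".
std::vector<double> perCopyWeights(double weight, int numJets) {
    assert(numJets > 0);
    return std::vector<double>(static_cast<std::size_t>(numJets),
                               weight / numJets);
}

// Sum of the weights of all copies of one entry.
double totalWeight(const std::vector<double>& weights) {
    double sum = 0.0;
    for (double w : weights) sum += w;
    return sum;
}
```

For an entry with weight 2 and 4 jets, each copy carries 0.5 and the copies add up to 2 again, so the entry contributes the same total weight as it would without replication.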

Cheers,
Kim

iw273 · October 12, 2018, 8:00am

Hi Kim,

Great, thanks for the explanation.

Ifan