Hi,
I have a classic classification problem in my analysis, and I am using a BDT trained on MC.
My events have a large number of properties (jet1pt, angular distributions, el_pt…)
However, my TTree also contains a std::vector named jet_pt. This vector holds the p_T of the jets in the event, and in my analysis it always has 5 or more entries.
I accidentally trained my BDT with this variable. TMVA won't complain about it, since TTree::Draw("jet_pt") yields a histogram, so the training ran without any problem. The problem arose when I wanted to apply the BDT to my data: I realized that no simple variable can describe jet_pt, because TMVA requires a float/int.
BUT, the interesting part is that once this variable was removed, my separation got a lot worse, so I started investigating a bit. It appears that using "jet_pt" multiplies my number of events (by at least 5, since this is my cut), for both signal and background. So my initial guess was that every entry inside the vector produces another event, which effectively multiplies the number of events. This multiplication is not straightforward, though: if I simply multiply the number of events I have (just duplicating the ntuples), I get the same separation as without jet_pt, only with lower errors. What is more interesting is that this variable is ranked lowest after the BDT has finished its training.
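To make my guess concrete, here is a minimal sketch in plain C++ (no ROOT or TMVA calls) of what I think is happening. It is only an illustration of the hypothesis, not confirmed TMVA behaviour: the names Event, Row and flattenEvents are made up for this example, and I am assuming each vector element becomes its own training row with the scalar variables copied alongside it.

```cpp
#include <vector>

// Hypothetical per-event record: one scalar variable and the vector branch.
struct Event {
    float jet1pt;
    std::vector<float> jet_pt;
};

// Hypothetical flattened training row: every variable is a single float.
struct Row {
    float jet1pt;
    float jet_pt;
};

// Assumed behaviour: one row per element of jet_pt, with the scalar
// variables of the event repeated in each of those rows.
std::vector<Row> flattenEvents(const std::vector<Event>& events) {
    std::vector<Row> rows;
    for (const Event& ev : events) {
        for (float pt : ev.jet_pt) {
            rows.push_back({ev.jet1pt, pt});
        }
    }
    return rows;
}
```

If this picture is right, it differs from duplicating the ntuples in two ways: the copies of an event are not identical (each carries a different jet_pt value), and an event with more jets contributes more rows than an event with fewer jets.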
So my question is this: how come adding this std::vector to my training improved my classification by a significant factor? Can this actually be used?