Training and testing numbers with preselection in TMVA 4.1.2

Dear All,

I posted this on the TMVA sourceforge mailing list but there wasn’t a response and it isn’t clear if that is still an active way of talking to experts. I am reposting a slightly edited version of my question (with a bit more context) here with the hope someone might be able to help.

Thanks in advance,

-Chris


Dear All,

I was wondering if anyone could help explain whether I am seeing a bug or the intended behavior and, if it is intended, what the rationale is.

For TMVA 4.1.2 we have the following bug fix (from tmva.sourceforge.net):

“Requested number of training and testing events was not correct when pre-selection cuts were applied. Now the number of requested events scales with the preselection efficiency and hence does not need to be adjusted with the pre-selection. This also corrects the problems seen in the Category classifier, where pre-selection is used to build the categories.”

This has radically changed the output of an analysis I am working on, so I am trying to understand how to make it do as requested.

I am using a preselection and am asking for a particular number of events in the background sample for training (to match the statistics in the smaller signal MC sample):

factory->PrepareTrainingAndTestTree("tau_selected==1",
"V:nTest_Signal=1500:nTest_Background=1500:nTrain_Background=32559");

In older versions this worked fine. As the manual points out, the number of events requested refers to events after the pre-selection, and in this case 32559 events were put into the training tree.

However, now what is happening is that the efficiency of the pre-selection is also being applied to my requested number. So the selection is going like this:

Possible in input tree: 1915192

I request: 32559

after preselection: 311197

selected 5290 (not 32559)

5290 is 32559*(311197/1915192). Is this the expected behavior? If so, could someone explain the way we should approach using the variable to get the number of events we want in each sample (including if the efficiency changes)?
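To make the arithmetic above explicit, here is a minimal C++ sketch. It is not TMVA code: the helper names scaledCount and requestFor are hypothetical, and the second function assumes the request really is scaled by the preselection efficiency exactly as observed here, which has not been confirmed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of TMVA): the number of events obtained if the
// requested number is multiplied by the preselection efficiency passed/total.
long scaledCount(long requested, long passed, long total) {
    assert(total > 0 && passed >= 0 && passed <= total);
    const double efficiency = static_cast<double>(passed) / static_cast<double>(total);
    return static_cast<long>(std::floor(requested * efficiency));
}

// Hypothetical inverse: the number one would have to request so that the
// scaled count reaches `wanted`. Assumes the scaling above is what happens.
long requestFor(long wanted, long passed, long total) {
    assert(total > 0 && passed > 0 && passed <= total);
    const double efficiency = static_cast<double>(passed) / static_cast<double>(total);
    return static_cast<long>(std::ceil(wanted / efficiency));
}
```

With the numbers above, scaledCount(32559, 311197, 1915192) gives 5290 and requestFor(32559, 311197, 1915192) gives 200378. Such a request would have to be recomputed whenever the preselection efficiency changes, which is exactly the inconvenience I am asking about.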

Thanks!

-Chris

Hi Chris,

sorry for the very very late reply… I cannot even give any good reasons for missing this…
anyway:

In TMVA you can use two different sets of “cuts” performed on your
input data before the MVA-training.

a) Definition of what is a signal (background) event
i) imagine you have one Monte Carlo ntuple with signal and background events
then you would use this “cut” in order to specify which events are signal
and which events are background.

ii) another example could be that you have in your ntuple events from
all over the detector, but you want to use for the analysis only events
in the “barrel” region.

// the third argument is the tree weight, the cut is the fourth argument
factory->AddTree(yourTree, "Signal",     1.0, "myvar > cutBarrelOnly && myEventTypeVar==1");
factory->AddTree(yourTree, "Background", 1.0, "myvar > cutBarrelOnly && myEventTypeVar==0");

b) “Physics” preselection cuts
These are meant to be preselection cuts in the sense of cuts that
discriminate signal from background so efficiently
that one does not need an MVA algorithm to do that. Performing such
obvious/clear/efficient cuts before the MVA training is generally a
good idea.

These two kinds of cuts are also treated differently in TMVA when it comes to how the
number of requested training/testing events is handled. The first set of cuts,
which is mainly a service that allows the user not to have to produce "clean"
signal and background trees, is applied before any training/testing events are
collected (i.e. the number of training events will be just as specified in
Factory::PrepareTrainingAndTestTree).

However, the second set of cuts (the preselection cuts passed to
PrepareTrainingAndTestTree) is applied AFTER the events for
training/testing have already been chosen. Hence, the actual number used
for the training will differ from the one specified in
PrepareTrainingAndTestTree (i.e. it will be smaller, according to the efficiency of the
preselection cut).
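
A toy illustration of why the order matters, in plain C++ rather than TMVA internals. The function names and the fixed "every n-th event passes" pattern are made up for the example, and events are simply taken from the front of the list rather than picked at random:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each entry says whether that toy event passes the cut.
// passEvery must be greater than zero.
std::vector<bool> makeToyEvents(std::size_t total, std::size_t passEvery) {
    std::vector<bool> events(total);
    for (std::size_t i = 0; i < total; ++i) events[i] = (i % passEvery == 0);
    return events;
}

// Cut first, then collect: yields the requested number if enough events pass.
std::size_t cutThenPick(const std::vector<bool>& events, std::size_t requested) {
    const std::size_t passing =
        static_cast<std::size_t>(std::count(events.begin(), events.end(), true));
    return std::min(passing, requested);
}

// Collect first, then cut: only the passing fraction of the request survives.
std::size_t pickThenCut(const std::vector<bool>& events, std::size_t requested) {
    const std::ptrdiff_t n =
        static_cast<std::ptrdiff_t>(std::min(events.size(), requested));
    return static_cast<std::size_t>(
        std::count(events.begin(), events.begin() + n, true));
}
```

With 1000 toy events of which every fifth passes, a request for 100 gives 100 events in the first case and 20 in the second.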

Cheers,

Helge