Should I split the MC used for "TMVA BDT train&test" and "MVA score"?

yhoonlee · February 6, 2023, 4:21pm

Hi All,

I am doing a signal search analysis and am using TMVA to separate signal from background.
First, I trained and tested a BDT model with TMVAClassification.
Then I used TMVAClassificationApplication to draw the MVA score (the MVA output distribution). [Picture 1]

Question: Should “the MC Sample used for TMVAClassification” and “the MC Sample used for TMVAClassificationApplication” be separated?

At first, I trained a BDT with TMVAClassification on all the MC samples I had, using a 7:3 training:test split.
Then I ran TMVAClassificationApplication on those same MC samples to get the MVA score.

However, within our group, there was an opinion that the same MC used for training/test of the BDT model (TMVAClassification) should not be used to draw the MVA score (TMVAClassificationApplication), but that the MC should be divided and used separately.

So I have now divided the MC into two disjoint sets: “the MC used for BDT Training/Test” and “the MC used for MVA score”.
No event appears in both sets.
But I don’t know which approach is right.
Was my original approach correct, or is the second one (with the MC split) the correct one?

Any help would be greatly appreciated.

Best Regards,
Younghoon.

[Picture 1]
stzct is the signal; it is drawn as the light green line and scaled up by a factor of 10.

moneta · February 7, 2023, 8:09am

Hi,

In TMVA the splitting of the data into training and testing sets (the testing set is normally called validation data in the Machine Learning community) is done automatically for you by DataLoader::PrepareTrainingAndTestTree. You can look at the Users Guide for the documentation.

If your validation data set is used during training, for example to control the convergence of your method, then you need an independent data set, called the testing data set, for the final evaluation of your method. For more details see, for example, “Training, validation, and test data sets” on Wikipedia.

For the BDT in TMVA the validation data is not used during training, so in principle you don’t need to use an independent testing data set. If instead you are using a Neural Network, the validation set is normally used during training (for early stopping) and therefore you would need an independent test set.
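To make the three roles concrete, here is a minimal sketch in plain Python (not TMVA code) of cutting a list of MC event IDs into three disjoint sets: one for training, one for validation (what TMVA calls "test"), and one held back for the final application. The function name `split_events`, the fractions, and the seed are all hypothetical choices for illustration; in a real analysis the split would normally be done on the input trees themselves.

```python
import random

def split_events(event_ids, fractions=(0.5, 0.2, 0.3), seed=12345):
    """Shuffle event IDs reproducibly and cut them into three disjoint lists:
    training, validation (TMVA "test"), and an independent application set."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(event_ids)
    # A fixed seed keeps the split identical from one run to the next.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_valid = int(len(ids) * fractions[1])
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    # Everything left over goes to the application set, so no event is dropped.
    application = ids[n_train + n_valid:]
    return train, valid, application

train, valid, application = split_events(range(1000))
print(len(train), len(valid), len(application))
```

Because the three lists are consecutive slices of one shuffled list, no event can end up in more than one of them.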

Best regards

Lorenzo

yhoonlee · February 7, 2023, 9:56am

Dear Lorenzo,
Thank you so much for your reply.

So, does that mean that “the MC used for TMVAClassification (the MC used for BDT Training/Test)” cannot be used for “TMVAClassificationApplication (the MC used for MVA score)”?
Is the way I proceeded, separating the MCs, correct?

Best Regards,
Younghoon.

moneta · February 8, 2023, 11:20am

Hi,

Yes, this is correct: you cannot use the same input Tree for Training/Test and then again in TMVAClassificationApplication.
But note that TMVAClassification.C evaluates all your methods using the TMVA testing data set, which is separate from the training data.
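As a toy illustration of why a score drawn on the training events can look better than it really is, here is a small plain-Python sketch. It does not use a BDT or TMVA: it swaps in a deliberately overtrained nearest-neighbour "classifier" that simply memorises the training events, and the labels are pure noise, so there is nothing real to learn. All names (`make_events`, `nearest_label`, `accuracy`) and the sample sizes are hypothetical.

```python
import random

def make_events(n, rng):
    """Toy events: one random feature and a label that is pure noise."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def nearest_label(x, training):
    """Predict the label of the closest training event (a pure memoriser)."""
    return min(training, key=lambda ev: abs(ev[0] - x))[1]

def accuracy(events, training):
    """Fraction of events whose predicted label matches the true label."""
    hits = sum(nearest_label(x, training) == y for x, y in events)
    return hits / len(events)

rng = random.Random(2023)
training = make_events(500, rng)
independent = make_events(500, rng)  # never seen by the "classifier"

acc_train = accuracy(training, training)
acc_indep = accuracy(independent, training)
print(f"accuracy on training events:    {acc_train:.2f}")
print(f"accuracy on independent events: {acc_indep:.2f}")
```

On the training events the memoriser finds each event itself as its nearest neighbour, so it looks perfect; on the independent events it can only guess. A real BDT is far less extreme, but the direction of the bias is the same, which is the argument for drawing the final MVA score on events that were not used in training.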

Best Regards

Lorenzo

yhoonlee · February 9, 2023, 10:53am

Dear Lorenzo,

That’s what I wanted to check.
Thank you very much for your reply.
Thanks to you, I was able to proceed with confidence.

Best Regards,
Younghoon.