I am using TMVA for classification problems with PyROOT.
I categorized the events by event number into an odd tree and an even tree, and prepared them separately for training and testing:
dataloader.AddSignalTree(sig_even_tree, sweight, "Training")
dataloader.AddSignalTree(sig_odd_tree, sweight, "Test")
dataloader.AddBackgroundTree(bkg_even_tree, bweight, "Training")
dataloader.AddBackgroundTree(bkg_odd_tree, bweight, "Test")
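For context, here is a small illustration of the even/odd partition by event number (pure Python, no ROOT needed; the event numbers are made up). With real TTrees the equivalent would be something like input_tree.CopyTree("event_number % 2 == 0"), where input_tree and event_number are hypothetical names:

```python
# Hypothetical event numbers standing in for the entries of a TTree.
event_numbers = [101, 102, 103, 104, 105, 106]

# Even event numbers go to the training sample, odd ones to the test sample.
even_events = [n for n in event_numbers if n % 2 == 0]
odd_events = [n for n in event_numbers if n % 2 == 1]

print(even_events)  # [102, 104, 106]
print(odd_events)   # [101, 103, 105]
```

This is only a sketch of the splitting idea; the actual trees in the question were presumably produced upstream with a parity cut on the event number.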
Question 1: I am not sure whether the division above is correct in PyROOT using the "Training"/"Test" keywords.
Question 2: I am also confused: since we have already designated the training tree and the test tree, will the following calls still split the training tree into training and test events?
Put differently, if I want to use all the events in the signal training tree, what should I pass to PrepareTrainingAndTestTree?
dataloader.PrepareTrainingAndTestTree(sCut, sCut, "nTrain_Signal=0:nTrain_Background=0:SplitMode=Random:NormMode=NumEvents:!V")
dataloader.PrepareTrainingAndTestTree(sCut, sCut, "nTrain_Signal=5000:nTrain_Background=5000:SplitMode=Random:NormMode=NumEvents:!V")
As documented in the Users Guide (https://root.cern.ch/download/doc/tmva/TMVAUsersGuide.pdf, page 21), you can pass the option
nTrain_Signal=sig_even_tree->GetEntries(). In this way the full content of
sig_even_tree is used as training data. Do the same for the background tree. Assuming you have no cut applied to the data, you should call PrepareTrainingAndTestTree as follows (note that TTree::GetEntries() returns a Long64_t, so use %lld):
dataloader.PrepareTrainingAndTestTree(sCut, sCut, TString::Format("nTrain_Signal=%lld:nTrain_Background=%lld:SplitMode=Block:NormMode=NumEvents:!V", sig_even_tree->GetEntries(), bkg_even_tree->GetEntries()));
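Since the question uses PyROOT, the same call can be written with an ordinary Python format string instead of TString::Format. A sketch, where the entry counts are hypothetical stand-ins for sig_even_tree.GetEntries() and bkg_even_tree.GetEntries():

```python
# Hypothetical entry counts; in the real script these come from
# sig_even_tree.GetEntries() and bkg_even_tree.GetEntries().
n_sig = 5000
n_bkg = 4000

# Build the TMVA option string requesting all entries for training.
opts = ("nTrain_Signal={}:nTrain_Background={}"
        ":SplitMode=Block:NormMode=NumEvents:!V").format(n_sig, n_bkg)
print(opts)  # nTrain_Signal=5000:nTrain_Background=4000:SplitMode=Block:NormMode=NumEvents:!V

# Then pass it to the DataLoader from the question:
# dataloader.PrepareTrainingAndTestTree(sCut, sCut, opts)
```

Building the option string separately also makes it easy to log or inspect before booking any methods.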
Hi @moneta, thanks for your reply; it dispelled my doubts!
I also got an interesting explanation from ChatGPT:
In the case where a background tree is added for training only, you can safely ignore the testing subset that is produced by
PrepareTrainingAndTestTree, since it will not be used in the analysis. The important thing is to ensure that the training subset is representative of the full dataset, and that it is not biased in any way that could affect the performance of the machine learning model.
I am not sure what ChatGPT means with that.
One has to be careful, since the text is generated and does not always make sense.