I am using TMVA for classification problems with PyROOT.
I categorized the events by event number into an odd tree and an even tree, and prepared them separately for training and testing:
dataloader.AddSignalTree(sig_even_tree, sweight, "Training")
dataloader.AddSignalTree(sig_odd_tree, sweight, "Test")
dataloader.AddBackgroundTree(bkg_even_tree, bweight, "Training")
dataloader.AddBackgroundTree(bkg_odd_tree, bweight, "Test")
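For context, here is a small illustration of the even/odd partition by event number (pure Python, no ROOT needed; the event numbers are made up). With real TTrees the equivalent would be something like input_tree.CopyTree("event_number % 2 == 0"), where input_tree and event_number are hypothetical names:

```python
# Hypothetical event numbers standing in for the entries of a TTree.
event_numbers = [101, 102, 103, 104, 105, 106]

# Even event numbers go to the training sample, odd ones to the test sample.
even_events = [n for n in event_numbers if n % 2 == 0]
odd_events = [n for n in event_numbers if n % 2 == 1]

print(even_events)  # [102, 104, 106]
print(odd_events)   # [101, 103, 105]
```

This is only a sketch of the splitting idea; the actual trees in the question were presumably produced upstream with a parity cut on the event number.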
Question 1: I am not sure whether the division above is correct in PyROOT using the "Training"/"Test" keywords.
Question 2: I am also confused: since we have already designated the training tree and the test tree, will the following calls still split the training tree into training and test events?
Put differently, if I want to use all the events in the signal training tree, what should I pass to PrepareTrainingAndTestTree?
dataloader.PrepareTrainingAndTestTree(sCut, sCut, "nTrain_Signal=0:nTrain_Background=0:SplitMode=Random:NormMode=NumEvents:!V")
dataloader.PrepareTrainingAndTestTree(sCut, sCut, "nTrain_Signal=5000:nTrain_Background=5000:SplitMode=Random:NormMode=NumEvents:!V")
As documented in the Users Guide (https://root.cern.ch/download/doc/tmva/TMVAUsersGuide.pdf, page 21), you can pass the option
nTrain_Signal=sig_even_tree->GetEntries(). In this way the full content of
sig_even_tree is used as training data. Do the same for the background tree. Assuming you have no cut applied to the data, you should call PrepareTrainingAndTestTree as follows (note that TTree::GetEntries() returns a Long64_t, so use %lld):
dataloader.PrepareTrainingAndTestTree(sCut, sCut, TString::Format("nTrain_Signal=%lld:nTrain_Background=%lld:SplitMode=Block:NormMode=NumEvents:!V", sig_even_tree->GetEntries(), bkg_even_tree->GetEntries()));
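Since the question uses PyROOT, the same call can be written with an ordinary Python format string instead of TString::Format. A sketch, where the entry counts are hypothetical stand-ins for sig_even_tree.GetEntries() and bkg_even_tree.GetEntries():

```python
# Hypothetical entry counts; in the real script these come from
# sig_even_tree.GetEntries() and bkg_even_tree.GetEntries().
n_sig = 5000
n_bkg = 4000

# Build the TMVA option string requesting all entries for training.
opts = ("nTrain_Signal={}:nTrain_Background={}"
        ":SplitMode=Block:NormMode=NumEvents:!V").format(n_sig, n_bkg)
print(opts)  # nTrain_Signal=5000:nTrain_Background=4000:SplitMode=Block:NormMode=NumEvents:!V

# Then pass it to the DataLoader from the question:
# dataloader.PrepareTrainingAndTestTree(sCut, sCut, opts)
```

Building the option string separately also makes it easy to log or inspect before booking any methods.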
Hi @moneta, thanks for your reply; it dispelled my doubts!
I also got an interesting explanation from ChatGPT:
In the case where a background tree is added for training only, you can safely ignore the testing subset that is produced by
PrepareTrainingAndTestTree, since it will not be used in the analysis. The important thing is to ensure that the training subset is representative of the full dataset, and that it is not biased in any way that could affect the performance of the machine learning model.
I am not sure what ChatGPT means with that.
One has to be careful, since the text is generated and does not always make sense.