Hello,
I have been looking over other posts to see if they could answer my questions,
but I still am left with some questions that I hope can be answered.
I have the following classes, with the corresponding numbers of events:
Background only classes:
B1 - 25,429,144 events, split across 4 trees of 7,214,251 events, 6,490,789 events, 6,230,359 events, and 5,491,545 events
B2 - 6,326,631 events
B3 - 1,310,522 events
B4 - 117,856 events
Signal classes (sometimes treated as background):
S1 - 124,148 events
S2 - 1,239 events
S3 - 1,234 events
S4 - 90,716 events
Because there are far fewer signal events than background events, and because some signal classes are sometimes treated as background (as in S4 vs. (S1, S2, S3)), I have decided to use 40,000 signal events and 40,000 background events each for training and for testing, for a total of 80,000 signal events and 80,000 background events. With a TMVA::DataLoader object, this looks like
dataLoader -> PrepareTrainingAndTestTree("", 40000, 40000, 40000, 40000);
My first question: Given the number of events available in each class, is 40,000 events per training and testing sample reasonable for a BDT? Am I correct in assuming that the classes are sampled proportionally in the training and testing samples? When I don’t set explicit event counts like this, TMVA seems to crash, perhaps due to overtraining.
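As a sanity check on the event budget, here is a minimal plain C++ sketch (no ROOT required) that compares the requested 40,000 + 40,000 events with what each signal configuration provides. The helper names are hypothetical and not part of TMVA; the counts are copied from the list above.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Event counts per signal class, copied from the list above.
const std::vector<std::int64_t> kSignalS123 = {124148, 1239, 1234};  // S1, S2, S3
const std::int64_t kSignalS4 = 90716;                                // S4

// Total number of events in a group of classes.
std::int64_t sumEvents(const std::vector<std::int64_t>& counts) {
    return std::accumulate(counts.begin(), counts.end(), std::int64_t{0});
}

// Events requested from one side (signal or background): training plus testing.
std::int64_t requestedEvents(std::int64_t nTrain, std::int64_t nTest) {
    assert(nTrain >= 0 && nTest >= 0);
    return nTrain + nTest;
}

// True if the available events cover the requested training + testing split.
bool hasEnoughEvents(std::int64_t available, std::int64_t nTrain, std::int64_t nTest) {
    return available >= requestedEvents(nTrain, nTest);
}
```

Calling `hasEnoughEvents(kSignalS4, 40000, 40000)` or `hasEnoughEvents(sumEvents(kSignalS123), 40000, 40000)` from a macro or a small `main` shows whether either signal definition can supply the 80,000 events requested. This only checks totals; it says nothing about how TMVA draws events from the individual classes.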
My second question: B1 is split over 4 trees (treeB1a, treeB1b, treeB1c, treeB1d), with a total of 25,429,144 events, because hadd won’t merge files containing this many entries. If I add them to the above dataLoader object like
double weight = 1. / 25429144;
dataLoader -> AddBackgroundTree(treeB1a, weight);
dataLoader -> AddBackgroundTree(treeB1b, weight);
dataLoader -> AddBackgroundTree(treeB1c, weight);
dataLoader -> AddBackgroundTree(treeB1d, weight);
then when I add the background tree containing background B2
dataLoader -> AddBackgroundTree(treeB2);
will the events in treeB2 be weighted by (1 / 6,326,631), while the events in treeB1a, treeB1b, treeB1c, treeB1d are uniformly weighted by (1 / 25,429,144)? This is assuming EqualNumEvents normalization.
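To make the weight question concrete, here is a minimal plain C++ sketch of the arithmetic (no ROOT required). It assumes that a tree added without an explicit weight receives the default weight of 1.0 from the `AddBackgroundTree` signature, and it ignores any renormalization that the NormMode option applies afterwards. The helper names are hypothetical and not TMVA functions.

```cpp
#include <cassert>

// Sum of weights contributed by one tree: number of entries times the
// per-tree weight passed to AddBackgroundTree.
double weightedYield(long long entries, double treeWeight) {
    assert(entries >= 0);
    return static_cast<double>(entries) * treeWeight;
}

// Per-tree weight that makes a class with `entries` events sum to 1.
double unitNormWeight(long long entries) {
    assert(entries > 0);
    return 1.0 / static_cast<double>(entries);
}
```

Under these assumptions, `weightedYield(25429144, unitNormWeight(25429144))` is about 1 for B1 as a whole, while `weightedYield(6326631, 1.0)` is 6,326,631 for B2 added without a weight. If that reading is right, B2 would only be weighted by (1 / 6,326,631) if that value were passed explicitly as the second argument.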