With multiple signal and background types, handling one signal or background type divided over multiple trees using weights

jwruss · January 11, 2019, 2:04am

Hello,

I have been looking over other posts to see if they could answer my questions, but I am still left with some that I hope can be answered here.

I have the following classes, with the corresponding numbers of events:
Background only classes:
B1 - 25,429,144 events, divided over 4 trees in 7,214,251 events, 6,490,789 events, 6,230,359 events, and 5,491,545 events
B2 - 6,326,631 events
B3 - 1,310,522 events
B4 - 117,856 events

Signal, sometimes background classes:
S1 - 124,148 events
S2 - 1,239 events
S3 - 1,234 events
S4 - 90,716 events

Given how there are significantly fewer signal events, and how sometimes these events are treated as backgrounds with S4 vs (S1, S2, S3), I have decided to use 40,000 signal and 40,000 background events each for training and for testing, for a total of 80,000 signal events and 80,000 background events. With a TMVA::DataLoader() object, this looks like
dataLoader -> PrepareTrainingAndTestTree("", 40000, 40000, 40000, 40000);

My first question: Given the samples of available classifications, is 40,000 events per training and testing sample reasonable for a BDT? Am I correct in assuming these classifications are sampled proportionally in the training and testing samples? When I don’t set something like this, TMVA seems to crash, perhaps due to overtraining.

My second question: With B1 split over 4 trees (treeB1a, treeB1b, treeB1c, treeB1d) because the hadd macro won’t add together such files containing more than these entries, with a total of 25,429,144 events, if I add them to the above dataLoader object like

double weight = 1. / 25429144;
dataLoader -> AddBackgroundTree(treeB1a, weight);
dataLoader -> AddBackgroundTree(treeB1b, weight);
dataLoader -> AddBackgroundTree(treeB1c, weight);
dataLoader -> AddBackgroundTree(treeB1d, weight);

then when I add the background tree containing background B2
dataLoader -> AddBackgroundTree(treeB2);
will the events in treeB2 be weighted by (1 / 6,326,631), while the events in treeB1a, treeB1b, treeB1c, treeB1d are uniformly weighted by (1 / 25,429,144)? This is assuming EqualNumEvents normalization.

kialbert · January 14, 2019, 1:19pm

For your first question: What is a good number of samples depends on the problem. One way to gain insight here is to do the training and compare the output distribution of the TMVA training and test samples with the distribution on data not trained nor tested on. If there are no significant differences you are using a decent number of samples.
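One way to make this comparison quantitative is a two-sample Kolmogorov-Smirnov distance between the classifier outputs of two samples (for example the test sample and an independent sample). The sketch below is plain C++ with no ROOT dependency. `ksDistance` is a hypothetical helper name, and choosing the KS distance rather than another comparison is an assumption, not something prescribed above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample Kolmogorov-Smirnov distance: the largest absolute difference
// between the empirical cumulative distributions of samples a and b.
// Hypothetical helper for illustration; it is not part of TMVA.
double ksDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::size_t i = 0, j = 0;
    double maxDiff = 0.0;
    while (i < a.size() && j < b.size()) {
        // Advance both samples past the next smallest value.
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        // Compare the two empirical CDFs at x.
        const double cdfA = static_cast<double>(i) / a.size();
        const double cdfB = static_cast<double>(j) / b.size();
        maxDiff = std::max(maxDiff, std::fabs(cdfA - cdfB));
    }
    return maxDiff;
}
```

A distance close to 0 means the two output distributions look alike; what counts as "significant" depends on the sample sizes.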

It could be that your machine runs out of memory when trying to train BDTs on roughly 30 million events.

For your second question: EqualNumEvents normalises per class, meaning that if you have data from three different sources all going into the background class, you have to weight them manually.

In your example EqualNumEvents would weigh each event by 1./(25429144+6326631). If you want to ensure that also B2 has the correct weight you would need to:

dataLoader->AddBackgroundTree(treeB2, 1./6326631);

Or, alternatively, use multiclass classification which can natively handle such cases.
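The manual weighting described here can be sketched in plain C++ without ROOT: a source that is split over several trees gets one common per-event weight, the inverse of its total number of entries. `perEventWeight` is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <vector>

// Per-event weight that makes one physical source sum to 1, even when the
// source is split over several trees. Hypothetical helper, not a TMVA call.
double perEventWeight(const std::vector<long long>& entriesPerTree) {
    long long total = 0;
    for (long long n : entriesPerTree) total += n;
    assert(total > 0);  // an empty source cannot be normalised
    return 1.0 / static_cast<double>(total);
}
```

The returned value would then be passed as the weight argument of AddBackgroundTree for every tree belonging to that source, so that all of its trees share the same per-event weight.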
Cheers,
Kim

jwruss · January 14, 2019, 6:14pm

Hi Kim,

Thank you for your responses. I have reformatted the second question a little bit. What happens if I don’t explicitly set a weight for treeB2 with dataLoader->AddBackgroundTree(treeB2, 1./6326631), as I do for treeB1a through treeB1d above?

MVAAllUpdate.C (14.2 KB)

This question relates back to the above file from this post. With the way I have set up my TMVA::DataLoader() objects here in the file, is my goal of treating the collection of thermal background trees (thermal0SumTree through thermal3SumTree) as a single background achieved with setting

int numThermal = thermal0SumTree -> GetEntries() + thermal1SumTree -> GetEntries() + thermal2SumTree -> GetEntries() + thermal3SumTree -> GetEntries();
double thermalWeight = 1. / numThermal;

dataLoader -> AddBackgroundTree(thermal0SumTree, thermalWeight);
dataLoader -> AddBackgroundTree(thermal1SumTree, thermalWeight);
dataLoader -> AddBackgroundTree(thermal2SumTree, thermalWeight);
dataLoader -> AddBackgroundTree(thermal3SumTree, thermalWeight);

Or if I do this, would I also have to manually set weights for the other background trees? The other backgrounds and signals are each contained in a single tree, unlike this one thermal background, which is split over 4 trees. I worry that if I just add the 4 thermal trees as background trees without weights, their differing numbers of events could bias the training and testing samples towards one of the thermal trees over the others, when I would like sampling to be equally likely between these trees.

kialbert · January 15, 2019, 2:03pm

Hi,

In short, yes, if you set the weights for all trees you can achieve equal sampling. If you don’t, the selection will indeed be favouring either the thermal background or the others depending on configuration.

The important thing for normalisation is the relative importance (weight) of the different trees. If all input events are equally important you don’t have to change a thing. If 25 million events of bkga should be equally important as 6 million events of bkgb the trees need to be normalised to reflect this.

One way to do this is to set the tree weight of bkga to 1./25e6 and that of bkgb to 1./6e6. Another valid pair of weights would be to set the tree weight of bkga to 6e6/25e6 and that of bkgb to 1.
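That both pairs of weights are equivalent can be checked with a few lines of plain C++: only the ratio of the summed weights matters, so multiplying every tree weight by the same constant changes nothing. The function names are hypothetical and the round event counts are the ones quoted above.

```cpp
#include <cassert>

// Total weight carried by one source: number of entries times the
// per-event tree weight.
double sourceWeight(double entries, double treeWeight) {
    assert(entries > 0.0 && treeWeight > 0.0);
    return entries * treeWeight;
}

// Relative importance of source a with respect to source b.
// A value of 1 means both sources carry the same total weight.
double relativeImportance(double nA, double wA, double nB, double wB) {
    return sourceWeight(nA, wA) / sourceWeight(nB, wB);
}
```

Calling relativeImportance(25e6, 1./25e6, 6e6, 1./6e6) and relativeImportance(25e6, 6e6/25e6, 6e6, 1.) should both give a ratio of 1 up to floating-point rounding.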

EqualNumEvents norm mode ensures that the weighted sum is the same for all classes. It does this by reweighting all samples of a class by the same constant.

Warning: (overly?) detailed example below

// sig_tree_1 has 1000 events
// sig_tree_2 has 500 events
// bkga_tree_1 has 20000 events
// bkga_tree_2 has 5000 events
// bkgb_tree_1 has 6000 events

// In this case the initial event weights are set to 1.
dataloader->AddSignalTree(sig_tree_1);
dataloader->AddSignalTree(sig_tree_2);

// Add two background types bkga, and bkgb, of equal importance
dataloader->AddBackgroundTree(bkga_tree_1, 1./25000);
dataloader->AddBackgroundTree(bkga_tree_2, 1./25000);
dataloader->AddBackgroundTree(bkgb_tree_1, 1./6000);

dataloader->PrepareTrainingAndTestTree("", "", "NormMode=EqualNumEvents");

// Weights before normalisation:
// sig_tree_1 : 1 (per event) * 1000 (events)
// sig_tree_2 : 1 (per event) * 500 (events)
// sum of weight signal: 1500 (w_sig)

// bkga_tree_1 : 1./25000 (per event) * 20000 (events)
// bkga_tree_2 : 1./25000 (per event) * 5000 (events)
// bkgb_tree_1 : 1./6000 (per event) * 6000 (events)
// sum of weight background: 2 (w_bkg), i.e. 1 for bkga plus 1 for bkgb

// Weight after normalisation:

// number of (raw) signal events: 1500 (n_sig)
// sig_norm = n_sig / w_sig
// bkg_norm = n_sig / w_bkg

// sig_tree_1 : 1 (per event) * 1000 (events) * 1500/1500 (n_sig/w_sig)
// sig_tree_2 : 1 (per event) * 500 (events) * 1500/1500 (n_sig/w_sig)
// sum of weight signal: 1500

// bkga_tree_1 : 1./25000 (per event) * 20000 (events) * 1500./2. (n_sig/w_bkg)
// bkga_tree_2 : 1./25000 (per event) * 5000 (events) * 1500./2. (n_sig/w_bkg)
// bkgb_tree_1 : 1./6000 (per event) * 6000 (events) * 1500./2. (n_sig/w_bkg)
// sum of weight background: 1500
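
The bookkeeping in this example can be reproduced in plain C++ without ROOT. This is only a sketch of the arithmetic as laid out in the comments above, not TMVA's internal code; `TreeInput`, `sumOfWeights`, `rawEntries` and `classNorm` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// One input tree: its raw entry count and the per-event tree weight
// given to AddSignalTree / AddBackgroundTree.
struct TreeInput {
    long long entries;
    double weight;
};

// Weighted sum contributed by a set of trees (entries * per-event weight).
double sumOfWeights(const std::vector<TreeInput>& trees) {
    double sum = 0.0;
    for (const TreeInput& t : trees) sum += t.entries * t.weight;
    return sum;
}

// Raw (unweighted) number of events in a set of trees.
long long rawEntries(const std::vector<TreeInput>& trees) {
    long long n = 0;
    for (const TreeInput& t : trees) n += t.entries;
    return n;
}

// Common factor applied to every event of a class so that the class's
// weighted sum equals the raw number of signal events (n_sig / w_class).
double classNorm(const std::vector<TreeInput>& signal,
                 const std::vector<TreeInput>& cls) {
    const double w = sumOfWeights(cls);
    assert(w > 0.0);
    return static_cast<double>(rawEntries(signal)) / w;
}
```

Multiplying sumOfWeights of a class by its classNorm gives the class's weighted sum after normalisation, which should match the raw signal count for both signal and background.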

Cheers,
Kim

jwruss · January 15, 2019, 7:14pm

Thanks, Kim! To me the example has just the right amount of detail.

One thing, though: following the example, should the comments at the top read

// bkga_tree_1 has 20000 events
// bkga_tree_2 has 5000 events

That is, should bkga_tree_1 contain 20000 events in your example, so that the total sample size of bkga is 25000?

From page 21 in the TMVA User’s Guide, isn’t the default behavior NormMode=EqualNumEvents? What is the purpose of

dataloader->PrepareTrainingAndTestTree("", "", "NormMode=EqualNumEvents");

?

kialbert · January 17, 2019, 10:57am

Hi,

No worries!

Yes indeed, this is how it should be. I’ll update the previous post.

The purpose is to be explicit.

Cheers,
Kim

jwruss · February 7, 2020, 9:33pm

Hi @kialbert,

I have another question related to what has been discussed here.

Suppose I have three backgrounds A, B, C with corresponding number of events N_A, N_B, and N_C and events stored in corresponding trees treeA, treeB, treeC.

If together A and B represent my background I want to train on, then if I wanted them to be equally represented I would do

dataLoader -> AddBackgroundTree(treeA, 1. / N_A);
dataLoader -> AddBackgroundTree(treeB, 1. / N_B);

Now suppose I want to add background C such that the two backgrounds A+B (A and B together) and C are unweighted relative to each other, but separately the two backgrounds A and B are equally weighted to each other. Would this correspond to the following lines?

dataLoader -> AddBackgroundTree(treeA, double(N_A + N_B) / N_A);
dataLoader -> AddBackgroundTree(treeB, double(N_A + N_B) / N_B);
dataLoader -> AddBackgroundTree(treeC);

Or would this situation be represented by these following lines?

dataLoader -> AddBackgroundTree(treeA, (N_A + N_B) / (2. * N_A));
dataLoader -> AddBackgroundTree(treeB, (N_A + N_B) / (2. * N_B));
dataLoader -> AddBackgroundTree(treeC);

kialbert · March 2, 2020, 3:31pm

Hi,

With the weights 1 / N_A and 1 / N_B you set the sum of the event weights for each tree to 1 (given that the average original event weight is 1). To translate this to the scenario you describe, one can use:

dataLoader -> AddBackgroundTree(treeA, 1. / (2 * N_A));
dataLoader -> AddBackgroundTree(treeB, 1. / (2 * N_B));
dataLoader -> AddBackgroundTree(treeC, 1. / N_C);

Now the events in treeA and treeB taken together sum to 1, as do the events in treeC. Furthermore, the weights for events in treeA and treeB each sum to 1/2.
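
A quick way to convince oneself of these sums is a small stand-alone C++ check. `GroupWeights` and `balancedWeights` are hypothetical names, and any event counts plugged in are purely illustrative.

```cpp
#include <cassert>

// Per-event tree weights for A, B and C.
struct GroupWeights {
    double a;
    double b;
    double c;
};

// A and B share one group in equal parts (1/2 each); C forms its own
// group; each group sums to 1. Hypothetical helper, not a TMVA call.
GroupWeights balancedWeights(long long nA, long long nB, long long nC) {
    assert(nA > 0 && nB > 0 && nC > 0);
    return {1.0 / (2.0 * nA), 1.0 / (2.0 * nB), 1.0 / nC};
}
```

Multiplying each weight by its tree's entry count recovers the per-tree sums: 1/2 for A, 1/2 for B and 1 for C, whatever the individual counts are.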

Cheers,
Kim

jwruss · March 2, 2020, 10:06pm

Thank you for your response, Kim! Your answer has given me more to consider. If it is okay, I was hoping to clarify the last set of lines I posted.

If trees A, B, and C were the only background trees I had, then the lines

dataLoader -> AddBackgroundTree(treeA);
dataLoader -> AddBackgroundTree(treeB);
dataLoader -> AddBackgroundTree(treeC);

would mean the corresponding event weights for each tree could be respectively considered (N_A, N_B, N_C), or perhaps those values divided by N_A + N_B + N_C, correct?

If so, then the lines

dataLoader -> AddBackgroundTree(treeA, (N_A + N_B) / (2. * N_A));
dataLoader -> AddBackgroundTree(treeB, (N_A + N_B) / (2. * N_B));
dataLoader -> AddBackgroundTree(treeC);

would mean the corresponding weights for trees A, B, and C would then respectively be ((N_A + N_B) / 2, (N_A + N_B) / 2, N_C), or these three values divided by N_A + N_B + N_C?

kialbert · March 2, 2020, 10:24pm

Yes, indeed you are correct!

May I ask why you use the tree weight (N_A + N_B) / (2 * N_A)? I would be curious as to what the advantage of that formulation is.

Cheers,
Kim

jwruss · March 2, 2020, 10:54pm

Certainly! I’m considering this formulation when the number of events (N_A, N_B, N_C) reflect the number of event occurrences in a larger but finite data set. The case is akin to N_C corresponding to actual events recorded for a certain type of background while N_A and N_B correspond to simulated events recorded for other types of backgrounds which are expected to occur in the set. The exact proportions at which N_A and N_B are to be expected to occur as actual background aren’t known, so as a first pass I want to weigh them effectively equally, such that the ratio (N_A + N_B) / N_C (their weights, actually) remains unchanged.
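
As a sanity check of this formulation, here is a small stand-alone C++ sketch. The names `PairWeights` and `preservingWeights` are hypothetical, and treeC is assumed to keep its default weight of 1.

```cpp
#include <cassert>

// Per-event tree weights for A and B; C keeps weight 1 in this scheme.
struct PairWeights {
    double a;
    double b;
};

// Weights that make A and B carry the same total weight while keeping
// their combined weight equal to N_A + N_B, so the balance against an
// unweighted C is unchanged. Hypothetical helper, not a TMVA call.
PairWeights preservingWeights(long long nA, long long nB) {
    assert(nA > 0 && nB > 0);
    const double half = (nA + nB) / 2.0;  // target weighted sum per tree
    return {half / nA, half / nB};
}
```

With these weights the weighted sums of A and B are each (N_A + N_B) / 2, so together they still add up to N_A + N_B.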

gsaha009 · April 26, 2020, 4:27am

Hi @kialbert,

I have a question regarding the background weights, and I have been looking over all these posts. I am trying to use separate trees for BDT training and testing. I have a signal and 5 backgrounds, so there are 6 trees for training and 6 for testing. How should I deal with the weights in that case? I mean, if I consider

dataLoader->AddBackgroundTree(B1Train, 1.0/B1TrainEntries, "Training");
dataLoader->AddBackgroundTree(B1Test, 1.0/B1TestEntries, "Test");
or,
dataLoader->AddBackgroundTree(B1Train, 1.0/B1TrainEntries, "Training");
dataLoader->AddBackgroundTree(B1Test, 1.0, "Test");
or,
dataLoader->AddBackgroundTree(B1Train, 1.0/(B1TrainEntries + B1TestEntries), "Training");
dataLoader->AddBackgroundTree(B1Test, 1.0/(B1TrainEntries + B1TestEntries), "Test");

Which would be more appropriate?

Thanks,
Gourab

kialbert · August 20, 2020, 5:32pm

Hi,

This seems like a separate issue; please make sure to open a new topic in the future.

For the case you are describing you can just use the built-in (default) normalisation done automatically by TMVA. The default normalisation is approximately equal to your option 2 with the change dataLoader->AddBackgroundTree(B1Train, S1TrainEntries/B1TrainEntries, "Training"); i.e. the number of effective events in the background tree is set to the number of events in the signal class.

Note that it is also important to be careful with the weighting of the test set. Most often you should not reweight it, as you would introduce biases into your evaluation (i.e. options 1 and 3 are probably wrong unless you know what you are doing).
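
To make the "option 2 with the change" recipe concrete, here is a stand-alone C++ sketch of the weights it implies. `TrainTestWeight` and `defaultLikeWeights` are hypothetical names, and this only mirrors the approximation described above rather than TMVA's actual normalisation code.

```cpp
#include <cassert>

// Per-event weights for one background source.
struct TrainTestWeight {
    double train;  // weight for the training tree
    double test;   // weight for the test tree
};

// Training weight chosen so that the effective number of background
// events equals the number of signal training events; the test tree is
// deliberately left unweighted. Hypothetical helper, not a TMVA call.
TrainTestWeight defaultLikeWeights(long long nSigTrain, long long nBkgTrain) {
    assert(nSigTrain > 0 && nBkgTrain > 0);
    return {static_cast<double>(nSigTrain) / nBkgTrain, 1.0};
}
```

The explicit cast matters here: dividing two integer entry counts directly would truncate the ratio to an integer.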

Cheers,
Kim