I have a follow-up question stemming from the discussion in this thread.
Suppose one of the classification samples has n0 events, each with an importance weight w_i between 0 and 1. Also suppose I have at least two other classification samples, with n1 and n2 events respectively, but no importance weights assigned to them. For all intents and purposes, the events in those two samples can then be thought of as all having importance weight 1.
The question is this: if I want to account for the importance weights in addition to the sample sizes of these three samples, then when I implement the lines
dataloader.AddBackgroundTree(&chain1, n0 / n1);
dataloader.AddBackgroundTree(&chain2, n0 / n2);
should I replace n0, n1, and n2 in the weights n0 / n1 and n0 / n2 with the sum of the weights in each corresponding sample? Or should I replace them with the corresponding Kish effective sample sizes? Or should I keep n0, n1, and n2 as they initially are?
Initially I thought that I should just sum the weights, as in the first option, after a colleague suggested it. But now I'm wondering whether I should use the second option instead, since the same colleague later pointed out that the sum of the weights for a given class isn't necessarily equivalent to the effective number of events in that class. In either option, n1 and n2 would not change in the weights n0 / n1 and n0 / n2 (for the unweighted samples, both the sum of weights and the effective sample size equal the raw event count), but n0 would.
I am inviting @moneta to this thread. He may be able to help you with your question.
Much appreciated @jalopezg.
I have yet to hear a response from @moneta or @kialbert regarding this, so I thought I would try to clarify the question further with one of the macros I'm using that this applies to:
trainStokesFisherOvRResponse.C (42.3 KB)
In macros like the one linked above, I use regular binary TMVA to construct Fisher discriminant multiclass responses, rather than TMVA multiclass FDA. The reason is that when I tried using FDA, the coefficients in the resulting weight files were saved as NaN values, in disagreement with the saved Fisher_tr_S TH1D objects. I also saved the responses for each classification to separate ROOT binaries, because the TestTree objects in a binary only seemed to save the response values for the first classification trained with.
But the above paragraph only explains some of the quirks of the macros. The real question is whether using the effective sample size (as defined between lines 49-56 in the linked macro) is more reasonable than using the larger raw sample size, or the smaller sum of weights (each event weight is between 0 and 1), of a classification sample. Does multiclass TMVA use something like the sum of weights or the effective sample size when weights are present, or are raw sample sizes still used?
To use the terminology introduced here: when weighting classifications based on sample size, but each sample also carries individual event weights, should the classification weights go from using sample sizes (n) to using effective sample sizes (n_e), or sums of weights (W)?
Sorry for my late reply.
I think that if you are doing a binary classification, as in the linked macro, the overall weight of the signal versus background trees is not relevant. What matters is the relative weight of the different background trees, if you want to scale them to the same luminosity.
If the background samples are weighted and you want to scale them to the same luminosity, I think what should be used is the effective sample size (N_e).
Thank you, Lorenzo, that’s what I was hoping to hear!