Different TMVA results when using different computers

Dear all TMVA experts,

My event-selection randomization setup is:

dataloader->PrepareTrainingAndTestTree( mycuts, 
            "SplitMode=Random:NormMode=NumEvents:!V" );

I just found that if I use a different computer (with everything else exactly the same, e.g. the inputs and TMVAClassification.C), I get different training/testing results (e.g. the BDTG distribution).

But if I run it on the same computer, no matter how many times I run TMVAClassification.C, the results are exactly the same.

My guess is that different computers assign different seeds to the random generator, so that different computers produce different results. Is that true?

If that is the case, how can we determine which result we should use, given that they differ randomly?

Many thanks for the help in advance!
Hai

Hi,

You should be able to use the option "SplitSeed=100" to select which seed is used for splitting. This should be the default, however, so the results should be identical across machines.
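For example, to pin the seed explicitly in the option string (a minimal sketch based on your setup; 100 is the documented default, but any fixed integer works):

dataloader->PrepareTrainingAndTestTree( mycuts,
            "SplitMode=Random:SplitSeed=100:NormMode=NumEvents:!V" );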

Could you provide some more details, e.g. which machines you are running on, which ROOT versions, and, if possible, the output files of root -l 'TMVAClassification.C("BDTG")' run on both machines?

Cheers,
Kim

Dear Kim,

Yeah, I just tried different versions of ROOT (6.10.04 and 6.14.04), and the conclusion is that both the random seed and the ROOT version have an impact on the BDTG training.
Especially when there are events with large weights (unluckily, as in our case), the training results (the BDTG spectrum) suffer greatly from this randomization.

If the seed and ROOT version are the same, the results will be exactly the same.

But I am not sure whether fixing the random seed is the proper approach, so I want to check how other analyses handle this issue. It should be faced by all TMVA users.

Many thanks for the help!
Hai

Hi Hai,

Across different versions of ROOT we cannot guarantee exactly the same final output, because a number of things touching the randomly generated numbers may have changed (but we do try our best).

The problem with having very large weights for some events is that, as you identified, they skew the distribution, especially in classes with a low number of events and when randomness is used in the BDT training (such as subsampling).

It is always good in these situations to record the configuration of your training (in this case the ROOT version and the input seed, plus of course the input files, etc.) so that it can be accurately reproduced later.
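For instance, a minimal sketch of what could be printed (or written to a log file) at the start of the training macro; inputFileName here is a placeholder for whatever file you actually load:

// Sketch: log the configuration needed to reproduce the training later.
std::cout << "ROOT version: " << gROOT->GetVersion() << std::endl;  // e.g. "6.14/04"
std::cout << "SplitSeed:    " << 100 << std::endl;                  // the seed passed in the option string
std::cout << "Input file:   " << inputFileName << std::endl;        // placeholder for your input file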

Cheers,
Kim

Dear Kialbert,

Got it! We will fix the seed and the version every time.

Many thanks!
Hai
