I ran a TMVA BDT training with the following configuration:
NTrees=800:MinNodeSize=2.5%:MaxDepth=4:BoostType=AdaBoost:AdaBoostBeta=0.5:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=20
I got different BDT response shapes from two operating systems.
Is this normal? And how can we estimate the resulting uncertainty?
Small differences are definitely possible, since there are many things that can differ between OSs: math function implementations in the standard C library, the rounding mode for floating-point numbers, the system compiler (GCC / Clang), default compilation flags (in particular optimization options), etc. However, since the background test sample looks different in the two plots, I think the difference could be due to something like a different random seed being used somewhere in the calculations, for example when selecting which events go into the test and training samples. Cheers,
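To make the seed point concrete, here is a minimal stdlib-Python sketch (not TMVA code; the function name and logic are mine) showing why a different seed alone changes which events end up in the test sample, and hence the shape of any histogram filled from it:

```python
import random

def split_events(n_events, seed, train_fraction=0.5):
    """Randomly split event indices into training and test samples.

    A toy stand-in for what a random splitter does: the composition
    of the test sample depends entirely on the seed.
    """
    rng = random.Random(seed)
    indices = list(range(n_events))
    rng.shuffle(indices)
    n_train = int(n_events * train_fraction)
    return set(indices[:n_train]), set(indices[n_train:])

# The same seed always reproduces the same split; two machines that
# seed differently will in general get different test samples.
train_a, test_a = split_events(1000, seed=100)
train_b, test_b = split_events(1000, seed=101)
print(test_a == split_events(1000, seed=100)[1])  # → True (reproducible)
```

If the two OSs end up with different effective seeds (or different RNG behavior), the test samples, and therefore the plotted response distributions, will differ even though the method itself is unchanged.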
As @amadio says, SplitMode=Random activates random splitting of the data. The seed of the splitting is controlled with the SplitSeed option (the default is SplitSeed=100).
To determine whether it’s the random splitting causing the difference I would ask you to rerun the training with SplitMode=Alternate or SplitMode=Block. These two modes are not recommended for trainings used in production, but should be good for determining the cause here.
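For intuition, here is a toy stdlib-Python sketch of the three splitting strategies. The mode names are borrowed from TMVA's SplitMode option, but the logic is my own guess at the idea, not TMVA's actual implementation:

```python
import random

def split_indices(n_events, mode, seed=100, train_fraction=0.5):
    """Toy illustration of three splitting strategies."""
    indices = list(range(n_events))
    n_train = int(n_events * train_fraction)
    if mode == "Random":
        # Seed-dependent: a different seed gives a different split.
        rng = random.Random(seed)
        rng.shuffle(indices)
        return indices[:n_train], indices[n_train:]
    if mode == "Alternate":
        # Deterministic: even-indexed events train, odd-indexed test.
        return indices[0::2], indices[1::2]
    if mode == "Block":
        # Deterministic: first block trains, the rest tests.
        return indices[:n_train], indices[n_train:]
    raise ValueError(f"unknown mode: {mode}")

print(split_indices(10, "Alternate"))  # → ([0, 2, 4, 6, 8], [1, 3, 5, 7, 9])
print(split_indices(10, "Block"))      # → ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
```

Because Alternate and Block are deterministic, they give the same split on any machine, which is exactly why they help isolate whether the seed-dependent random split is the source of the discrepancy.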
The BDT training is sensitive to the random seed, yes. In the application phase the results should be identical, barring subtle differences in the floating-point implementation etc.
To determine the uncertainty of the BDT response one can use cross validation. The idea is then to train the classifier repeatedly using (slightly) different training and test samples. This will give several reasonably independent distributions of the response with which one can derive some measures of uncertainty.
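A minimal stdlib-Python sketch of the idea (the fold helper and the toy "response" are mine; in practice the per-fold quantity would come from evaluating your trained classifier on that fold):

```python
import random
import statistics

def kfold_indices(n_events, k, seed=100):
    """Partition shuffled event indices into k cross-validation folds."""
    rng = random.Random(seed)
    indices = list(range(n_events))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fold_response(scores, fold):
    """Hypothetical stand-in for 'mean classifier response on a fold'."""
    return statistics.mean(scores[i] for i in fold)

# Toy per-event scores just to keep the sketch runnable end to end.
scores = [random.Random(i).random() for i in range(1000)]
folds = kfold_indices(len(scores), k=5)
responses = [fold_response(scores, f) for f in folds]

# The spread across folds is a simple uncertainty estimate.
print(statistics.mean(responses), statistics.stdev(responses))
```

The standard deviation across the k fold responses quantifies how much the response distribution moves when the training/test composition changes, which is the uncertainty being asked about.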
We tested the split modes as you suggested; see the attached file. Left is the Linux system, right is the OS X system.
Only the Random mode shows a difference. Could this be because the number of events is not large enough, so the statistical fluctuation is high?
I’m not sure I exactly understand the question, but variance across trainings on different machines is basically not a major issue. As long as the application phase produces identical values across different machines, that is enough (train once, and then you should be able to trust that output).
If I understand your question correctly, then yes the differences should go away as your data sample size is increased.
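A quick stdlib-Python sketch of why this happens (toy Gaussian data, my own helper, not TMVA): the difference between two random halves of a sample shrinks roughly like 1/√N, so split-dependent fluctuations fade as the sample grows.

```python
import random
import statistics

def half_split_difference(n_events, seed):
    """Absolute difference of means between two random halves of a
    toy Gaussian sample; mimics how much the training/test
    composition can fluctuate for a given sample size."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
    rng.shuffle(sample)
    half = n_events // 2
    return abs(statistics.mean(sample[:half]) - statistics.mean(sample[half:]))

# Average the fluctuation over many seeds for a small and a large sample.
d_small = statistics.mean(half_split_difference(100, s) for s in range(50))
d_large = statistics.mean(half_split_difference(10000, s) for s in range(50))
print(d_small, d_large)  # the large-sample fluctuation is much smaller
```

With 100× more events the typical half-vs-half difference drops by about a factor of 10, matching the 1/√N expectation.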
Maybe @moneta wants to add something to this discussion.