Problems with running TMVA MLP on different machines

ChrLau · September 23, 2021, 3:06am

Hi !

I am trying to do regression with MLP and found a small inconsistency in the training output between running the code interactively and submitting it as batch jobs.

I initially had event sampling turned on. The training outputs from the different running environments show noticeable fluctuations, although these average out overall and are largely negligible.
After turning event sampling off, the fluctuations are greatly reduced, but the outputs are still not identical and show very small differences. For instance, one of the weight values is -0.350179896 for batch jobs and -0.350154189 for interactive mode, where the difference starts at the fifth decimal place. All these results can be reproduced on the same machine. I have tested other algorithms such as LD and KNN and found no such problem in any of them. So it looks like this happens only in MLP, and it can be summarized in two points:

event sampling makes different choices of events on different machines
with event sampling OFF, the training outputs are slightly different on different machines
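
To put a number on the second point, here is a minimal sketch of a relative-tolerance comparison between two weight values. The helper names relativeDifference and agreeWithin are hypothetical and not part of TMVA. For the two weights quoted above, the relative difference is roughly 7e-5.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not a TMVA function): relative difference between
// two values, scaled by the larger of the two magnitudes.
double relativeDifference(double a, double b) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    if (scale == 0.0) return 0.0;  // both values are exactly zero
    return std::fabs(a - b) / scale;
}

// Hypothetical helper: true if a and b agree within the relative tolerance.
bool agreeWithin(double a, double b, double relTol) {
    return relativeDifference(a, b) <= relTol;
}
```

Calling agreeWithin(-0.350179896, -0.350154189, 1e-4) returns true, while the same call with a tolerance of 1e-5 returns false, which matches the observation that the two values part ways at the fifth decimal place.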

Here is the setup I am using:

// event sampling ON
   if (Use["MLP"])
      factory->BookMethod( TMVA::Types::kMLP, "MLP", "!H:!V:VarTransform=Norm:NeuronType=tanh:NCycles=10000:HiddenLayers=N+100:EstimatorType=MSE:TestRate=10:LearningRate=0.02:NeuronInputType=sum:DecayRate=0.6:TrainingMethod=BFGS:Sampling=0.1:SamplingEpoch=0.8:ConvergenceImprove=1e-6:ConvergenceTests=15:!UseRegulator");

// event sampling OFF
   if (Use["MLP"])
        factory->BookMethod( TMVA::Types::kMLP, "MLP", "!H:!V:VarTransform=Norm:NeuronType=tanh:NCycles=10000:HiddenLayers=N+100:EstimatorType=MSE:TestRate=10:LearningRate=0.02:NeuronInputType=sum:DecayRate=0.6:TrainingMethod=BFGS:SamplingTraining=kFALSE:SamplingTesting=kFALSE:ConvergenceImprove=1e-6:ConvergenceTests=15:!UseRegulator");

I wonder if this is related to float precision or something?

Thanks

moneta · September 24, 2021, 12:59pm

Hi,
To be clear, are you getting the difference by running your training macro from the ROOT prompt using Cling, or by compiling it (e.g. using root-config) as a stand-alone application using ROOT?
This is weird and it would be nice to have the code reproducing this problem.

Note that you need to be sure you start from the same clean state. For example, random number generators are used internally, and if in the interactive case you are doing something else beforehand, you might get a different sequence of random numbers.
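
The point about the generator state can be illustrated with a small sketch. It uses the standard library's std::mt19937 purely as a stand-in for whatever generator the training uses internally, and the function name drawSequence is hypothetical. The same seed gives the same sequence only if nothing has consumed random numbers beforehand.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical illustration: draw n values from a generator seeded with
// `seed`, after first discarding `consumedBefore` draws. The discarded draws
// stand in for any other code that used the generator earlier in the session.
std::vector<std::uint32_t> drawSequence(std::uint32_t seed,
                                        unsigned consumedBefore,
                                        unsigned n) {
    std::mt19937 gen(seed);
    gen.discard(consumedBefore);  // simulate earlier use of the generator
    std::vector<std::uint32_t> out;
    out.reserve(n);
    for (unsigned i = 0; i < n; ++i) {
        out.push_back(static_cast<std::uint32_t>(gen()));
    }
    return out;
}
```

Two calls to drawSequence(42, 0, 5) return identical vectors, whereas drawSequence(42, 3, 5) returns a shifted sequence even though the seed is the same. If the two environments differ in how many random numbers are drawn before training starts, this alone could change which events get sampled.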

Lorenzo

ChrLau · September 24, 2021, 2:00pm

Hi Lorenzo,

The code is compiled as an executable using the regular ROOT compile configuration (root-config). It reads some inputs for making cuts. The problem was found when running the executable both interactively and through batch jobs.
I have attached the code that can reproduce this problem.

Thanks,
HZ
TMVARegression_example.cxx (9.2 KB)

moneta · September 24, 2021, 8:11pm

Thank you for the code. Can you also share the input files so I can try to run it?
Thanks

Lorenzo

ChrLau · September 27, 2021, 5:40pm

I tried to upload the input files, but they seem to exceed the file size limit.
I did some additional studies and found that the problem disappears when the number of events is lower, so it seems to happen only for large event samples (I have ~180k events in the sample).

moneta · September 28, 2021, 7:58am

Hi,
You don't need to upload the files, you can just share a link (e.g. CERNBox).
If the issue happens only with a large dataset, it is probably an indication of a numerical precision problem.
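
As a sketch of what a numerical precision problem can look like, the example below sums the same values in two different orders using single-precision floats. Floating-point addition is not associative, so the order in which per-event contributions are accumulated can change the result, and the effect tends to grow with the number of terms. The function names and the sample values are made up for illustration. Whether this is what actually happens inside the MLP training is not established here.

```cpp
#include <vector>

// Sum the values from first to last in single precision.
float sumForward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float x : values) sum += x;
    return sum;
}

// Sum the same values from last to first in single precision.
float sumBackward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (auto it = values.rbegin(); it != values.rend(); ++it) sum += *it;
    return sum;
}

// Hypothetical sample: one large value followed by nSmall values of 1.0f.
std::vector<float> makeSample(unsigned nSmall) {
    std::vector<float> values;
    values.push_back(1.0e8f);
    values.insert(values.end(), nSmall, 1.0f);
    return values;
}
```

With makeSample(16) and standard IEEE 754 single precision (no fast-math style compiler options), the forward sum loses every small term because 1.0f is below the spacing of floats near 1e8, while the backward sum adds the small terms together first and keeps them. Different machines or builds can end up with different accumulation orders, for example through different vectorization, which is one way such small discrepancies can arise.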

Lorenzo

ChrLau · September 29, 2021, 1:39am

Hi Lorenzo,

I have uploaded two files to Google Drive.
https://drive.google.com/drive/folders/1rl8H0Sb10QNSIVtYc7Z2mTXmocaJQq51?usp=sharing

The one tagged with “2X” has more statistics and can reproduce the problem. The one tagged with “1X” has half of the statistics and is free from the problem.
To run the code, I would recommend executing "TMVA_example MLP 0 1 1 pathtoinputfile outputdir", as the configuration “0 1 1” selects the smallest number of events for training and therefore runs faster.

Thanks,
HZ