TMVA DNN: different result CPU vs GPU

Dear ROOT team,

I am currently working on a TMVA regression training with a DNN. When running the DNN with Architecture=CPU, I get excellent performance (in terms of prediction quality). However, it takes several hours to train the network on a 6-core machine, so I have recompiled ROOT with -Dcuda=on for better run-time performance. I am using a GeForce GTX 1060 card, and indeed the training is super fast (it takes only 90 seconds). BUT: with Architecture=GPU, the result is much worse in terms of prediction quality (i.e. not good at all).

ROOT version is 6.14.06
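
For reference, the booking looks roughly like the sketch below; the variable names, network layout, and training options are simplified placeholders rather than my exact configuration. Only the Architecture token changes between the two runs:

```cpp
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Types.h"

// Simplified sketch of the training setup (placeholder names and options).
void train(TTree *tree)
{
   auto outFile = TFile::Open("TMVAReg.root", "RECREATE");
   TMVA::Factory factory("TMVARegression", outFile, "!V:AnalysisType=Regression");

   TMVA::DataLoader loader("dataset");
   loader.AddVariable("var1");   // placeholder input variables
   loader.AddVariable("var2");
   loader.AddTarget("target");   // placeholder regression target
   loader.AddRegressionTree(tree);
   loader.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents:!V");

   // Only the Architecture token differs between the CPU and GPU runs.
   factory.BookMethod(&loader, TMVA::Types::kDNN, "DNN",
      "!H:!V:ErrorStrategy=SUMOFSQUARES:VarTransform=N:"
      "Layout=TANH|64,TANH|64,LINEAR:"
      "TrainingStrategy=LearningRate=1e-3,BatchSize=100,ConvergenceSteps=100:"
      "Architecture=CPU");   // or Architecture=GPU

   factory.TrainAllMethods();
   factory.TestAllMethods();
   factory.EvaluateAllMethods();
   outFile->Close();
}
```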

You can already see the difference in the output. For CPU, everything looks fine:

: Start of neural network training on CPU.
: 
: Training phase 1 of 2:
:      Epoch |   Train Err.  Test  Err.     GFLOP/s Conv. Steps
: --------------------------------------------------------------
:          7 |   0.00844188  0.00835884     11.5233           0
:         14 |   0.00771477   0.0076277     11.4086           0
:         21 |   0.00719746  0.00711652     11.5553           0
:         28 |   0.00695399  0.00686811     11.6709           0
:         35 |   0.00677531   0.0067119     11.3855           0
:         42 |   0.00663051    0.006569     11.0283           0
[...]
:       1330 |   0.00461922  0.00482107     11.2672          77
:       1337 |   0.00460131  0.00480657      11.778          84
:       1344 |   0.00464281  0.00484956     11.8991          91
:       1351 |   0.00464017  0.00484474     11.5172          98
:       1358 |   0.00460107  0.00480589     11.3244         105

For GPU, we have negative values (?!) for Train Err. and Test Err., and it seems the number of convergence steps is already high at the beginning. Could it be that some sort of abs is missing somewhere? Also, instead of 1358 epochs for phase 1, there are only 168 when using the GPU:

: Training phase 1 of 2:
:      Epoch |   Train Err.  Test Err.     GFLOP/s Conv. Steps
: --------------------------------------------------------------
:          7 |   -0.0519101 -0.0513511     272.042           0
:         14 |   -0.0518897 -0.0512095     278.155           7
:         21 |   -0.0519059 -0.0509986     277.483          14
:         28 |   -0.0518894 -0.0508399     277.738          21
:         35 |   -0.0518891 -0.0503983     277.846          28
:         42 |   -0.0518917 -0.0513722     277.627           0
:         49 |   -0.0518792 -0.0513415     277.956           0
:         56 |   -0.0519018 -0.0513527     277.596           0
:         63 |   -0.0518969 -0.0518492     277.697           0
:         70 |   -0.0519031 -0.0506192     277.909           7
:         77 |   -0.0518938 -0.0512583     277.788          14
:         84 |   -0.0519034 -0.05101       277.565          21
:         91 |   -0.0518935 -0.0514375     277.888          28
:         98 |   -0.0518891 -0.0510831     277.701          35
:        105 |   -0.0518903 -0.0514737     277.602          42
:        112 |   -0.0518993 -0.0512396     277.284          49
:        119 |   -0.0519075 -0.0508297     276.154          56
:        126 |    -0.051904 -0.0514084     276.909          63
:        133 |   -0.0518718 -0.0509068     276.187          70
:        140 |   -0.0518997 -0.0512256     276.347          77
:        147 |   -0.0519123 -0.0517199     276.478          84
:        154 |   -0.0518909 -0.0515149     275.244          91
:        161 |   -0.0518925 -0.0511171     277.766          98
:        168 |   -0.0518953 -0.0516251     277.442         105

In case it is relevant: I have compiled ROOT/CUDA with g++ 7 (cxx14=on, python=3), while CUDA officially only supports g++ <= 6. I did so by removing the version check in the CUDA header. When compiling ROOT, a few “unused parameter” warnings appeared in some TMVA/DNN files.

Hi @behrenhoff,

Indeed it is suspicious that the outputs differ so much between the two implementations, and especially that the loss is negative in the GPU case. This suggests that event weights might not be correctly handled in the regression loss function for the GPU.
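
For reference, the regression loss should behave like a weighted mean squared error, which cannot become negative as long as the event weights are non-negative. Here is a minimal sketch of the expected behaviour (not the actual TMVA implementation) to illustrate why a negative value points at the weight handling:

```cpp
#include <cstddef>
#include <vector>

// Weighted mean squared error as one would expect it for regression:
// with non-negative weights the result can never be negative, so a
// negative "Train Err." suggests the weights (or a normalisation term)
// are being applied incorrectly on the GPU path.
double weightedMSE(const std::vector<double> &pred,
                   const std::vector<double> &truth,
                   const std::vector<double> &weight)
{
   double sum = 0.0, norm = 0.0;
   for (std::size_t i = 0; i < pred.size(); ++i) {
      const double d = pred[i] - truth[i];
      sum  += weight[i] * d * d;
      norm += weight[i];
   }
   return norm > 0.0 ? sum / norm : 0.0;
}
```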

Please see the newly created JIRA ticket, and thanks for reporting!

Thanks for looking at it without me providing a full reproducer (I cannot share the data, unfortunately). Meanwhile, I have upgraded to the most recent CUDA version, which gives around 285 GFLOP/s (so it is faster and supports gcc 7 without me patching it), but the result is still negative. I will follow the JIRA ticket. Please ping me if you need a reproducer (it would take me some time to create fake data I can share, certainly not before xmas). Note: I am not calling SetWeightExpression at all.
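
For completeness, attaching per-event weights would look like the sketch below (the branch name is made up); since I never call SetWeightExpression, every event should enter the training with weight 1:

```cpp
#include "TMVA/DataLoader.h"

// Hypothetical illustration only: I do NOT call this in my setup, so every
// event should enter the loss with weight 1. "evtWeight" is a made-up branch name.
void attachWeights(TMVA::DataLoader &loader)
{
   loader.SetWeightExpression("evtWeight");
}
```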

Ah, thanks for that clarification. Would you have time to check whether you can replicate the issue using the TMVARegression example?

Cheers,
Kim

Hi,

There might be some bugs present in version 6.14; 6.16 contains much larger developments in the DL module of TMVA. I would strongly recommend using MethodDL (method = kDL) with ROOT 6.16 or the master.
If you still have an issue there, please let us know.

Lorenzo

Thanks, I’ll give it a try on Monday.

Edit: the string “kDL” does not exist in the TMVA User’s Guide. Does it work exactly like DNN?

Hi,

You can see this notebook as an example for MethodDL:

https://github.com/lmoneta/tmva-tutorial/blob/master/tutorial_Desy/TMVA_Higgs_Classification.ipynb
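
For a regression case, booking MethodDL would look roughly like the following; the layout and training strings are only an illustration, please check the tutorials shipped with ROOT 6.16 for the exact syntax:

```cpp
#include "TMVA/Factory.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Types.h"

// Rough sketch of booking the new MethodDL for regression (option strings
// are illustrative only; see the ROOT 6.16 tutorials for the exact syntax).
void bookDL(TMVA::Factory &factory, TMVA::DataLoader *loader)
{
   factory.BookMethod(loader, TMVA::Types::kDL, "DL_GPU",
      "!H:!V:ErrorStrategy=SUMOFSQUARES:VarTransform=N:"
      "Layout=DENSE|64|TANH,DENSE|64|TANH,DENSE|1|LINEAR:"
      "TrainingStrategy=LearningRate=1e-3,ConvergenceSteps=100,BatchSize=100,MaxEpochs=2000:"
      "Architecture=GPU");
}
```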

Lorenzo