TMVA GPU utilization

Hi all,

Unfortunately, I cannot get the GPU implementation of DNN running.

We have an NVIDIA RTX 2080 Super in our machine running CentOS 7. I installed the latest NVIDIA driver 440.33.01 (replacing nouveau) and CUDA 10.2, as well as BLAS and TBB for the CPU implementation. CUDA itself seems to work, since the default NVIDIA_CUDA-10.2_Samples run fine, except for those that require graphics output, which we have not enabled; their errors are apparently OpenGL-related. Is OpenGL necessary for TMVA?

I then compiled ROOT 6.18.04 from source and included the CMake flags
-Dimt=ON -Dcuda=ON -Dtmva-cpu=ON -Dtmva-gpu=ON -Dtmva=ON
as stated in the documentation.
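
For reference, the full configure and build steps looked roughly like this (source and build paths are placeholders):

cd /path/to/root-build
cmake -Dimt=ON -Dcuda=ON -Dtmva=ON -Dtmva-cpu=ON -Dtmva-gpu=ON /path/to/root-6.18.04
cmake --build . -- -j8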

But when I run TMVAClassification.C, it just hangs after printing:

Factory                  : Train method: DNN_GPU for Classification
                         : 
TFHandler_DNN_GPU        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :   myvar1: -0.0014053    0.31630   [    -1.0000     1.0000 ]
                         :   myvar2:    0.15237    0.30014   [    -1.0000     1.0000 ]
                         :     var3:   0.043963    0.35343   [    -1.0000     1.0000 ]
                         :     var4:   0.043918    0.29454   [    -1.0000     1.0000 ]
                         : -----------------------------------------------------------
                         : Start of deep neural network training on GPU.
                         : 
TFHandler_DNN_GPU        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :   myvar1: -0.0014053    0.31630   [    -1.0000     1.0000 ]
                         :   myvar2:    0.15237    0.30014   [    -1.0000     1.0000 ]
                         :     var3:   0.043963    0.35343   [    -1.0000     1.0000 ]
                         :     var4:   0.043918    0.29454   [    -1.0000     1.0000 ]
                         : -----------------------------------------------------------

I can see that it does reserve 115 MiB on the GPU, but then nothing happens for hours:

# nvidia-smi
Fri Dec 13 14:59:59 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 33%   35C    P8    11W / 250W |    126MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     28151      C   /opt/root_v6_18/bin/root.exe                 115MiB |
+-----------------------------------------------------------------------------+

DNN_CPU, for example, runs fine; it is only the GPU implementation that causes this problem. Is there anything I missed in the installation? Unfortunately, I cannot find any errors in /var/log/ that hint at the problem.

Cheers,
Falk

Hi,

I would need to see your macro to understand the issue. Can you please try switching to MethodDL instead of MethodDNN when booking the neural network in TMVA::Factory::BookMethod?

MethodDL is the new implementation, which also supports convolutional and recurrent layers; it is recommended for fully connected layers as well. Booking it looks roughly like the sketch below.
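
A minimal sketch, assuming a TMVA::Factory* factory and a TMVA::DataLoader* dataloader already exist; the option string here is abbreviated, not a complete configuration:

// Book the new deep-learning method (MethodDL) via Types::kDL.
// "DNN" is only the user-chosen title for the method.
factory->BookMethod(dataloader, TMVA::Types::kDL, "DNN",
                    "!H:V:ErrorStrategy=CROSSENTROPY:VarTransform=N:"
                    "Layout=TANH|64,TANH|64,LINEAR:"
                    "TrainingStrategy=LearningRate=1e-2,Momentum=0.9,"
                    "ConvergenceSteps=10,BatchSize=256");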

Best regards

Lorenzo

Hi Lorenzo,

I am using the default $ROOTSYS/tutorials/tmva/TMVAClassification.C that comes with ROOT 6.18.04, which itself seems to include TMVA 4.2.1. The macro already uses TMVA::Types::kDL with the option Architecture=GPU; "DNN_GPU" is apparently just an arbitrary user-chosen name. See the sketch below.
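
For reference, the relevant booking in the tutorial looks roughly like this (paraphrased, not the exact tutorial code; dnnOptions stands for the option string built earlier in the macro):

// Paraphrased sketch: the same options are reused for both backends,
// only the Architecture option and the user-chosen title differ.
factory->BookMethod(dataloader, TMVA::Types::kDL, "DNN_GPU",
                    dnnOptions + ":Architecture=GPU");
factory->BookMethod(dataloader, TMVA::Types::kDL, "DNN_CPU",
                    dnnOptions + ":Architecture=CPU");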

Hope this helps,
Falk

Hi,

I cannot reproduce this with 6.18.04. There is probably something not right with your GPU card or your CUDA installation. When using TMVAClassification.C, the training on the GPU should be really fast; it takes less than a second on my GPU card (an RTX 2070).
You could verify CUDA by installing and compiling the CUDA sample programs. They are normally in /usr/local/cuda/samples; see https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html for their location. Normally you just need to type make in the sample directory (after copying it to a writeable directory), as in the example below.
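
For example, to build and run the deviceQuery sample (assuming the usual CUDA 10.x samples layout):

cp -r /usr/local/cuda/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/deviceQuery
make
./deviceQuery    # should report the RTX 2080 Super and the CUDA version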

Lorenzo