Hi all,
Unfortunately, I cannot get the GPU implementation of the TMVA DNN running.
We have an NVIDIA RTX 2080 Super in a machine running CentOS 7. I replaced nouveau with the latest NVIDIA driver (440.33.01) and installed CUDA 10.2, as well as BLAS and TBB for the CPU implementation. CUDA itself seems to work, since the default NVIDIA_CUDA-10.2_Samples run fine, except those that require graphics output, which we have not enabled. Their errors appear to be OpenGL-related. Is OpenGL necessary for TMVA?
I then compiled ROOT 6.18.04 from source with the cmake flags
-Dimt=ON -Dcuda=ON -Dtmva-cpu=ON -Dtmva-gpu=ON -Dtmva=ON
as stated in the documentation.
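For anyone trying to reproduce this, the flags above can be cross-checked against what the installed ROOT reports. The following is only a minimal sketch: it assumes that `root-config --features` prints a space-separated feature list and that the feature names match the cmake option names (`imt`, `cuda`, `tmva`, `tmva-cpu`, `tmva-gpu`). The helper names `missing_features` and `query_root_features` are hypothetical, not part of ROOT.

```python
import shutil
import subprocess

# Assumption: feature names reported by root-config match the cmake option names.
REQUIRED = ["imt", "cuda", "tmva", "tmva-cpu", "tmva-gpu"]


def missing_features(features_output, required=REQUIRED):
    """Return the required features that are absent from a space-separated feature string."""
    enabled = set(features_output.split())
    return [name for name in required if name not in enabled]


def query_root_features():
    """Return the output of `root-config --features`, or None if root-config is not on PATH."""
    if shutil.which("root-config") is None:
        return None
    result = subprocess.run(
        ["root-config", "--features"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = query_root_features()
    if output is None:
        print("root-config not found on PATH")
    else:
        missing = missing_features(output)
        print("missing features:", ", ".join(missing) if missing else "none")
```

If any of the five names is reported as missing, the build most likely did not pick up the corresponding flag, which would be worth ruling out before looking at the driver or CUDA installation.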
However, when I run TMVAClassification.C, the output just stops after:
Factory : Train method: DNN_GPU for Classification
:
TFHandler_DNN_GPU : Variable Mean RMS [ Min Max ]
: -----------------------------------------------------------
: myvar1: -0.0014053 0.31630 [ -1.0000 1.0000 ]
: myvar2: 0.15237 0.30014 [ -1.0000 1.0000 ]
: var3: 0.043963 0.35343 [ -1.0000 1.0000 ]
: var4: 0.043918 0.29454 [ -1.0000 1.0000 ]
: -----------------------------------------------------------
: Start of deep neural network training on GPU.
:
TFHandler_DNN_GPU : Variable Mean RMS [ Min Max ]
: -----------------------------------------------------------
: myvar1: -0.0014053 0.31630 [ -1.0000 1.0000 ]
: myvar2: 0.15237 0.30014 [ -1.0000 1.0000 ]
: var3: 0.043963 0.35343 [ -1.0000 1.0000 ]
: var4: 0.043918 0.29454 [ -1.0000 1.0000 ]
: -----------------------------------------------------------
I can see that the process reserves 115 MiB on the GPU, but then nothing happens for hours:
# nvidia-smi
Fri Dec 13 14:59:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 33%   35C    P8    11W / 250W |    126MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     28151      C   /opt/root_v6_18/bin/root.exe                 115MiB |
+-----------------------------------------------------------------------------+
DNN_CPU runs fine; only the GPU implementation shows this problem. Did I miss anything in the installation? Unfortunately, I cannot find any errors in /var/log/ that hint at the cause.
Cheers,
Falk