Problem with CUDA acceleration for RooAbsPdf.fitTo

Dear Roofit experts,

I am trying to use the new CUDA acceleration in my RooFit code, but it fails with an error. The ROOT version is 6.32.02, and here is my small reproducer:

import ROOT
from ROOT import RooFit, RooRealVar, RooGaussian

def fit_with_cuda():
    # Define observable
    x = RooRealVar("x", "x", -10, 10)

    # Define parameters
    mean = RooRealVar("mean", "mean", 0, -10, 10)
    sigma = RooRealVar("sigma", "sigma", 1, 0.1, 10)

    # Define Gaussian PDF
    gauss = RooGaussian("gauss", "gaussian PDF", x, mean, sigma)

    # Generate a toy dataset
    data = gauss.generate(ROOT.RooArgSet(x), 100)
    data.Print("v")
    
    # Perform the fit
    result = gauss.fitTo(data, RooFit.Save(), EvalBackend="cuda")
    # Print results
    result.Print()

# Run the fit function
fit_with_cuda()

Running this code produces the following error:

[#1] INFO:Fitting -- RooAbsPdf::fitTo(gauss_over_gauss_Int[x]) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- using CPU computation library compiled with -mavx2
[#1] INFO:Fitting -- using CUDA computation library
Traceback (most recent call last):
  File "/work/users/yun79/valerie/fork/copperheadV2/quick_tests/quic_cuda_test.py", line 35, in <module>
    fit_with_cuda()
  File "/work/users/yun79/valerie/fork/copperheadV2/quick_tests/quic_cuda_test.py", line 29, in fit_with_cuda
    result = gauss.fitTo(data, RooFit.Save(), EvalBackend="cuda")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/depot/cms/kernels/root632/lib/python3.12/site-packages/ROOT/_pythonization/_roofit/_rooabspdf.py", line 62, in fitTo
    return self._fitTo["RooLinkedList const&"](args[0], _pack_cmd_args(*args[1:], **kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cppyy.gbl.std.runtime_error: Could not find "fitTo<RooLinkedList const&>" (set cppyy.set_debug() for C++ errors):
  RooFitResult* RooAbsPdf::fitTo(RooAbsData& data, const RooLinkedList& cmdArgs) =>
    runtime_error: copyHostToDeviceImpl(), /depot/cms/purdue-af/roofit-batchcompute/src/CudaInterface.cu:160 : invalid argument

I took a quick look at line 160 of CudaInterface.cu, which is ERRCHECK(cudaMemcpy(dest, src, nBytes, cudaMemcpyHostToDevice));, implying that copying the data from system memory to GPU memory is failing. I would appreciate any help on this matter.

Thank you in advance!

Hyeon-Seo

Dear Hyeon-Seo,

Thanks for the clear post and welcome to the ROOT Community!
I am sorry to read ROOT did not work out of the box for you in this case.
Could you share with us which ROOT binaries you are using (e.g. built from sources on your machine, cvmfs, conda, or another binary distribution)?
I also put in the loop @jonas , our RooFit expert.

Cheers,
D

Dear @Danilo ,

Thank you for your quick response. I am using the ananda environment from the Purdue Analysis Facility. @kondratyevd is in charge of this, so I believe he can better answer your question about the ROOT binaries.

Best,

Hyeon-seo

@Danilo the setup is ROOT 6.32 installed via Conda, with the BatchCompute library added on top, as we have been discussing in this thread.
Hyeon-Seo is trying to test this setup, so his question is about how to use the EvalBackend argument correctly.

Hi! Yes, the code is correct, and the crash is unexpected. Something seems wrong with how the GPU is used.

Maybe one question first, because my first suspicion is related to it. In the other thread, I recommended using set(CMAKE_CUDA_ARCHITECTURES "native" CACHE STRING "" FORCE). With that setting, however, the code is compiled specifically for the CUDA architecture of the GPU on the host that compiles it. See also the CMake docs on CUDA_ARCHITECTURES.

Maybe the CUDA architecture on the machine you’re running the code on doesn’t match? Which GPU are you using on the machines that build and run the code, if they are not the same?
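
If it helps, here is a quick way to compare the two machines (a sketch, assuming your driver is recent enough to support the compute_cap query field of nvidia-smi):

import subprocess

# Run this on both the build machine and the machine running the fit;
# the reported compute capability should match the architecture the
# CUDA library was compiled for.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "NVIDIA A100-PCIE-40GB, 8.0"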

Thanks a lot for trying out the RooFit CUDA feature in such an interesting environment!

Cheers,
Jonas

Hi @jonas, the GPU model and CUDA version are the same both during the building of the library and in Hyeon-Seo’s tests. We have also confirmed that $LD_LIBRARY_PATH in his case contains the correct paths to the CUDA libraries and to the newly built BatchCompute library.

Here is nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:81:00.0 Off |                   On |
| N/A   47C    P0              74W / 250W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

Thanks for the info!

Sorry, I failed to reproduce the problem then :frowning:

I just tried it also on lxplus-gpu at CERN (where ROOT 6.32.02 is pre-installed). My instructions worked out of the box, modulo adjusting some compiler paths:

cmake_minimum_required(VERSION 3.14)

# Adapt to your system
set(CMAKE_CUDA_ARCHITECTURES "native" CACHE STRING "" FORCE)
set(CMAKE_CUDA_HOST_COMPILER /usr/bin/g++-13 CACHE STRING "" FORCE)
set(CMAKE_CUDA_COMPILER "/opt/cuda/bin/nvcc" CACHE STRING "" FORCE)

project(batchcompute-cuda LANGUAGES CUDA)

find_package(ROOT REQUIRED)
include(${ROOT_USE_FILE})

# in the src directory, put all files from roofit/batchcompute/src and roofit/batchcompute/res
add_library(RooBatchCompute_CUDA SHARED src/RooBatchCompute.cu src/ComputeFunctions.cu src/CudaInterface.cu)
target_include_directories(RooBatchCompute_CUDA PRIVATE src res)

target_compile_options(RooBatchCompute_CUDA PRIVATE -lineinfo --expt-relaxed-constexpr)

And then running the Python script from @green-cabbage works just fine.

So it’s unfortunately hard to make a diagnosis from my side. Would it be possible for you to do the same with a debug build of ROOT instead of the conda build, so we can properly debug this?

Thank you for checking; it is encouraging to know that the code itself works. I tried to dig deeper into this, and here are some debugging results:

  1. The following doesn’t work out of the box, due to a missing header path:
print(ROOT.RooBatchCompute.initCUDA())
> AttributeError: <namespace cppyy.gbl.RooBatchCompute at 0x55bcc79eb020> has no attribute 'initCUDA'. 
  2. ROOT.gInterpreter.AddIncludePath didn’t help, but I managed to include the relevant header explicitly. Now initCUDA returns “0”:
ROOT.gInterpreter.ProcessLine('#include "/depot/cms/purdue-af/roofit-batchcompute/src/RooBatchCompute.h"')
print(ROOT.RooBatchCompute.initCUDA())
> 0
  3. Repeated loading of the library yields different return values:
print(ROOT.gSystem.Load("libRooBatchCompute_CUDA"))
print(ROOT.gSystem.Load("libRooBatchCompute_CUDA"))
> 0
> 1
  4. So, by loading this library before Hyeon-Seo’s code (see the sketch after this list), I can trick initCUDA into returning “1”, which triggers the error "Can't create Evaluator in CUDA mode because RooBatchCompute CUDA could not be loaded!" in Evaluator::Evaluator. This indirectly tells us that useGPU=True and that, before my hacks, initCUDA returned “0”, which means there were no immediate issues with loading our newly built library and initializing the CUDA backend.
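
For reference, a minimal sketch of the “trick” from item 4 (nothing here beyond what is described above):

import ROOT

# Load the CUDA library manually before running the fit: the first Load()
# returns 0, any later Load() of the same library returns 1.
ROOT.gSystem.Load("libRooBatchCompute_CUDA")

# Running Hyeon-Seo's fit with EvalBackend="cuda" after this makes initCUDA
# report "1", and the Evaluator then aborts with:
# "Can't create Evaluator in CUDA mode because RooBatchCompute CUDA could not be loaded!"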

However, so far I haven’t gotten any closer to finding the source of the original error:

  RooFitResult* RooAbsPdf::fitTo(RooAbsData& data, const RooLinkedList& cmdArgs) =>
    runtime_error: copyHostToDeviceImpl(), /depot/cms/purdue-af/roofit-batchcompute/src/CudaInterface.cu:160 : invalid argument

More debugging, now addressing the invalid argument error:

  • The error originates in this line in RooBatchCompute.cu.
  • input.data() points to 0x64 and input.size() is 93930851507920, so the input is clearly invalid. input.data() then gets passed as src to copyHostToDeviceImpl() in CudaInterface.cu, triggering the error.
  • Tracking it down to the RooFit Evaluator, the error most likely appears in this line, where cpuSpan is for some reason invalid.
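
For completeness, this is the CPU-backend cross-check I have in mind (a sketch only, same toy model as the reproducer at the top of the thread): if this fit succeeds while EvalBackend="cuda" fails, the problem is isolated to the host-to-device transfer in the CUDA backend.

import ROOT

# Same toy model as in the original reproducer, fitted with the CPU backend.
x = ROOT.RooRealVar("x", "x", -10, 10)
mean = ROOT.RooRealVar("mean", "mean", 0, -10, 10)
sigma = ROOT.RooRealVar("sigma", "sigma", 1, 0.1, 10)
gauss = ROOT.RooGaussian("gauss", "gaussian PDF", x, mean, sigma)
data = gauss.generate(ROOT.RooArgSet(x), 100)

result_cpu = gauss.fitTo(data, ROOT.RooFit.Save(), EvalBackend="cpu")
result_cpu.Print()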