Evaluating MVA within Distributed RDataFrame

Hello experts,

I would like to evaluate a BDT output within the Distributed RDF. I have searched for relevant information but haven’t found any examples. In my case, the BDT model from TMVA is stored in an XML file.

Below is a toy example. It works with a regular (local) RDataFrame but crashes when used with the distributed RDataFrame.

Could you please advise on how to fix it?

Best,
Jindrich

import ROOT
from distributed import Client
from dask_jobqueue import SLURMCluster

def create_remote_connection():

    python = "singularity exec /cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/el9:x86_64 python3"
    cluster = SLURMCluster(
        job_name="test",
        cores=1,
        memory='2GB',
        python=python
    )

    cluster.scale(2)
    cluster.adapt(minimum=0, maximum=2)
    client = Client(cluster, heartbeat_interval='5s', timeout='60s')

    print(cluster.job_script())

    return client

if __name__ == "__main__":

    client = create_remote_connection()
    print(client)


    files = ['tmva_example.root']
    NPARTITIONS = 2

    ## Distributed RDF
    RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
    df = RDataFrame("Events", files, daskclient=client, npartitions=NPARTITIONS)
    
    #df = ROOT.RDataFrame("Events",files)

    model = ROOT.TMVA.Experimental.RReader("TMVAClassification_BDT.weights.xml")
    variables = model.GetVariableNames()
    df = df.Define('BDT_score',ROOT.TMVA.Experimental.Compute[4, "float"](model),list(variables))
    h = df.Histo1D(("BDT_score", "BDT Score", 100, -1, 1), "BDT_score")

    h1 = df.Histo1D(("var1", "var1", 100, 0, 1), "var1")
    h2 = df.Histo1D(("var2", "var2", 100, 0, 1), "var2")
    
    # Save the results
    file = ROOT.TFile("test.root", "RECREATE")
    h.Write()
    h1.Write()
    h2.Write()
    file.Close()

The error with the distributed RDataFrame is as follows:

TBufferFile::WriteObjectAny:0: RuntimeWarning: since TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3>,float,TMVA::Experimental::RReader&> has no public constructor
which can be called without argument, objects of this class
can not be read with the current library. You will need to
add a default constructor before attempting to read it.
TStreamerInfo::Build:0: RuntimeWarning: TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3>,float,TMVA::Experimental::RReader&>: TMVA::Experimental::RReader& has no streamer or dictionary, data member “fFunc” will not be saved
*** Break *** segmentation violation


ROOT Version: From tags/6-34-02@6-34-02
Platform: Built for linuxx8664gcc on Jan 31 2025, 14:36:25
Compiler: g++ (GCC) 13.1.0


Maybe @moneta can help you?

Dear @Jindrich ,

Thank you for reaching out to the forum, and apologies for the late reply!

I am sorry that your example did not work out of the box. It fails in the distributed case (while working locally) because the RReader class is not serializable. The model object that you pass to the Compute function used in Define has to be serialized in order to be sent to the workers. The same holds for any object passed to the API of a distributed execution tool, such as distributed RDataFrame. Unfortunately, the RReader class is currently not serializable, as hinted by the warnings you report. We need to update its functionality, and we’ll get back to you once that’s done.
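To illustrate the serialization point in isolation, here is a minimal pure-Python sketch. It is only an analogy: it uses the standard pickle module rather than the ROOT I/O machinery that produced the TBufferFile warnings above, and FakeReader, can_be_shipped, get_reader and score_on_worker are hypothetical names invented for this illustration, not ROOT or TMVA API. It shows that an object owning unserializable state cannot be shipped, whereas a plain file path can, with the reader then built on the receiving side.

```python
import pickle
import threading


class FakeReader:
    """Hypothetical stand-in for a model reader that owns unserializable state."""

    def __init__(self, path):
        self.path = path
        # A lock cannot be pickled; it plays the role of native, non-serializable state.
        self._native_state = threading.Lock()

    def score(self, x):
        # Dummy "model": just doubles the input.
        return 2.0 * x


def can_be_shipped(obj):
    """Return True if obj survives the pickling step needed to send it elsewhere."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Per-process cache so that each process builds the reader only once per path.
_READER_CACHE = {}


def get_reader(path):
    """Build the reader locally on first use and reuse it afterwards."""
    if path not in _READER_CACHE:
        _READER_CACHE[path] = FakeReader(path)
    return _READER_CACHE[path]


def score_on_worker(payload):
    """What a receiving process would do: unpickle (path, x), build the reader, evaluate."""
    path, x = pickle.loads(payload)
    return get_reader(path).score(x)


if __name__ == "__main__":
    print(can_be_shipped(FakeReader("model.xml")))  # the live reader object
    print(can_be_shipped("model.xml"))              # only the path to the model file
    print(score_on_worker(pickle.dumps(("model.xml", 1.5))))
```

The design choice here is to send only lightweight, serializable data (the path) and to construct the heavy object where it is used. Whether and how this pattern can be applied to RReader in distributed RDataFrame is not something this sketch establishes.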

I have created a GitHub issue to keep track of this: “distributed RDataFrame cannot process models passed via RReader”, issue #18210 in root-project/root.