Using RDataFrame with Dask and RDatasetSpec

Hello Experts!

I am trying to use RDataFrame with the Dask backend on the CERN lxplus machines. In a simple test that instantiates an RDataFrame from an RDatasetSpec object, I get the following error when passing a Dask client:

Traceback (most recent call last):
  File "/eos/home-g/gwmyers/analysis-storage/dihiggs/postprocessing/run/test.py", line 28, in <module>
    df_2 = RDataFrame(rds, daskclient=client)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/Backends/Dask/__init__.py", line 23, in RDataFrame
    return daskbackend.make_dataframe(*args, **kwargs)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/Backends/Dask/Backend.py", line 369, in make_dataframe
    headnode = HeadNode.get_headnode(self, npartitions, *args)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/HeadNode.py", line 276, in get_headnode
    raise RuntimeError(
RuntimeError: First argument <cppyy.gbl.ROOT.RDF.Experimental.RDatasetSpec object at 0xb07c850> of type <class cppyy.gbl.ROOT.RDF.Experimental.RDatasetSpec at 0x47c2170> is not recognised as a supported argument for distributed RDataFrame. Currently only TTree/Tchain based datasets or datasets created from a number of entries can be processed distributedly.

Below is a test script that instantiates two RDataFrame objects: the first using the standard RDataFrame constructor and the second using ROOT.RDF.Experimental.Distributed.Dask.RDataFrame. The second case gives the error above.

import ROOT as R
from dask.distributed import Client
from dask_lxplus import CernCluster

# rdatasetspec setup:
meta = R.RDF.Experimental.RMetaData()
meta.Add("sample_name", "name")
rsample = R.RDF.Experimental.RSample("s", "AnalysisMiniTree", "test.root", meta)
rds = R.RDF.Experimental.RDatasetSpec()
rds.AddSample(rsample)

# a vanilla rdataframe:
RDataFrame = R.RDataFrame
df_1 = RDataFrame(rds)
print(df_1)

# fancy distributed rdataframe:
RDataFrame = R.RDF.Experimental.Distributed.Dask.RDataFrame
df_2 = RDataFrame(rds, daskclient=Client(CernCluster()))
print(df_2)

My takeaway is that the distributed RDataFrame backends cannot currently be used together with RDatasetSpec. Is that right, and is there any plan to support this in the future? Both are very nice features, and it would be great to have them work in harmony :grinning:.

Thanks,
Greg


ROOT Version: 6.30/02
Platform: linuxx8664gcc
Compiler: g++ (GCC) 11.3.0


Hi @gwmyers,

thanks for reaching out. It’s great to hear you’re enjoying our features!

Your guess is correct: RDatasetSpec is not yet supported in distributed RDataFrame. It is on our to-do list and we will get to it soon.
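
In the meantime, one possible workaround is to hand the distributed constructor a tree name and a list of files, which the error message above lists as a supported input. The snippet below is only a sketch under assumptions: the `samples` dictionaries and the helpers `tree_and_files` and `make_distributed_df` are hypothetical names made up for illustration, not part of ROOT, and the per-sample metadata that RDatasetSpec carries (such as `sample_name`) is not propagated this way.

```python
def tree_and_files(samples):
    """Collapse sample descriptions into a single (treename, files) pair.

    `samples` is a hypothetical list of dicts with the keys "tree" and
    "files"; it stands in for what would otherwise go into RSample objects.
    """
    treenames = {s["tree"] for s in samples}
    if len(treenames) != 1:
        # This sketch only handles the case where all samples share one tree name.
        raise ValueError(f"expected exactly one tree name, got {sorted(treenames)}")
    files = [f for s in samples for f in s["files"]]
    return treenames.pop(), files


def make_distributed_df(samples, client):
    """Build a Dask-distributed RDataFrame from a tree name and a file list."""
    import ROOT  # imported lazily so tree_and_files can be used without ROOT

    treename, files = tree_and_files(samples)
    DaskRDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
    return DaskRDataFrame(treename, files, daskclient=client)


# Example usage, mirroring the sample from the test script (requires ROOT and Dask):
#   from dask.distributed import Client
#   from dask_lxplus import CernCluster
#   samples = [{"tree": "AnalysisMiniTree", "files": ["test.root"]}]
#   df = make_distributed_df(samples, Client(CernCluster()))
```

If sample-level information is needed in the analysis, it would have to be reattached by other means in this approach, for example by processing each sample in its own dataframe.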

Cheers,
Marta

Hi @mczurylo ,

Thanks! I took a stab at implementing this (thanks @Jordy_Degens for pointing me to the code) and opened a PR. If you have any comments, advice, or suggestions, I can try to follow them up.

Cheers,
Greg

Hi @gwmyers,

that’s super nice, thanks a lot! We will take a look at your PR and will follow up directly on GH.

Have a nice day,
cheers,
Marta
