Hello Experts!
I am trying to use RDataFrame with Dask backend on the CERN lxplus machines. When doing a simple test, instantiating an RDataFrame from an RDatasetSpec object, I see the following when trying to use a Dask client:
Traceback (most recent call last):
File "/eos/home-g/gwmyers/analysis-storage/dihiggs/postprocessing/run/test.py", line 28, in <module>
df_2 = RDataFrame(rds, daskclient=client)
File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/Backends/Dask/__init__.py", line 23, in RDataFrame
return daskbackend.make_dataframe(*args, **kwargs)
File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/Backends/Dask/Backend.py", line 369, in make_dataframe
headnode = HeadNode.get_headnode(self, npartitions, *args)
File "/cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/lib/DistRDF/HeadNode.py", line 276, in get_headnode
raise RuntimeError(
RuntimeError: First argument <cppyy.gbl.ROOT.RDF.Experimental.RDatasetSpec object at 0xb07c850> of type <class cppyy.gbl.ROOT.RDF.Experimental.RDatasetSpec at 0x47c2170> is not recognised as a supported argument for distributed RDataFrame. Currently only TTree/Tchain based datasets or datasets created from a number of entries can be processed distributedly.
Below is a test script that instantiates two RDataFrame objects: the first using RDataFrame
constructor and the second using the Experimental.Distributed.Dask.RDataFrame
. The second case gives the error above.
import ROOT as R
from dask.distributed import Client
from dask_lxplus import CernCluster
# rdatasetspec setup:
meta = R.RDF.Experimental.RMetaData()
meta.Add("sample_name", "name")
rsample = R.RDF.Experimental.RSample("s", "AnalysisMiniTree", "test.root", meta)
rds = R.RDF.Experimental.RDatasetSpec()
rds.AddSample(rsample)
# a vanilla rdatafame:
RDataFrame = R.RDataFrame
df_1 = RDataFrame(rds)
print(df_1)
# fancy distributed rdataframe:
RDataFrame = R.RDF.Experimental.Distributed.Dask.RDataFrame
df_2 = RDataFrame(rds, daskclient=Client(CernCluster()))
print(df_2)
My simple takeaway is that the distributed backend features for RDataFrame cannot be used when also using RDatasetSpec at the moment? Is there any plan for supporting this in the future? Both are very nice features - it would be great to have them work in harmony .
Thanks,
Greg
ROOT Version: 6.30/02
Platform: linuxx8664gcc
Compiler: g++ (GCC) 11.3.0