I tried to use a distributed ROOT RDataFrame in SWAN with a Spark cluster.
These are the steps I followed:
- Logging in to SWAN and selecting the analytix cluster.
- Once in the Jupyter notebook, clicking the star button to connect to the Spark cluster, with the "EOS system" option and the latest software stack selected.
- Running the code snippet below:
import ROOT

RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
df = RDataFrame("Events",
                "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root",
                npartitions=2,
                sparkcontext=sc)
df.Count().GetValue()
I get an error:
File "/cvmfs/sft.cern.ch/lcg/views/LCG_103swan/x86_64-centos7-gcc11-opt/lib/ROOT/_pythonization/_tmva/__init__.py", line 25, in <module>
hasRDF = gSystem.GetFromPipe("root-config --has-dataframe") == "yes"
ValueError: TString TSystem::GetFromPipe(const char* command) =>
ValueError: nullptr result where temporary expected
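As a side note, the check that fails in that pythonization can be reproduced outside ROOT with a plain subprocess call. This small diagnostic (the helper name `check_dataframe_support` is my own, not part of ROOT) shows whether `root-config` is actually reachable in the environment where the check runs; if the command cannot be executed, `GetFromPipe` has nothing to return, which would be consistent with the nullptr result:

```python
import shutil
import subprocess

def check_dataframe_support():
    """Mimic the check from _tmva/__init__.py: ask root-config whether
    this ROOT build has RDataFrame. Returns a short status string."""
    exe = shutil.which("root-config")
    if exe is None:
        # If root-config is not on PATH, the piped command produces no
        # output at all, which seems to be what triggers the error above.
        return "root-config not found on PATH"
    out = subprocess.run([exe, "--has-dataframe"],
                         capture_output=True, text=True)
    return "has-dataframe: " + out.stdout.strip()

print(check_dataframe_support())
```

This is only a sketch for debugging; the real check runs inside the Spark executors' environment, which may differ from the notebook's.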
I posted this incident on the CERN service portal and got a comment from an expert:
I found out that the notebook runs fine if you select, when you are about to start your SWAN session, the software stack called "101" (it is in the list of "Other releases" if you scroll down). That LCG release has ROOT 6.24, so it seems that the issue was introduced in newer ROOT releases.
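For anyone hitting the same error, a quick way to confirm which ROOT version the selected stack provides is to print it in the first notebook cell. This is a minimal sketch; `gROOT.GetVersion()` is standard ROOT, and the fallback just keeps the cell from crashing where ROOT is not importable:

```python
def report_root_version():
    """Return the ROOT version string of the current environment,
    or a note if ROOT cannot be imported at all."""
    try:
        import ROOT
        # For the LCG 101 stack this should report a 6.24 release.
        return "ROOT version: " + ROOT.gROOT.GetVersion()
    except ImportError:
        return "ROOT is not available in this environment"

print(report_root_version())
```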
I would like to shed light on this problem, which might be caused by a change in the latest ROOT release. I hope this information is enough.
Regards,
Nilima.