Hello ROOT team and CERN community. Now that 6.24 is officially out and there are release candidates of the LCG stack that include it, I’d like to start testing this with my framework.
Is there a ~minimal example (not making use of SWAN) of using distributed RDataFrame: setting up and connecting to one of the Spark clusters at CERN (analytix or cloudcontainers), loading C++ .h/.cc files, and running over some inputs to produce histograms?
Hello @nmangane,
We have a simple tutorial available here, and I can point you to the RDataFrame guide: you can find an introduction to the topic in the “Distributed execution in Python” section.
For the specifics of connecting to the clusters at CERN, you can refer to the SWAN docs.
Hope this helps!
Cheers,
Vincenzo
Thanks, the tutorial is good. Is there a special procedure for including C++ code from headers/source files, given that the code is eventually executed in a distributed manner?
I explicitly cannot use SWAN (it’s not yet suitable for a production analysis workflow, as nice as it can be for experimentation; I’m looking forward to the JupyterLab interface and other improvements that are coming).
Keeping in mind that the aim of this package is to let you distribute your (Python) RDataFrame application: just as in the traditional RDataFrame API, you can pass valid C++ code as string arguments to Filter or Define, including C++ functions that you declared in some header. This use case is supported, but keep in mind that the interface will definitely change. That said, here’s a quick example of what you could do.
Create a C++ header, for example myheader.hpp:
#ifndef myheader
#define myheader

// Checks whether the input number is less than 5
bool check_number_less_than_5(int num) {
    return num < 5;
}

#endif // myheader
Then you can distribute it to the Spark workers directly in your distributed RDataFrame script, for example:
import ROOT
SparkRDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
# Create a distributed RDataFrame that will run on Spark
df = SparkRDataFrame(100)
# Signal header files needed in the application
# They will be sent to the Spark executors later
# NOTE THAT THIS INTERFACE IS SUBJECT TO CHANGE
df._headnode.backend.distribute_headers("myheader.hpp")
# Proceed with your analysis
# Functions written in myheader.hpp will be available in ROOT
# This keeps only the entries for which rdfentry_ is less than 5
df_filtered = df.Filter("check_number_less_than_5(rdfentry_)")
histo = df_filtered.Histo1D(("distributeheader","distributeheader",10,0,100), "rdfentry_")
c = ROOT.TCanvas("", "", 600, 600)
histo.Draw()
c.Draw()
Again, the interface is very experimental, but support for this functionality is definitely planned and intended. If you find any issues, please let me know; it would be deeply appreciated.
Cheers,
Vincenzo