Using ROOT.RDF.Experimental.Distributed with Spark

Hello ROOT team and CERN community. Now that 6.24 is officially out and there are some release candidates of the LCG stack that include it, I’d like to start testing this for my framework.

Is there a minimal example (not making use of SWAN) of using the distributed RDF: setting up and connecting to one of the Spark clusters at CERN (analytix or cloudcontainers), loading C++ .h/.cc files, and running over some inputs to produce histograms?

Cheers,
Nick

ROOT Version: 6.24/00
Platform: LCG_100rc2
Compiler: gcc8-opt


Hello @nmangane,
We have a simple tutorial available here, and I can also point you to the RDataFrame guide: you can find an introduction to the topic in the “Distributed execution in Python” section.

For the specifics of connecting to the clusters at CERN, you can refer to the SWAN docs.
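
To give a first idea of how the pieces fit together outside SWAN, here is a minimal sketch: you create a SparkContext configured for the cluster and hand it to the distributed RDataFrame constructor. Please treat the details as assumptions to check against the tutorial and the cluster documentation; in particular the sparkcontext keyword argument, the YARN master setting and the tree/file/branch names below are just placeholders.

import pyspark
import ROOT

# Configure a SparkContext for the cluster; the application name and the
# "yarn" master value are placeholders to adapt to your setup
conf = pyspark.SparkConf().setAppName("distrdf-test").setMaster("yarn")
sc = pyspark.SparkContext(conf=conf)

# Build a distributed RDataFrame on top of that SparkContext
# (the "sparkcontext" keyword is an assumption based on the 6.24-era API)
SparkRDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
df = SparkRDataFrame("mytree", "myfile.root", sparkcontext=sc)

# Book a histogram lazily and trigger the distributed event loop by drawing it
histo = df.Histo1D(("h", "h", 100, 0., 100.), "mybranch")
c = ROOT.TCanvas("", "", 600, 600)
histo.Draw()
c.Draw()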
Hope this helps!
Cheers,
Vincenzo

That’s good, thanks for the tutorial. Is there a special procedure for including C++ code from header/source files, since the code is eventually executed in a distributed manner?

I explicitly cannot use SWAN (it’s not yet suitable for a production analysis workflow, as nice as it can be for experimentation; I’m looking forward to the JupyterLab interface and other improvements that are coming).

The Hadoop admins pointed me to these docs, though, so others who are similarly unaware of them can take a look:
https://hadoop-user-guide.web.cern.ch/spark/Using_Spark_on_Hadoop.html

Cheers,
Nick

That’s good, thanks for the tutorial. Is there a special procedure for including C++ code from header/source files, since the code is eventually executed in a distributed manner?

Keeping in mind that the aim of this package is to let you distribute your (Python) RDataFrame application, in the traditional RDataFrame API you can pass valid C++ code as string arguments to Filter or Define, for example, including C++ functions that you declared in some header. This use case is supported, but keep in mind that the interface will definitely change. That said, here’s a quick example of what you could do.
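
For reference, with a plain, local RDataFrame the string-based API looks like this (the column name and expressions are purely illustrative):

import ROOT

# Define and Filter accept strings of valid C++ that ROOT JIT-compiles
df = ROOT.RDataFrame(100)
df = df.Define("x", "rdfentry_ * 2")   # C++ expression passed as a string
df = df.Filter("x < 50")               # C++ condition passed as a string
print(df.Count().GetValue())

And here is the distributed case, using a function declared in your own header.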

1. Create a C++ header, for example myheader.hpp:
#ifndef myheader
#define myheader

bool check_number_less_than_5(int num){
    // Checks if the input number is less than 5
    return num < 5;
}

#endif // myheader

Then you can distribute it to the Spark workers directly in your distributed RDataFrame script, for example:

import ROOT
SparkRDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
# Create a distributed RDataFrame that will run on Spark
df = SparkRDataFrame(100)
# Signal header files needed in the application
# They will be sent to the Spark executors later
# NOTE THAT THIS INTERFACE IS SUBJECT TO CHANGE
df._headnode.backend.distribute_headers("myheader.hpp")


# Proceed with your analysis
# Functions written in myheader.hpp will be available in ROOT
# This keeps only the entries with rdfentry_ less than 5
df_filtered = df.Filter("check_number_less_than_5(rdfentry_)")
histo = df_filtered.Histo1D(("distributeheader","distributeheader",10,0,100), "rdfentry_")
c = ROOT.TCanvas("", "", 600, 600)
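# Accessing the result (here by drawing it) triggers the distributed event loop on Spark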
histo.Draw()
c.Draw()

Again, the interface is very experimental, but support for this functionality is definitely planned and intended. If you find any issues please let me know, it would be deeply appreciated :slightly_smiling_face:
Cheers,
Vincenzo
