Best way to parallelize script involving RDataframe

Dear @wandering_particle ,

Indeed, there is no direct way to merge two different ROOT RDataFrames (see the many related posts, e.g. here). The closest equivalent is producing a merged dataset (i.e. a merged TTree via the AddFriend mechanism) and then wrapping the merged dataset in an RDataFrame.

As for how to further parallelize the distributed execution, one approach is booking all your RDataFrame graphs upfront and then sending them all together to the cluster via RunGraphs. An example:


import ROOT

RunGraphs = ROOT.RDF.Experimental.Distributed.RunGraphs

numpy_dicts = []
for weight in weight_list:
    # lazy=True books the operation without triggering the event loop
    numpy_dicts.append(RDataFrame(...).AsNumpy(branches, lazy=True))

RunGraphs(numpy_dicts)  # Triggers all the computation graphs concurrently

Note the use of the kwarg lazy=True in the call to AsNumpy, which ensures the operation is not triggered immediately at the call site. RunGraphs then takes the list of AsNumpy futures and triggers the corresponding computation graphs concurrently, so they are all executed together on the Spark cluster.

Afterwards, you can create pandas DataFrames from the values contained in the list of numpy_dicts.

Cheers,
Vincenzo