Dear @wandering_particle ,
Indeed, there is no direct way to merge two different ROOT RDataFrames (there are many related posts on this forum about it). The closest equivalent is producing a merged dataset (i.e. a merged TTree via the `AddFriend` mechanism) and then wrapping the merged dataset in an RDataFrame.
As for parallelizing the distributed execution even further, one approach is to book all your RDataFrame graphs upfront and then send them to the cluster together via `RunGraphs`. An example:
```python
RunGraphs = ROOT.RDF.Experimental.Distributed.RunGraphs

numpy_dicts = []
for weight in weight_list:
    numpy_dicts.append(RDataFrame(...).AsNumpy(branches, lazy=True))

RunGraphs(numpy_dicts)  # Triggers all the computation graphs concurrently
```
Note the usage of the kwarg `lazy=True` in the call to `AsNumpy`, which ensures the operation is not triggered immediately at the call site. `RunGraphs` then takes the list of `AsNumpy` futures and triggers the corresponding computation graphs concurrently, so they all get executed together on the Spark cluster. Afterwards, you can create pandas dataframes from the values contained in the list of `numpy_dicts`.
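For the last step, a short sketch of the conversion to pandas. The branch names (`pt`, `eta`) and values below are made up to stand in for what `AsNumpy` returns, namely a dictionary mapping branch names to NumPy arrays; depending on your ROOT version, each lazy result may first need a `GetValue()` call after `RunGraphs` to obtain that dictionary.

```python
import numpy as np
import pandas as pd

# Each element mimics the {branch_name: array} mapping produced by
# AsNumpy; in the real workflow these come from the triggered graphs
numpy_dicts = [
    {"pt": np.array([10.0, 20.0]), "eta": np.array([0.5, -1.2])},
    {"pt": np.array([30.0]), "eta": np.array([2.1])},
]

# One pandas DataFrame per booked computation graph
pandas_dfs = [pd.DataFrame(d) for d in numpy_dicts]
print(pandas_dfs[0]["pt"].sum())  # 30.0
```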
Cheers,
Vincenzo