Dear @Chinmay,
Distributed RDataFrame aims to be as close to local RDataFrame as possible. Snapshot,
however, behaves somewhat differently when executed distributedly, which I will try to explain here.
First off, you need an environment with ROOT 6.24 and pyspark on both the client machine and all the cluster machines (scheduler + workers). This can be set up in SWAN or using cvmfs for example.
Then, you also need to have access to some storage from all the workers. For example, a call like
Snapshot("mytree", "/path/to/my/snapshot.root")
means that "/path/to/my/snapshot.root" must be a valid path from any worker of your cluster.
Now we can imagine a very simple distributed RDataFrame application with the Snapshot
operation:
import ROOT
SparkRDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
# Create a distributed RDataFrame that will run on Spark
df = SparkRDataFrame(100)
# Define a single column of 100 consecutive numbers
df_x = df.Define("x","rdfentry_")
# Snapshot the dataset distributedly
snapdf = df_x.Snapshot("mytree","myfile.root")
If I run this on my laptop (where I have both ROOT 6.24 and pyspark), I will see that the snapshot operation has created two files in the folder where I ran the application:
$: ls
myfile_0_49.root myfile_50_99.root
Note that the filenames correspond to the second argument given to the Snapshot
operation, plus two numbers representing the range of entries saved to that particular file.
This derives from the fact that distributed RDataFrame splits the workload of the application along the entries of the dataset, in multiple (exclusive) ranges. In our case, the 100 entries were split by default into two ranges of 50 entries each, so we get two snapshot files containing entries [0, 49] and [50, 99] respectively.
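To make the splitting concrete, here is a minimal sketch of how contiguous, exclusive entry ranges could be computed. This is my own illustrative helper (`split_entries` is not part of the ROOT API), not the actual scheduling code used by distributed RDataFrame:

```python
def split_entries(nentries, npartitions):
    """Split nentries into npartitions contiguous, exclusive ranges.

    Returns a list of inclusive (first, last) entry pairs, distributing
    any remainder over the first few partitions.
    """
    base, rest = divmod(nentries, npartitions)
    ranges = []
    start = 0
    for i in range(npartitions):
        size = base + (1 if i < rest else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(split_entries(100, 2))  # → [(0, 49), (50, 99)]
```

With 100 entries and the default two partitions, this reproduces exactly the [0, 49] and [50, 99] ranges seen in the filenames above.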
This behaviour will be the same on a set of distributed resources. Each worker node will get some ranges of entries to process and will make a Snapshot of that range. Note that you can also pass an optional argument to the distributed RDataFrame constructor to signal into how many ranges you would like to split your dataset:
import ROOT
SparkRDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
df = SparkRDataFrame(1000, npartitions=10)
If you call Snapshot
on this particular df, for example, the output tree would be saved in 10 separate files.
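Following the naming pattern shown earlier (basename, then the first and last entry of the range), you can predict the names of those 10 partial files. A small sketch, where `partial_names` is an illustrative helper of mine (not a ROOT function) and I assume an even split:

```python
def partial_names(basename, nentries, npartitions):
    """Predict the per-range Snapshot filenames, assuming an even split.

    Follows the <basename>_<firstentry>_<lastentry>.root pattern seen in
    the two-partition example above.
    """
    size = nentries // npartitions
    return [
        f"{basename}_{i * size}_{(i + 1) * size - 1}.root"
        for i in range(npartitions)
    ]

print(partial_names("myfile", 1000, 10))
# first file: myfile_0_99.root, last file: myfile_900_999.root
```

If you later need a single output file, you can merge the partial files afterwards, for example with ROOT's `hadd` tool.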
As for your last question, Spark doesn't know about the internals of ROOT files or trees. So you cannot make a Spark DataSet/DataFrame out of a snapshotted tree. But in principle RDataFrame should already provide you with the functions to run your analysis, so there shouldn't be the need to read ROOT data in a Spark DataFrame.
Cheers,
Vincenzo