Dear experts
Although I am fully aware that saving disk space is important and that data reduction is a key aspect of efficient code, I was wondering whether ROOT natively supports a mechanism to take a TChain and break it up into smaller files, to later re-chain.
I am facing this issue while processing O(10) TB of data with RDataFrame, applying some loose filtering and snapshotting only half of the branches.
When doing so, my single output TFile has a size of O(100 GB); however, when copying the file to my EOS user area, I have the problem that only files up to 40 GB can be transferred or stored.
Therefore I would need to split my snapshot TTree into chunks before copying each file.
Does Snapshot allow this? Is there a simple function in ROOT's TFile/TTree that allows one to do so?
Or should I code it myself?
Thank you,
Here is what I use for the moment (splitting into 3 parts of equal size), though it is a bit of an ugly solution if one wants to split a file of XXX GB into chunks of 20 GB each.
However, does this (rdfentry_ % splitting) work with implicit MT enabled?
It might be a good thing to have a "split-up" option within the SnapshotOptions flags.
Thank you for the reply. I saw that as well, but it doesn't quite match what I would like to achieve, since it requires prior knowledge of the number of entries of the tree, which I want to be agnostic of before running the command-line tool.
I am OK with using it within some other script. For the moment I use the rdfentry_ % option; it would be a nice feature if Snapshot could stop and continue, say, 10 times, slicing the output automatically and internally.
Thanks, I also looked this up and found that answer unsatisfactory, as it requires Go ROOT, which is something you don't get for free in the usual conda environment I use on lxplus, and installing it as an external package would be overkill for this task.
Another solution is to exploit the distributed version of RDataFrame, which accepts an optional input argument specifying the number of partitions you want to divide your input dataset into. This translates directly into the same number of tasks, i.e. the same number of Snapshot calls, so you already get a split output dataset. An example:
import ROOT
from dask.distributed import Client, LocalCluster

# Decide how many processes for parallelisation and how many splits of the dataset
NWORKERS = 2
NPARTITIONS = 2

# Point RDataFrame calls to the Dask-specific RDataFrame
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

def create_connection():
    cluster = LocalCluster(n_workers=NWORKERS, threads_per_worker=1, processes=True, memory_limit="2GiB")
    client = Client(cluster)
    return client

if __name__ == "__main__":
    # Create the connection to the mock Dask cluster on the local machine
    connection = create_connection()

    # Create an RDataFrame that will use Dask as a backend for computations.
    # Give it a (treename, filename) pair, or use any other TTree-based
    # constructor of RDataFrame.
    df = RDataFrame(TREENAME, FILENAME, daskclient=connection, npartitions=NPARTITIONS)

    # Your analysis here...
    df.Snapshot("out_treename", "out_filename.root")
When calling Snapshot("out_treename", "out_filename.root"), the distributed RDataFrame will create separate files, one per partition. In the example above, where I set NPARTITIONS=2 (you can choose this number), the output files will be out_filename_0.root and out_filename_1.root.
This works on a single machine as well as on a cluster of machines (with a bit of extra setup it also works on the HTCondor pools available from lxplus).