RDataFrame::Snapshot split up files by sizes or TTree chunk for later chaining

Dear experts
Altough i am fully aware that save disk space is important and that data reduction is a very important aspect of efficient code, i was wondering if Root support natively a mechanism which allows to take a TChain and break It up in smaller files to later rechain.
I am facing this issue while processing with RDataFrame O(10) Tb of data, apply some loose filtering and snapshotting only half of the branches.
When doing so, my output single tfile has a size of O(100Gb), however when copying the file on the eos user area i have the problem that Only files up to 40Gb can be transferred or stored.

Therefore i would Need to chunk in pieces my snapshot TTree before copying each file.
Does Snapshot allow to do this? Is there a simple function in root TFile/TTree allowing to do so?
Should i code It by myself?

Thanks in Advance

Maybe @Axel or @vpadulan can help here

1 Like

Thank you,
Here is what I use for the moment, (splitting in 3 parts of equal size) [ tough a bit ugly solution if one wants to split in chunks of file size of 20 Gb out of a file of XXXGb.

    snapshotOptions = r.RDF.RSnapshotOptions()
    snapshotOptions.fLazy = True
    snp0 = node.Filter("rdfentry_ % 3 == 0").Snapshot( "DecayTree", f'FilterFakeMaps_{YEAR}_{PART}_B2emu_0.root', cols_keep, snapshotOptions)
    snp1 = node.Filter("rdfentry_ % 3 == 1").Snapshot( "DecayTree", f'FilterFakeMaps_{YEAR}_{PART}_B2emu_1.root', cols_keep, snapshotOptions)
    snp2 = node.Filter("rdfentry_ % 3 == 2").Snapshot( "DecayTree", f'FilterFakeMaps_{YEAR}_{PART}_B2emu_2.root', cols_keep, snapshotOptions)
    print(colored("Snapshotting", "red"))
    print(colored("Snapshotting done", "green"))

However, does this work ( rdfentry_% splitting) with enable implicit MT?
It might be a good thing to have the ‘split-up’ option within the SnapshotOptions flag.

rooteventselector --help

Thank you for the reply, i saw it as well but it doesn’t very well match what i would like to achieve since you need to have prior knowledge of the n-entries of the tree which i want to be agnostic of before running the command line tool.
I am ok using it within some other script. For the moment i use the rdfentry% option, it would be a nice feature to have a stop and continue for 10 times when actually making a Snapshot, Internally slicing automatically.

SnapshotOptions.splitFileSize = 10
Snapshot( "DecayTree","myOutputFile.root", columnsKeep, SnapshotOptions)

Creating myOutputFile_{i}.root each with a DecayTree

ROOT Forum → Search → “root-split”

Thanks, i also looked this up and found the answer not satisfactory as I need go root which is something you don’t get for free on usual conda environment I use on lxplus and install it as external package would be an overkill to do this.

Binary distributions by @sbinet are “statically linked”, so you need nothing.

1 Like

all the binaries shipped with groot are statically linked. (if not, that’s a bug).

so you don’t need a Go nor a Go-HEP installation.
one just has to use the OS+Arch specific binary. and voilà.

(the one caveat being that groot can not read/write all the ROOT files)

1 Like

@RENATO_QUAGLIANI let me know if you have issues w/ groot’s root-split.

(here, or on the Go-HEP mailing list)


Another solution is to exploit the distributed version of RDataFrame that accepts an optional input argument of the number of splits that you want to divide your input dataset in. This will directly translate to the same number of tasks, i.e. the same number of Snapshot calls so you will already get a splitted output dataset. An example:

import ROOT
from dask.distributed import Client, LocalCluster

# Decide how many processes for parallelisation and how many splits of the dataset

# Point RDataFrame calls to the Dask specific RDataFrame
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

def create_connection():
    cluster = LocalCluster(n_workers=NWORKERS, threads_per_worker=1, processes=True, memory_limit="2GiB")
    client = Client(cluster)
    return client

if __name__ == "__main__":
    # Create the connection to the mock Dask cluster on the local machine
    connection = create_connection()
    # Create an RDataFrame that will use Dask as a backend for computations
    # Give it a treename, filename pair
    # or any other TTree-based constructor of RDataFrame works
    df = RDataFrame(TREENAME, FILENAME, daskclient=connection, npartitions=NPARTITIONS)

    # Your analysis here...

    df.Snapshot("out_treename", "out_filename.root")

When calling Snapshot("out_treename", "out_filename.root") the distributed RDataFrame will create separate files, one per partition. In the example above where I set NPARTITIONS=2 (you can decide this number), the output files will be out_filename_0.root and out_filename_1.root.

This works on a single machine or also on a cluster of machines (with a bit of extra setup this works also on the HTCondor pools available from lxplus).


1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.