Changing basket size in RDataFrame::Snapshot

Dear ROOT experts,

I’m wondering if there’s a way to set the basket size used in the output file produced by RDataFrame::Snapshot? In the RSnapshotOptions that I can pass to the function call, I only see that the value for the AutoFlush can be configured.

To briefly describe my motivation: I produced a set of ROOT files for which I had to reduce the basket size from the default 32kB to 2kB (to reduce memory consumption on a computing grid, as I had to write out a lot of branches). I’d now like to “re-write” the trees locally (along with adding some new columns), where memory consumption is less of an issue, with larger basket sizes again.

From some tests, I deduced that branches in the output tree from Snapshot() that already existed in the input tree are written out with the same basket size as in the input tree, while the branches of new columns get the default basket size of 32kB. For example, assume that branch_x is contained in my input tree with a basket size of 2kB, and I run:

df = ROOT.RDataFrame("tree", "input_file.root")
df = df.Redefine("branch_x", "branch_x")
df = df.Define("branch_x_new", "branch_x")
df.Snapshot("tree", "input_file_repr.root", ["branch_x", "branch_x_new"])

After this, branch_x in the new file has (according to TTree::Print()) a basket size of 2kB, while interestingly branch_x_new has a basket size of 32kB. Is there any way to redefine columns such that they are written out with a different basket size than in their input file?
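The check described above can also be done programmatically instead of reading the TTree::Print() output. The following is only a minimal sketch: the helper names basket_sizes and report are hypothetical, the file and tree names are the ones from the snippet above, and only top-level branches are listed.

```python
def basket_sizes(tree):
    """Return {branch name: basket size in bytes} for the top-level branches.

    Hypothetical helper: relies only on TTree::GetListOfBranches(),
    TBranch::GetName() and TBranch::GetBasketSize().
    """
    return {b.GetName(): b.GetBasketSize() for b in tree.GetListOfBranches()}


def report(path, tree_name):
    """Print the basket size of every top-level branch of a tree in a file."""
    import ROOT  # imported lazily so basket_sizes() stays usable without ROOT

    f = ROOT.TFile.Open(path)
    tree = f.Get(tree_name)
    for name, size in sorted(basket_sizes(tree).items()):
        print(f"{name}: {size} bytes")
    f.Close()
```

Calling report("input_file_repr.root", "tree") on the Snapshot output should then list branch_x and branch_x_new side by side with their basket sizes.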

Many thanks!
Michael

ROOT Version: 6.28/12
Platform: linuxx8664gcc
Compiler: g++ (GCC) 13.1.0


First, welcome to the ROOT Forum!
Then, according to the ROOT Version 6.24 Release Notes:

Behavior changes

  • Snapshot now respects the basket size and split level of the original branch when copying branches to a new TTree.

So maybe @vpadulan can comment on this.

Dear @miholzbo ,

Thanks for reaching out to the forum! Your situation looks a bit extreme; I wonder how exactly you ended up having to modify the default basket size.

At the moment, this knob is not available in the RDataFrame Snapshot options, but the good news is that it’s not too difficult to implement. I have created a GitHub issue to keep track of this [1]. Keep in mind that at the moment this would have low priority. Of course, any contribution would be appreciated!

Thanks,
Vincenzo

[1] Add option to change default basket size in RDataFrame Snapshot · Issue #17418 · root-project/root · GitHub

Dear @vpadulan and @bellenot,

Many thanks for the responses and the nice welcome to the forum! It’s at least reassuring that I didn’t miss anything obvious about how to do this, and thanks for opening an issue so that this might become possible in the future.

We had to write out quite a lot of branches (for various experimental variations, around O(40k)), and with the default basket size of 32kB this already yields a memory consumption of > 1GB. On the computing grid we are using, jobs shouldn’t require more than 2GB, which caused issues for some of the samples we processed. That’s why we reduced the basket size to 4kB. But we noticed that this also seems to increase the disk space needed for the output ROOT files (I have to admit I don’t fully understand this, though), and by “re-processing” our output files we wanted to mitigate this again.
I think smaller basket sizes also make the read-out a bit slower, so this would be another motivation to re-write TTrees with larger baskets, which would be super simple via RDataFrame::Snapshot() if it allowed setting the basket size.
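
The memory figure quoted above can be reproduced with a back-of-the-envelope estimate. This is only a rough sketch under stated assumptions: exactly 40,000 branches (the post says "around O(40k)"), one uncompressed basket buffer held in memory per branch while writing, 1 kB taken as 1000 bytes, and no other overhead counted. The function name basket_memory_bytes is hypothetical.

```python
def basket_memory_bytes(n_branches: int, basket_size_bytes: int) -> int:
    """Rough estimate: one in-memory basket buffer per branch while writing."""
    return n_branches * basket_size_bytes


N_BRANCHES = 40_000  # assumption: "around O(40k)" taken literally

# Basket sizes mentioned in the thread, with 1 kB assumed to be 1000 bytes.
for label, size in [("32 kB", 32_000), ("4 kB", 4_000), ("2 kB", 2_000)]:
    total = basket_memory_bytes(N_BRANCHES, size)
    print(f"{label} baskets: {total / 1e9:.2f} GB")
```

Under these assumptions the default 32kB baskets alone account for roughly 1.3 GB, which is consistent with the "> 1GB" observed, while the reduced basket sizes stay well below the 2GB job limit.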

Thanks!
Michael

I see,

Have you checked whether there is also a dependency between the event range processed by each job and the memory used? I.e., what happens if you submit jobs that end up writing smaller TTree objects with fewer entries per tree?