Large changes in branch sizes in later builds of ROOT

Dear @Eirik_Gramstad ,

This forum post triggered an investigation into how compression affects TTree datasets like yours. A key characteristic of the input file you shared is that it has many branches of type RVec in which many, and often all, of the vectors are empty. It turns out that in this particular scenario the ZLIB algorithm compresses the TTree dataset better than the ZSTD algorithm. This holds in particular for the vanilla ZLIB implementation, which is what is available on the lxplus node, as opposed to the other popular implementation, zlib-ng, which is available on many Linux systems such as my workstation. This went completely against our prior knowledge and understanding.

I’m mentioning this because the issue you see is due to the change in the compression algorithm used by Snapshot. In fact, it can be reproduced without changing ROOT version: staying on 6.36 and changing only the compression settings is enough. The default was changed in 6.38 after internal discussion, based on the knowledge available at the time. This was indicated in the ROOT Version 6.38 Release Notes, and it is also visible in your own script: the first time you execute it with a ROOT version later than 6.36, it prints the following message:

In ROOT 6.38, the default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1, the TTree default) to 505 (ZSTD with compression level 5). ...

So in practice, what you are seeing is the effect of RDataFrame Snapshot compressing your data with ZSTD level 5 (ROOT compression setting 505, the default in 6.38) instead of ZLIB level 1 (ROOT compression setting 101, the default before).
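
To make numeric settings such as 101 and 505 easier to read, here is a minimal sketch of how they can be decoded. It assumes the usual ROOT convention that a compression setting is algorithm * 100 + level. The function name and the lookup table are hypothetical helpers written for illustration; they are not part of the ROOT API.

```python
# Illustrative helper, not a ROOT function.
# Assumption: a ROOT compression setting encodes algorithm * 100 + level.
ALGORITHM_NAMES = {1: "ZLIB", 2: "LZMA", 4: "LZ4", 5: "ZSTD"}


def decode_compression_setting(setting):
    """Split a numeric compression setting into (algorithm name, level)."""
    algorithm, level = divmod(setting, 100)
    name = ALGORITHM_NAMES.get(algorithm, f"unknown({algorithm})")
    return name, level


print(decode_compression_setting(101))  # ('ZLIB', 1)
print(decode_compression_setting(505))  # ('ZSTD', 5)
```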

The details of the full investigation are available in the GitHub repository vepadulano/ttree-lossless-compression-studies, a collection of programs to study the behaviour of different compression algorithms used by ROOT to compress datasets in the TTree format.

Following this, I have opened a PR to revert the Snapshot behaviour in light of the new knowledge: "Revert choice to change default Snapshot TTree compression settings" (Pull Request #21753 in root-project/root, by vepadulano).

In the meantime, a quick workaround is to set the compression settings explicitly in your Snapshot calls, e.g.:

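import ROOT

# Explicitly request the pre-6.38 default: ZLIB at compression level 1 (setting 101)
opts = ROOT.RDF.RSnapshotOptions()
opts.fCompressionAlgorithm = ROOT.RCompressionSetting.EAlgorithm.EValues.kZLIB
opts.fCompressionLevel = 1
# df, output_treename, output_filename and columns_list come from your own script
df.Snapshot(output_treename, output_filename, columns_list, opts)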

Cheers,
Vincenzo