File size reduction (factor 2): understanding the compression applied

Dear ROOT experts,

I have a question concerning TTree streaming to disk and something we observed in some of our ntuples.

Our ntuples are processed in sequential steps, say STEP1 and STEP2.

STEP1 produces an ntuple (with nSTEP1 branches) from compiled C++ code and writes it to file1.

In STEP2 we take the output of STEP1 and attach new branches on top (nSTEP2), producing file2, this time using pyROOT and different source code.

Given this, one would expect file2 to be larger than file1. However, what we see is that size(file2) is about 1/2 of size(file1).
Comparing branches common to the two files, we see that the number of baskets is very different for some branches with very repetitive values.
We checked that the number of branches increased, and that all branches are sane and correctly propagated.
The only differences we observed are the number of baskets and the compression level.

Therefore the question: is it possible for compression to reduce the size of an ntuple by a factor of 2? What does it depend on, and how can one explicitly check the reason for the file-size reduction?
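As a rough illustration of why both the compression level and the amount of data compressed at once can matter for repetitive values, here is a minimal sketch using Python's standard zlib module rather than ROOT itself. The payload, the chunk sizes (standing in for baskets) and the helper name compressed_size are assumptions made for the illustration; ROOT's actual basket handling is more involved.

```python
import struct
import zlib


def compressed_size(payload: bytes, chunk_bytes: int, level: int) -> int:
    """Compress payload in independent chunks; return the total compressed size."""
    total = 0
    for start in range(0, len(payload), chunk_bytes):
        total += len(zlib.compress(payload[start:start + chunk_bytes], level))
    return total


# Hypothetical stand-in for a repetitive branch: 100000 doubles cycling over 4 values.
payload = struct.pack("<4d", 0.0, 1.0, 2.0, 3.0) * 25_000

for chunk_bytes in (1024, 65536):
    for level in (1, 9):
        size = compressed_size(payload, chunk_bytes, level)
        ratio = len(payload) / size
        print(f"chunk={chunk_bytes:6d} B  level={level}  compressed={size:7d} B  ratio={ratio:.1f}")
```

With data this repetitive, each independently compressed chunk pays a fixed overhead, so the same values split into many small chunks end up larger on disk than when split into a few large ones.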

Thanks in advance
Renato

Hi Renato,

A factor of 2 from the compression level sounds like quite a lot, but it is not impossible. You can use TTree::Print() to get detailed information about the overall compression ratio as well as the compression ratio of individual branches.
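If the Print() output is too long to compare by eye, the same per-branch numbers can be collected programmatically. The sketch below is only an assumption about how one might do it in pyROOT: it relies on TBranch::GetTotBytes and TBranch::GetZipBytes (with the "*" option to include sub-branches), and the helper names branch_compression and print_report are hypothetical.

```python
def branch_compression(tree):
    """Return (name, uncompressed bytes, compressed bytes, factor) per top-level branch."""
    rows = []
    for branch in tree.GetListOfBranches():
        tot = branch.GetTotBytes("*")     # uncompressed size, including sub-branches
        zipped = branch.GetZipBytes("*")  # compressed size, including sub-branches
        factor = tot / zipped if zipped > 0 else float("nan")
        rows.append((branch.GetName(), tot, zipped, factor))
    # Branches with the largest compressed size come first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def print_report(path, tree_name, top=20):
    """Print the `top` largest branches of one tree (needs a ROOT installation)."""
    import ROOT  # imported lazily so branch_compression can be used without ROOT

    root_file = ROOT.TFile.Open(path)
    tree = root_file.Get(tree_name)
    for name, tot, zipped, factor in branch_compression(tree)[:top]:
        print(f"{name:40s} {tot:14d} {zipped:14d} {factor:8.2f}")
    root_file.Close()
```

Calling print_report once per file (for example with the tree name DecayTuple) and comparing the two listings should show which branches account for the size difference.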

Cheers,
Jakob

Hi @jblomer,
This is in fact what I did.
It is just not clear to me why compiled C++ code where we run

auto newTree = (TTree*)oldTree.CopyTree("CUT");
// DO STUFF on newTree to attach new tuples
newTree->Write("", TObject::kOverwrite);

somehow behaves differently from when we do a similar thing using pyROOT.
I.e., how does the compression get optimized in ROOT? Is there a recommended way to strike a good balance and ensure the file size remains roughly constant whatever processing one does?
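One way to take the guesswork out of this would be to set the compression explicitly when opening the output file in both steps. The sketch below assumes ROOT's convention of encoding the setting as 100 * algorithm + level and passing it as the fourth argument of TFile::Open; the helper names compression_settings and open_output are hypothetical, and whether copied branches pick up the setting of the output file or keep that of the input should be checked with TTree::Print rather than assumed.

```python
# Algorithm codes as defined in ROOT's RCompressionSetting::EAlgorithm.
ALGORITHMS = {"ZLIB": 1, "LZMA": 2, "LZ4": 4, "ZSTD": 5}


def compression_settings(algorithm: str, level: int) -> int:
    """Encode algorithm and level as a single integer: 100 * algorithm + level."""
    if not 0 <= level <= 9:
        raise ValueError("level must be between 0 and 9")
    return 100 * ALGORITHMS[algorithm] + level


def open_output(path: str, algorithm: str = "ZLIB", level: int = 1):
    """Create an output file with explicit compression settings (needs PyROOT)."""
    import ROOT  # imported lazily so compression_settings works without ROOT

    return ROOT.TFile.Open(path, "RECREATE", "", compression_settings(algorithm, level))
```

Using the same algorithm and level in the C++ step and in the pyROOT step would at least rule out the compression setting as the source of the difference.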

I’d invite @pcanal to advise on best practices

Hi Renato,

I am missing some context to understand what is going on and how to advise properly.

Can you provide the TTree::Print output for the large and the small file?

Thanks,
Philippe.

Dear @pcanal, thanks a lot for jumping into the topic.

Here are the raw file sizes of FILE1 and FILE2:

11G 16 set 17.48 FILE1 : /eos/lhcb/wg/RD/RKstar/tuples/v9/RKst/TupleProcess_EE_SKIMALL_LOWQ2/Bd2KstEE/MC12MD/0/TupleProcess.root
6,1G 18 set 18.48 FILE2 : /eos/lhcb/wg/RD/RKstar/tuples/v9/RKst/TupleProcess_EE_BDT_LOWQ2/Bd2KstEE/MC12MD/0/TupleProcess.root

where FILE2 has been obtained from FILE1 by adding a couple of branches in a pyROOT session.

Here are the log files of the Print() call on the 2 TTrees in the files.
The code I used to produce the dumps:

f.ls()
DecayTuple->Print();
MCDecayTuple->Print();

file1_content.txt (291.2 KB) file2_content.txt (291.3 KB)

Thanks a lot,
Renato
