File size reduction (factor 2): understanding the compression applied

RENATO_QUAGLIANI · September 21, 2020, 9:44am

Dear ROOT experts,

I have a question about TTree streaming to disk, prompted by something we observed in some of our ntuples.

Our ntuples are processed in sequential steps, say STEP1 and STEP2.

STEP1 produces an ntuple with nSTEP1 branches from compiled C++ code and writes it to file1.

STEP2 takes the output of STEP1 and attaches nSTEP2 new branches on top, producing file2, this time using pyROOT and different source code.

Given this, one would expect size(file2) > size(file1). However, what we see is that size(file2) is about 1/2 of size(file1).
Comparing branches common to the two files, the number of baskets is very different for some branches with very repetitive values.
We checked that the number of branches increased and that all branches are sane and correctly propagated.
The only differences we observed are the number of baskets and the compression level.

Hence the question: is it possible for compression alone to reduce the size of an ntuple by a factor 2? What does this depend on, and how can one explicitly check the reason for the file-size reduction?
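To make the question concrete, here is a small ROOT-independent sketch of how much the compressed size of a repetitive-value column can depend on the compression level and on the size of the buffer that is compressed in one go. It is only an illustration under stated assumptions: Python's `zlib` stands in for whatever algorithm ROOT is configured with, `chunk_bytes` stands in for the basket size, and the function names and the toy data are hypothetical. The printed factors are not ROOT's numbers.

```python
import random
import struct
import zlib


def packed_floats(values):
    """Serialize floats as 4-byte little-endian values (a rough stand-in for a Float_t buffer)."""
    return struct.pack(f"<{len(values)}f", *values)


def compression_factor(payload, level, chunk_bytes):
    """Compress `payload` in independent chunks; return uncompressed size / compressed size."""
    compressed = 0
    for start in range(0, len(payload), chunk_bytes):
        compressed += len(zlib.compress(payload[start:start + chunk_bytes], level))
    return len(payload) / compressed


rng = random.Random(12345)
n = 200_000
# Toy "branches": long runs of identical values vs. uniformly random values.
repetitive = packed_floats([float(i // 5000) for i in range(n)])
noisy = packed_floats([rng.random() for _ in range(n)])

for name, payload in (("repetitive", repetitive), ("noisy", noisy)):
    for level in (1, 9):
        for chunk in (4_000, 256_000):
            factor = compression_factor(payload, level, chunk)
            print(f"{name:10s} level={level} chunk={chunk:>7d} B  factor={factor:8.1f}")
```

In this toy setup the repetitive column compresses far better than the noisy one, and it compresses better still when larger chunks are handed to the compressor, because each chunk pays a fixed overhead. Whether the same mechanism explains the factor 2 between file1 and file2 is exactly what the per-branch numbers from the real files would have to show.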

Thanks in advance
Renato

jblomer · September 21, 2020, 12:26pm

Hi Renato,

A factor 2 from the compression level sounds like quite a lot, but it is not impossible. You can use TTree::Print() to get detailed information about the overall compression ratio as well as the compression ratio of individual branches.
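If the full Print() output is too long to compare by eye, the same per-branch numbers can be collected programmatically. The following is a minimal PyROOT sketch, not an official recipe: the helper names (`rank_by_compressed_size`, `read_branch_sizes`, `print_report`) are made up for this example, and it assumes a ROOT installation with PyROOT enabled. It relies on `TBranch::GetTotBytes`, `TBranch::GetZipBytes` and `TFile::GetCompressionSettings`.

```python
def rank_by_compressed_size(branch_sizes):
    """Sort (name, total_bytes, zipped_bytes) tuples by on-disk size, largest first,
    and append the compression factor total/zipped to each entry."""
    ranked = []
    for name, total_bytes, zipped_bytes in branch_sizes:
        factor = total_bytes / zipped_bytes if zipped_bytes > 0 else float("nan")
        ranked.append((name, total_bytes, zipped_bytes, factor))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


def read_branch_sizes(file_path, tree_name):
    """Collect per-branch sizes from a TTree (requires ROOT with PyROOT)."""
    import ROOT  # imported lazily so the helper above works without ROOT

    root_file = ROOT.TFile.Open(file_path)
    tree = root_file.Get(tree_name)
    # The file-level setting is encoded as 100 * algorithm + level.
    print("file compression setting:", root_file.GetCompressionSettings())
    # The "*" option asks for the totals including sub-branches.
    sizes = [(b.GetName(), b.GetTotBytes("*"), b.GetZipBytes("*"))
             for b in tree.GetListOfBranches()]
    root_file.Close()
    return sizes


def print_report(file_path, tree_name, top=20):
    """Print the `top` branches that occupy the most space on disk."""
    rows = rank_by_compressed_size(read_branch_sizes(file_path, tree_name))
    for name, total_bytes, zipped_bytes, factor in rows[:top]:
        print(f"{name:40s} tot={total_bytes:>14d} zip={zipped_bytes:>14d} "
              f"factor={factor:6.2f}")
```

For example, `print_report("TupleProcess.root", "DecayTuple")` run once per file would list the branches that dominate the size of each file, which makes it easier to see where a large difference comes from.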

Cheers,
Jakob

RENATO_QUAGLIANI · September 21, 2020, 12:31pm

Hi @jblomer,
This is in fact what I did.
What is not clear to me is why compiled C++ code in which we run

auto newTree = (TTree*)oldTree.CopyTree("CUT");
// Do stuff on newTree to attach new branches
newTree->Write("", TObject::kOverwrite);

behaves differently from a similar operation done with pyROOT.
That is, how does compression get optimized in ROOT? Is there a recommended way to get a good balance and ensure the file size remains roughly constant whatever processing one does?
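One way to take the guesswork out of this is to set the compression explicitly on the output file in both steps, so that the two files are at least written with the same algorithm and level. The sketch below is an assumption-laden illustration, not a statement of what either step currently does. ROOT encodes the setting as a single integer, 100 * algorithm + level, and `TFile::Open` accepts it as its fourth argument. The helper names are hypothetical. My understanding is that a tree filled entry by entry through `CopyTree` gets its baskets compressed with the output file's setting, but that is worth confirming with Print() on the result.

```python
ALGORITHMS = {"zlib": 1, "lzma": 2, "lz4": 4, "zstd": 5}  # ROOT algorithm codes


def compression_setting(algorithm, level):
    """Encode algorithm and level as ROOT's single integer: 100 * algorithm + level."""
    if not 0 <= level <= 9:
        raise ValueError("level must be between 0 and 9")
    return 100 * ALGORITHMS[algorithm] + level


def decode_setting(setting):
    """Split a value from TFile::GetCompressionSettings() into (algorithm code, level)."""
    return divmod(setting, 100)


def skim_with_setting(in_path, tree_name, cut, out_path, setting):
    """Hypothetical helper: copy selected entries into a file with an explicit setting."""
    import ROOT  # needs a ROOT installation with PyROOT; imported lazily

    in_file = ROOT.TFile.Open(in_path)
    tree = in_file.Get(tree_name)
    # Opening the output file makes it the current directory for the new tree.
    out_file = ROOT.TFile.Open(out_path, "RECREATE", "", setting)
    new_tree = tree.CopyTree(cut)
    new_tree.Write("", ROOT.TObject.kOverwrite)
    out_file.Close()
    in_file.Close()
```

For instance, `decode_setting(101)` gives `(1, 1)`, that is, ZLIB at level 1, and calling `skim_with_setting(..., compression_setting("zlib", 1))` in both the C++-equivalent and the pyROOT step would remove the compression setting as a source of difference. Any remaining size difference would then have to come from something else, such as the basket layout.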

jblomer · September 21, 2020, 12:33pm

I’d invite @pcanal to advise on best practices

pcanal · September 21, 2020, 5:15pm

Hi Renato,

I am missing some context to understand what is going on and how to advise properly.

Can you provide the TTree::Print output for the large and the small file?

Thanks,
Philippe.

RENATO_QUAGLIANI · September 21, 2020, 7:51pm

Dear @pcanal, thanks a lot for jumping into the topic.

Here are the raw file sizes of FILE1 and FILE2:

11G 16 set 17.48 FILE1 : /eos/lhcb/wg/RD/RKstar/tuples/v9/RKst/TupleProcess_EE_SKIMALL_LOWQ2/Bd2KstEE/MC12MD/0/TupleProcess.root
6,1G 18 set 18.48 FILE2 : /eos/lhcb/wg/RD/RKstar/tuples/v9/RKst/TupleProcess_EE_BDT_LOWQ2/Bd2KstEE/MC12MD/0/TupleProcess.root

where FILE2 has been obtained from FILE1 by adding a couple of branches in a pyROOT session.

Here are the log files of the Print() call on the two TTrees in the files.
The code I used to produce the dumps:

f.ls()
DecayTuple->Print();
MCDecayTuple->Print();

file1_content.txt (291.2 KB) file2_content.txt (291.3 KB)

Thanks a lot,
Renato
