Chains vs single root file

Rutik · August 2, 2022, 1:23pm

Hi Experts,

I am writing some code and I see that it takes longer to run if I give it a bunch of .root files as input and TChain them. On the other hand, if I add the trees together into a single input file, it runs faster.

Does this sound right? If so, could anyone explain why this is happening?

Cheers,
Rutik

ikabadzhov · August 2, 2022, 4:10pm

Quick answer: this is unexpected. Maybe you are running on too few entries and are seeing fixed overheads, which should become negligible on a larger input.

To better answer your question, could you please let us know:

What is the total size of all input files?
Do you run sequentially, or do you use EnableImplicitMT or distRDF?
Are the files approximately of the same size?
Are you using ROOT master, or maybe v6.26 etc.?
In case of a TChain do you also use TEntryLists?

eguiraud · August 2, 2022, 4:14pm

Does the file with the single tree use the same compression algorithm as the original files? (You can check by running file yourfile.root, which should say something like “compression: 101”. If the values differ between the original dataset and the new dataset, that’s bound to change runtimes.)
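
For anyone reading that number: as far as I know, ROOT packs the setting as 100 * algorithm + level, so 101 would be algorithm 1 at level 1. Below is a minimal sketch of the decoding in plain Python. It is not a ROOT API, decode_compression is a made-up helper, and the algorithm table is written from memory, so please check it against your ROOT version.

```python
# Algorithm numbers as I recall them from ROOT's compression settings.
# This table is an assumption; verify it against your ROOT version.
ALGORITHMS = {
    0: "global default",
    1: "ZLIB",
    2: "LZMA",
    3: "old ROOT algorithm",
    4: "LZ4",
    5: "ZSTD",
}


def decode_compression(setting):
    """Split a compression setting into (algorithm name, level)."""
    # setting = 100 * algorithm + level
    algorithm, level = divmod(setting, 100)
    return ALGORITHMS.get(algorithm, "unknown"), level


print(decode_compression(101))  # ('ZLIB', 1)
print(decode_compression(1))    # ('global default', 1)
```

Under that reading, a value with no hundreds digit names no explicit algorithm and only carries a level.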

Rutik · August 2, 2022, 6:10pm

Hi @ikabadzhov,

Thank you for your response. I’ll try to answer your points to the best of my knowledge (I’m quite new to this):

I have approx. 36GB of data
I don’t know what these are, so I’d say I’m not using any of them. I just add all the files to a chain in a while loop.
Yes, the files are approximately the same size. I’d say about 5% of them are larger than the others, but otherwise they are the same.
I’m using v6.26
I don’t use TEntryLists.

Cheers,
Rutik

Rutik · August 2, 2022, 6:20pm

Hi Enrico,

Thank you for the reply. I didn’t add the whole 36 GB, only about half of it, and that output file has a compression of 101. But an individual file has a compression of 1. So that is different, I suppose.

Best,
Rutik

eguiraud · August 2, 2022, 7:08pm

Uhm I’m not sure what 1 stands for, but this might be it. @pcanal what does a compression setting of 1 correspond to?

You can also merge the files by passing the -ff option to hadd, which tells it to use the same compression settings as the first of the input files. That should guarantee that the aggregated TTree is compressed with the same settings as the individual trees.

Now, a small runtime difference can be expected. The TChain has to do more work (at every entry, check whether it’s time to switch to a new file, open each new file and close the old one, etc.), but, as Ivan mentioned, that difference should be small compared to the time spent in actual I/O and data processing. What runtimes are we talking about (after using hadd -ff)?
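
To make the per-entry bookkeeping concrete, here is a toy model in plain Python. ToyChain and its fields are made up for illustration, and this is not how TChain is actually implemented. It only sketches the kind of work involved: mapping a global entry number to a file and a local entry, and switching files when a boundary is crossed.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyChain:
    """Toy stand-in for a chain of trees; not ROOT's TChain."""

    def __init__(self, entries_per_file):
        # offsets[i] is the global index of the first entry in file i;
        # the last element is the total number of entries.
        self.offsets = [0, *accumulate(entries_per_file)]
        self.current_file = None
        self.switches = 0  # number of times a file had to be opened

    def load_entry(self, global_entry):
        """Return (file index, local entry), switching files if needed."""
        if not 0 <= global_entry < self.offsets[-1]:
            raise IndexError(global_entry)
        file_index = bisect_right(self.offsets, global_entry) - 1
        if file_index != self.current_file:
            # In a real chain this would be the costly step: close the
            # old file, open the new one, and read its metadata.
            self.current_file = file_index
            self.switches += 1
        return file_index, global_entry - self.offsets[file_index]


chain = ToyChain([3, 2, 4])  # three hypothetical files
for entry in range(chain.offsets[-1]):
    chain.load_entry(entry)
print(chain.switches)  # 3
```

In this model the boundary check runs once per entry but the switch happens only once per file, which is why the extra cost should stay small when each file holds many entries.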

Cheers,
Enrico

pcanal · August 3, 2022, 6:10pm

If you are using hadd and want to make sure there is no unexpected decompression/recompression, use the option -fk:

  -fk                                  Sets the target file to contain the baskets with the same compression
                                       as the input files (unless -O is specified). Compresses the meta data
                                       using the compression level specified in the first input or the
                                       compression setting after fk (for example 206 when using -fk206)
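
If you are scripting the merge, something along these lines could assemble the command. This is only a sketch: hadd_command and the file names are hypothetical, and it assumes the usual hadd argument order of options, then the target file, then the inputs. The only flags it knows are the -ff and -fk options mentioned in this thread.

```python
import shlex


def hadd_command(target, inputs, mode="-ff"):
    """Build an hadd argument list; mode is "-ff" or "-fk"."""
    if mode not in ("-ff", "-fk"):
        raise ValueError(f"unsupported mode: {mode}")
    if not inputs:
        raise ValueError("need at least one input file")
    # Assumed layout: hadd [options] target source1 source2 ...
    return ["hadd", mode, target, *inputs]


# Hypothetical file names, for illustration only.
cmd = hadd_command("merged.root", ["run1.root", "run2.root"], mode="-fk")
print(shlex.join(cmd))  # hadd -fk merged.root run1.root run2.root

# To actually run it (needs ROOT's hadd on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list rather than a single string avoids quoting problems if a file name contains spaces.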
