Chains vs single root file

Hi Experts,

I am writing some code and I see that it takes longer to run if I give it a bunch of .root files as input and TChain them. On the other hand, if I merge the trees into a single input file, it runs faster.

Does this reasoning sound right? If so, could anyone explain why this is happening?

Cheers,
Rutik

Quick answer: this is unexpected. Maybe you are running on too few entries and are seeing fixed overheads, which should become negligible on larger inputs.

To better answer your question, could you please let us know:

  1. What is the total size of all input files?
  2. Are you running sequentially, or are you using EnableImplicitMT or distRDF?
  3. Are the files approximately of the same size?
  4. Are you using ROOT master, or maybe v6.26 etc.?
  5. In case of a TChain do you also use TEntryLists?
  1. Does the file with the single tree use the same compression algorithm as the original files? (You can check with file yourfile.root, which should say something like “compression: 101”. If the values differ between the original dataset and the new dataset, that’s bound to change runtimes.)

Hi @ikabadzhov,

Thank you for your response. I’ll try to answer these points to the best of my knowledge (I’m quite new to this).

  1. I have approx. 36GB of data
  2. I don’t know what these commands are, so I’d say I’m not using either of them. I just add all the files to a TChain in a while loop.
  3. Yes, the files are approximately the same size. I’d say about 5% of them are larger than the others, but otherwise they are the same.
  4. I’m using v6.26
  5. I don’t use TEntryLists.

Cheers,
Rutik

Hi Enrico,

Thank you for the reply. I didn’t merge the whole 36 GB, only about half of it, and that output file has a compression of 101, whereas each individual file has a compression of 1. So that differs, I suppose.

Best,
Rutik

Uhm I’m not sure what 1 stands for, but this might be it. @pcanal what does a compression setting of 1 correspond to?
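For reference, ROOT encodes a compression setting as 100 * algorithm + level. The sketch below decodes such a value in plain C++ without depending on ROOT. The struct and function names are hypothetical, chosen only for illustration; the algorithm numbering follows ROOT's compression enum as far as I know it (0 meaning "use the global default").

```cpp
#include <string>

// Decoded form of a ROOT compression setting (setting = 100 * algorithm + level).
// These names are illustrative only; they are not part of the ROOT API.
struct CompressionSetting {
    int algorithm;  // 0 = use the global default, 1 = ZLIB, 2 = LZMA, 4 = LZ4, 5 = ZSTD
    int level;      // 0 = no compression, higher = more compression effort
};

// Split the combined integer into its algorithm and level parts.
CompressionSetting decodeCompression(int setting) {
    return {setting / 100, setting % 100};
}

// Human-readable name for the algorithm part of a setting.
std::string algorithmName(int algorithm) {
    switch (algorithm) {
        case 0: return "global default";
        case 1: return "ZLIB";
        case 2: return "LZMA";
        case 3: return "old ROOT algorithm";
        case 4: return "LZ4";
        case 5: return "ZSTD";
        default: return "unknown";
    }
}
```

With this decoding, 101 is ZLIB at level 1, while 1 is "global default" at level 1. If the global default in the build that wrote the files is ZLIB (an assumption worth checking), the two settings would describe the same compression.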

You can also add the files passing the -ff option to hadd, which tells it to use the same compression settings as the first of the input files – that should guarantee that the aggregated TTree is compressed with the same settings as the individual trees.

Now, a small runtime difference can be expected. The TChain has to do more work (at every entry, check whether it’s time to switch to a new file, open each new file and close the old one, etc.) – but as Ivan mentioned, that difference should be small compared to the time spent in actual I/O and data processing. What runtimes are we talking about (after using hadd -ff)?
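To make the extra bookkeeping concrete, here is a simplified model in plain C++ of how a chain could map a global entry number to a file. It is only a sketch and not ROOT's actual implementation; all function names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// offsets[i] is the global index of the first entry of file i;
// offsets.back() is the total number of entries in the chain.
std::vector<long long> buildOffsets(const std::vector<long long>& entriesPerFile) {
    std::vector<long long> offsets{0};
    for (long long n : entriesPerFile) offsets.push_back(offsets.back() + n);
    return offsets;
}

// Index of the file that contains globalEntry, or -1 if it is out of range.
int fileForEntry(const std::vector<long long>& offsets, long long globalEntry) {
    if (globalEntry < 0 || globalEntry >= offsets.back()) return -1;
    // First offset strictly greater than globalEntry belongs to the next file.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), globalEntry);
    return static_cast<int>(it - offsets.begin()) - 1;
}

// Number of times a file has to be opened during one sequential pass.
int countFileOpens(const std::vector<long long>& offsets) {
    int opens = 0;
    int current = -1;
    for (long long e = 0; e < offsets.back(); ++e) {
        int f = fileForEntry(offsets, e);
        if (f != current) {  // entry lives in a different file: switch
            ++opens;
            current = f;
        }
    }
    return opens;
}
```

In this model the per-entry check is cheap and the number of file opens equals the number of non-empty files, so the overhead grows with the file count rather than with the amount of data. That fits the expectation that it stays small relative to I/O and processing.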

Cheers,
Enrico

If you are using hadd and want to make sure there is no unexpected decompression/recompression, use the option -fk:

  -fk                                  Sets the target file to contain the baskets with the same compression
                                       as the input files (unless -O is specified). Compresses the meta data
                                       using the compression level specified in the first input or the
                                       compression setting after fk (for example 206 when using -fk206)