Merge NanoAOD root files

Hello,

I am trying to merge a large number of ROOT files that are in the NanoAOD format. Each ROOT file contains 5 TTrees, and each TTree has several branches. When I try to merge them with the “hadd” command, I get a large number of errors. Also, only the first TTree gets merged; the other TTrees are absent from the final merged ROOT file. Can you suggest something on this matter?

Thanks,
Soumyadip

what error do you get? which version of ROOT do you use?

I am attaching the “txt file” of the errors. I am using ROOT 6.18/02. log2.txt (9.3 KB). Sometimes I only get “CORE DUMPED” for the same ROOT files, so I am a bit confused about what is actually going wrong.

Thanks,
Soumyadip

Strange, it appears that the files “might” be corrupted. Are you able to read the input file (all branches and all entries)?
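
For example, a quick check along these lines (just a sketch; “input1.root” stands for one of your files) loops over every tree and every entry:

import ROOT

f = ROOT.TFile.Open("input1.root")
for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if not obj.InheritsFrom("TTree"):
        continue
    n_read = 0
    for i in range(obj.GetEntries()):
        # GetEntry returns the number of bytes read; <= 0 signals a problem
        if obj.GetEntry(i) > 0:
            n_read += 1
    print(obj.GetName(), ":", n_read, "/", obj.GetEntries(), "entries readable")
f.Close()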

Yes, I can read all the branches and entries of all the trees of the input files.

Then I will need to explore further. Can you share the input file?

I am placing the input files in the public folder of my CERN account.

/afs/cern.ch/work/s/sobarman/public/NanoAOD_skimmer

I take it that haddnano.py doesn’t work for your use-case?

I haven’t used “haddnano.py”. Could you please tell me the procedure to use it? And can it be used offline, or do I need a proper environment for that?

This script is in the nanoAOD-tools repository; I don’t think you even need to source the standalone environment. It might also be available in CMSSW. The syntax is basically the same as for hadd, as seen in the first few lines of the code.

Thanks, it worked. It merges all the TTrees.
The syntax I used is:
python haddnano.py out.root input1.root input2.root

The files can also be merged via:

hadd -fk o.root  nano102x_on_mini102x_2018_data_d_NANO_ak4_98.root nano102x_on_mini102x_2018_data_d_NANO_ak4_99.root 

The -fk option requests hadd to use the same compression for the output as for the input; otherwise it uses the default compression for the output.

Because the NanoAOD files are (for good reasons) not compressed with the default algorithm, without -fk there is an implicit request to recompress the data instead of using “fast copying”. Specifically, this means that the input data needs to be decompressed, unstreamed (objects created in memory), then streamed back and recompressed.
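
For illustration, here is a minimal PyROOT sketch (not from the thread; the file and tree names are placeholders) of what “fast copying” amounts to for a single tree: when the output compression matches the input, CloneTree with the “fast” option can move the already-compressed baskets along without recreating the event objects in memory.

import ROOT

fin = ROOT.TFile.Open("input1.root")
tree = fin.Get("Events")

fout = ROOT.TFile("copy.root", "RECREATE")
fout.SetCompressionSettings(fin.GetCompressionSettings())  # same idea as hadd -fk
fast_copy = tree.CloneTree(-1, "fast")  # copy all entries basket-by-basket
fast_copy.Write()
fout.Close()
fin.Close()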

Then the warning:

Warning in <TClass::Init>: no dictionary for class edm::Hash<1> is available
Warning in <TClass::Init>: no dictionary for class edm::ParameterSetBlob is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessHistory is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessConfiguration is available
Warning in <TClass::Init>: no dictionary for class 

tells us that the CMSSW libraries are not available on the LD_LIBRARY_PATH (and thus cannot be automatically loaded by hadd).
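
One quick way to check (a sketch, not from the thread) is to ask ROOT whether it can find a dictionary for one of the classes named in the warnings; with the CMSSW libraries on LD_LIBRARY_PATH this should succeed, without them it will not:

import ROOT

cl = ROOT.TClass.GetClass("edm::ProcessHistory")
if cl and cl.HasDictionary():
    print("dictionary available for", cl.GetName())
else:
    print("no dictionary for edm::ProcessHistory")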

This brings us to the actual cause of the problem (in addition to the inefficiency).

Some of the core CMSSW objects require their library in order to be streamed, i.e. they are not self-describing, and this leads the I/O to make wrong guesses about the content :(.

Long story short, you have 3 distinct solutions (example commands below):

  • use the haddnano.py script
  • make sure that the CMSSW libraries are available and autoloadable
  • use the -fk switch.
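
Roughly, the corresponding commands look like this (the file names are placeholders, and the second option assumes a CMSSW working area where cmsenv puts the libraries on LD_LIBRARY_PATH):

python haddnano.py merged.root input1.root input2.root
cmsenv && hadd merged.root input1.root input2.root
hadd -fk merged.root input1.root input2.root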

Cheers,
Philippe.

Looking deeper into the file, I discovered that the problem is not with the missing CMSSW libraries but rather a bug in TTree cloning, in how it deals with arrays that grow from one file to the next. For some reason the code that is intended to support this case (and does in most cases) is not triggered for these files, so I am investigating why.

Thanks for your detailed clarification.

The problem is now fixed in the master and in the following upcoming releases (6.14/10, 6.16/02, 6.18/06, 6.22/00, 6.24/00, 6.20/08, but not 6.20/06).

Cheers,
Philippe.

So, starting from the above-mentioned releases, “hadd” will just work? Or do we have to use “hadd -fk” in those releases as well?

With those upcoming releases, hadd will just work. However (unless you intentionally want the file compressed with zlib), using -fk will be much, much faster. [We are planning on making -fk the default in an upcoming release.]

Great. Just one last question. The NanoAOD ROOT files are huge. So, is there any way to compress the merged ROOT files further?

It is probably already well compressed. To see whether you can go further (with the fixed hadd), you can play with the argument:

  -f6    Use compression level 6. (See TFile::SetCompressionSettings for the supported range of values.)

Using -f207 might give the smallest output file.
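
For example (the file names are placeholders), 207 encodes algorithm 2 (LZMA) at compression level 7, since the setting is 100 * algorithm + level:

hadd -f207 merged.root input1.root input2.root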

Ok. Thanks for your help.