I am trying to merge a large number of ROOT files in the NanoAOD format. Each file contains 5 TTrees, and each TTree has several branches. When I try to merge them with the “hadd” command, I get a large number of errors. In addition, only the first TTree is merged; the other TTrees are absent from the final merged file. Can you suggest something on this matter?
I am attaching a text file with the errors: log2.txt (9.3 KB). I am using ROOT 6.18/02. Sometimes I only get “CORE DUMPED” for the same ROOT files, so I am a bit confused about what is actually going wrong.
I haven’t used “haddnano.py”. Could you please tell me how to use it? Can it be used offline, or do I need a proper environment for that?
This script is in the nanoAOD-tools repository; I don’t think you even need to source the standalone environment. It might also be in CMSSW. The syntax is basically the same as hadd, as seen in the first few lines of the code.
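Since the syntax mirrors hadd (output file first, then the inputs), a call can be sketched as below. This is only an illustration: the script location and the file names are hypothetical, and it assumes that the chosen Python interpreter can import PyROOT.

```python
import shlex
import sys


def haddnano_command(script, output, inputs, python=sys.executable):
    """Build the argument list: <python> <script> <output> <input1> <input2> ..."""
    if not inputs:
        raise ValueError("need at least one input file")
    return [python, script, output] + list(inputs)


# Hypothetical paths, for illustration only.
cmd = haddnano_command("haddnano.py", "merged.root", ["nano_1.root", "nano_2.root"])
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run the merge (assumes PyROOT is importable by that interpreter):
# import subprocess
# subprocess.run(cmd, check=True)
```

The run step is left commented out so that the command can be inspected before anything is written.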
The -fk option asks hadd to use the same compression for the output as for the input; otherwise it uses the default compression for the output.
Because the NanoAOD files are (for good reasons) not compressed with the default algorithm, omitting -fk is an implicit request to recompress the data instead of using “fast copying”. Specifically, this means that the input data needs to be decompressed, unstreamed (objects created in memory), streamed back, and recompressed.
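For reference, the two variants of the hadd call differ only in the -fk flag. A minimal sketch that builds both command lines follows; the file names are hypothetical, and the helper hadd_command is just an illustration, not part of ROOT.

```python
import shlex


def hadd_command(output, inputs, keep_compression=True):
    """Build a hadd command line; -fk keeps the compression of the input files."""
    cmd = ["hadd"]
    if keep_compression:
        cmd.append("-fk")
    cmd.append(output)
    cmd.extend(inputs)
    return cmd


# Hypothetical file names, for illustration only.
inputs = ["nano_1.root", "nano_2.root"]
fast = hadd_command("merged.root", inputs)                          # same compression as the inputs
slow = hadd_command("merged.root", inputs, keep_compression=False)  # default output compression

for cmd in (fast, slow):
    print(" ".join(shlex.quote(arg) for arg in cmd))

# To run one of them (requires hadd on the PATH):
# import subprocess
# subprocess.run(fast, check=True)
```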
Then the warning:
Warning in <TClass::Init>: no dictionary for class edm::Hash<1> is available
Warning in <TClass::Init>: no dictionary for class edm::ParameterSetBlob is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessHistory is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessConfiguration is available
Warning in <TClass::Init>: no dictionary for class
tells us that the CMSSW libraries are not available on the LD_LIBRARY_PATH (and thus cannot be automatically loaded by hadd).
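One quick way to see whether a given library is visible is to search the directories listed in LD_LIBRARY_PATH. The sketch below does only that. The library file name is an assumption (a guess at where the edm provenance classes live), so substitute whatever your CMSSW release actually provides. Note also that finding the file is necessary but may not be sufficient for autoloading.

```python
import os


def find_on_library_path(filename, env=None):
    """Return every match for `filename` in the directories of LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    hits = []
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, filename)
        if directory and os.path.isfile(candidate):
            hits.append(candidate)
    return hits


# Assumed library name, for illustration only; adjust it to your release.
matches = find_on_library_path("libDataFormatsProvenance.so")
print(matches if matches else "not found on LD_LIBRARY_PATH")
```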
This brings us to the actual cause of the problem (in addition to the inefficiency).
Some of the core CMSSW objects require their library in order to be streamed, i.e. they are not self-describing, and this leads the I/O to make wrong guesses about the content :(.
Long story short you have 3 distinct solutions:
use haddnano.py script
make sure that the CMSSW libraries are available and autoloadable
Looking deeper into the file, I discovered that the problem is not the missing CMSSW libraries but rather a bug in TTree cloning, in how it deals with arrays that grow from one file to the next. For some reason the code that is intended to support this case (and does in most cases) is not triggered here. So I am investigating why.
With the upcoming releases, hadd will just work. However (unless you mean to have the file compressed with zlib intentionally), using -fk will be much, much faster. [We are planning on making -fk the default in an upcoming release.]