Merge NanoAOD root files

Soumyadip · June 1, 2020, 3:04pm

Hello,

I am trying to merge a large number of ROOT files in the NanoAOD format. Each file contains 5 TTrees, and each TTree has several branches. When I try to merge them with the “hadd” command, I get a large number of errors. In addition, only the first TTree is merged; the other TTrees are missing from the final merged file. Can you suggest something on this matter?

Thanks,
Soumyadip

pcanal · June 1, 2020, 3:22pm

What errors do you get? Which version of ROOT are you using?

Soumyadip · June 2, 2020, 7:57am

I am attaching a text file with the errors: log2.txt (9.3 KB). I am using ROOT 6.18/02. Sometimes I only get “CORE DUMPED” for the same ROOT files, so I am a bit confused about what is actually going wrong.

Thanks,
Soumyadip

pcanal · June 2, 2020, 3:46pm

Strange, it appears that the files “might” be corrupted. Are you able to read the input file (all branches and all entries)?
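
One way to run that check is a short PyROOT loop that reads every entry of every TTree in a file. This is only a sketch: it assumes PyROOT is importable, the helper names `count_unreadable_entries` and `check_file` are made up for illustration, and it treats a `GetEntry` return value of 0 or less as a failed read.

```python
def count_unreadable_entries(tree):
    """Read every entry of `tree` and count those that fail to load.

    TTree::GetEntry returns the number of bytes read; a value <= 0
    is treated here as a failed read (assumption stated in the text).
    """
    bad = 0
    for i in range(tree.GetEntries()):
        if tree.GetEntry(i) <= 0:
            bad += 1
    return bad


def check_file(path):
    """Return {tree name: number of unreadable entries} for one ROOT file."""
    import ROOT  # imported lazily so the helper above works without ROOT

    root_file = ROOT.TFile.Open(path)
    if not root_file or root_file.IsZombie():
        raise IOError("cannot open %s" % path)
    report = {}
    # A key name can appear once per cycle, so de-duplicate the names.
    for name in sorted({key.GetName() for key in root_file.GetListOfKeys()}):
        obj = root_file.Get(name)
        if obj and obj.InheritsFrom("TTree"):
            report[name] = count_unreadable_entries(obj)
    root_file.Close()
    return report


if __name__ == "__main__":
    import sys

    # usage: python check_nano.py input1.root input2.root ...
    for path in sys.argv[1:]:
        print(path, check_file(path))
```

A report with all zeros means every entry of every tree could be read back, which is what the question above is asking.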

Soumyadip · June 2, 2020, 4:00pm

Yes, I can read all the branches and entries of all the trees of the input files.

pcanal · June 2, 2020, 4:13pm

Then I will need to explore further. Can you share the input file?

Soumyadip · June 2, 2020, 6:32pm

I have placed the input files in the public area of my CERN account:

/afs/cern.ch/work/s/sobarman/public/NanoAOD_skimmer

nmangane · June 3, 2020, 3:38am

I take it that haddnano.py doesn’t work for your use-case?

Soumyadip · June 3, 2020, 7:10am

I haven’t used “haddnano.py”. Could you please tell me how to use it? Can it be used offline, or do I need a proper environment for it?

nmangane · June 3, 2020, 9:51pm

This script is in the nanoAOD-tools repository, and I don’t think you even need to source the standalone environment. It might be in CMSSW as well. The syntax is basically the same as hadd, as seen in the first few lines of the code:

github.com/cms-nanoAOD/nanoAOD-tools/blob/master/scripts/haddnano.py

#!/bin/env python
import ROOT
import numpy
import sys

if len(sys.argv) < 3 :
	print "Syntax: haddnano.py out.root input1.root input2.root ..."
ofname=sys.argv[1]
files=sys.argv[2:]

def zeroFill(tree,brName,brObj,allowNonBool=False) :
	# typename: (numpy type code, root type code)
	branch_type_dict = {'Bool_t':('?','O'), 'Float_t':('f4','F'), 'UInt_t':('u4','i'), 'Long64_t':('i8','L'), 'Double_t':('f8','D')}
	brType = brObj.GetLeaf(brName).GetTypeName()
	if (not allowNonBool) and (brType != "Bool_t") :
		print "Did not expect to back fill non-boolean branches",tree,brName,brType
	else :
		if brType not in branch_type_dict: raise RuntimeError, 'Impossible to backfill branch of type %s'%brType
		buff=numpy.zeros(1,dtype=numpy.dtype(branch_type_dict[brType][0]))
		b=tree.Branch(brName,buff,brName+"/"+branch_type_dict[brType][1])


Soumyadip · June 4, 2020, 6:22pm

Thanks, it worked. It merges all the TTrees.
The syntax I used was:
"python haddnano.py out.root input1.root input2.root "

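For a large number of inputs, the same call can be assembled from a file pattern instead of typing every name. The sketch below is only an illustration: the helper `build_haddnano_command`, the wrapper name `merge_nano.py`, the pattern `skims/*.root`, and the assumption that `haddnano.py` sits in the current directory are all hypothetical. Only the argument order (output first, then inputs) comes from the script's own syntax message.

```python
import glob
import subprocess
import sys


def build_haddnano_command(output, inputs, script="haddnano.py"):
    """Return the argv list for: python haddnano.py out.root in1.root ..."""
    inputs = sorted(inputs)
    if not inputs:
        raise ValueError("no input files given")
    if output in inputs:
        raise ValueError("output file is also listed as an input")
    return ["python", script, output] + inputs


if __name__ == "__main__" and len(sys.argv) > 2:
    # usage (hypothetical): python merge_nano.py out.root "skims/*.root"
    command = build_haddnano_command(sys.argv[1], glob.glob(sys.argv[2]))
    subprocess.check_call(command)
```

Quoting the pattern on the command line keeps the shell from expanding it, so the script sees a single pattern argument and expands it itself.
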
pcanal · June 9, 2020, 9:13pm

The files can also be merged via:

hadd -fk o.root  nano102x_on_mini102x_2018_data_d_NANO_ak4_98.root nano102x_on_mini102x_2018_data_d_NANO_ak4_99.root

The -fk switch asks hadd to write the output with the same compression settings as the input; otherwise it uses the default compression for the output.

Because NanoAOD files are (for good reasons) not compressed with the default algorithm, omitting -fk is an implicit request to recompress the data instead of using “fast copying”. Specifically, this means that the input data needs to be decompressed, unstreamed (objects created in memory), streamed back, and recompressed.

Then the warning:

Warning in <TClass::Init>: no dictionary for class edm::Hash<1> is available
Warning in <TClass::Init>: no dictionary for class edm::ParameterSetBlob is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessHistory is available
Warning in <TClass::Init>: no dictionary for class edm::ProcessConfiguration is available
Warning in <TClass::Init>: no dictionary for class

tells us that the CMSSW libraries are not available on the LD_LIBRARY_PATH (and thus cannot be automatically loaded by hadd).

This brings us to the actual cause of the problem (in addition to the inefficiency).

Some of the core CMSSW objects require their library in order to be streamed, i.e. they are not self-describing, and this leads the I/O to make wrong guesses about the content :(.

Long story short, you have 3 distinct solutions:

1. use the haddnano.py script
2. make sure that the CMSSW libraries are available and autoloadable
3. use the -fk switch.

Cheers,
Philippe.

pcanal · June 10, 2020, 12:47am

Looking deeper into the file, I discovered that the problem is not the missing CMSSW libraries but rather a bug in TTree cloning, in how it deals with arrays that grow from one file to the next. For some reason the code that is intended to support this case (and does in most cases) is not triggered for these files. I am investigating why.

Soumyadip · June 10, 2020, 6:38pm

Thanks for your detailed clarification.

pcanal · June 11, 2020, 6:32pm

The problem is now fixed in the master and in the following upcoming releases: 6.14/10, 6.16/02, 6.18/06, 6.20/08, 6.22/00 and 6.24/00 (but not 6.20/06).

Cheers,
Philippe.

Soumyadip · June 12, 2020, 8:04am

So, starting from the above-mentioned releases, plain “hadd” will work? Or do we still have to use “hadd -fk” in those releases?

pcanal · June 12, 2020, 8:31pm

With those upcoming releases, hadd will just work. However, unless you intentionally want the file compressed with zlib, using -fk will be much, much faster. [We are planning to make -fk the default in an upcoming release.]

Soumyadip · June 13, 2020, 4:57am

Great. Just one last question: the NanoAOD ROOT files are huge. Is there any way to reduce the size of the merged ROOT files?

pcanal · June 13, 2020, 6:41pm

It is probably already well compressed. To see if you can go further (with the fixed hadd), you can play with this argument:

  -f6                                  Use compression level 6. (See TFile::SetCompressionSettings for the support range of value.)

Using -f207 might give the smallest output file.
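
The number after `-f` packs two values into one integer: the compression algorithm times 100, plus the compression level. The sketch below decodes such a value. The helper name `describe_compression` is hypothetical, and the algorithm table is written from memory of ROOT's compression algorithm enum, so it should be checked against the ROOT version in use.

```python
# Assumed mapping of ROOT compression algorithm codes to names;
# verify against the documentation of your ROOT release.
ALGORITHMS = {
    0: "default",
    1: "ZLIB",
    2: "LZMA",
    3: "old ROOT algorithm",
    4: "LZ4",
    5: "ZSTD",
}


def describe_compression(setting):
    """Split a setting such as 207 into (algorithm name, level)."""
    if setting < 0:
        raise ValueError("compression setting must be non-negative")
    algorithm, level = divmod(setting, 100)
    return ALGORITHMS.get(algorithm, "unknown"), level


if __name__ == "__main__":
    for setting in (6, 207):
        name, level = describe_compression(setting)
        print("-f%d -> algorithm %s, level %d" % (setting, name, level))
```

Under that mapping, `-f207` requests LZMA at level 7, which typically produces smaller files at the cost of slower writing.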

Soumyadip · June 13, 2020, 8:02pm

Ok. Thanks for your help.