Hadd in root 5.18.00d on lxplus

Hi,

I am running into problems with hadd on lxplus. I'm using /afs/cern.ch/sw/lcg/external/root/5.18.00d/slc4_ia32_gcc34/root/bin/root.

Trying to hadd the files:
/afs/cern.ch/user/v/vankovp/public/CBNTTimeDependentHists_1.root
/afs/cern.ch/user/v/vankovp/public/CBNTTimeDependentHists_2.root

I’m getting the following:

Target file: my.root
Source file 1: CBNTTimeDependentHists_1.root
Warning in TClass::TClass: no dictionary for class AttributeListLayout is available
Warning in TClass::TClass: no dictionary for class pair<string,string> is available
Source file 2: CBNTTimeDependentHists_2.root
Target path: my.root:/
Unknown object type, name: TObject title: Basic ROOT object
Found subdirectory StripHits
Target path: my.root:/StripHits
error -4 in deflateInit (zlib)
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
Abort
126.247u 7.834s 2:15.57 98.8% 0+0k 0+0io 12pf+0w

Thanks in advance for any help,
Peter.

Hi,

This problem has been fixed in ROOT v5.20.

Cheers,
Philippe.

The problem is with the ROOT installation on lxplus. I'm able to run it with both ROOT 5.18 and 5.20 on my laptop, but on lxplus it crashes …
Peter.

Hi again,

Could it be that, because hadd is a very memory-consuming operation, at some point the runtime memory limit on lxplus is reached and the process is killed?

Thanks, Peter.

Hi,

I was able to reproduce the crash with 5.18 on my machine (outside of lxplus). So I suspect it is more likely a problem where the code is incorrect in such a way that the behavior is random (i.e. it works by luck on your machine).

Cheers,
Philippe.

Hi Philippe,

Ok, it might be the case, thanks.

But running with ROOT 5.20 on lxplus crashes too. I think it is due to hitting the user runtime memory limit (~1.4 GB) that is (probably) in force on lxplus.

Do you know whether it is normal for hadd to consume so much memory? Isn't this an indication of an internal memory leak?

Cheers, Peter.

Hi,

1.6 GB is the in-memory size of your data set: you have 8176 histograms with 50000 bins each. hadd loads each and every histogram from the first file, then closes it, then loads and adds the corresponding histograms from the second file.

Cheers,
Philippe.

Hi, Philippe!

I'd like to revive this thread, since I'm facing a similar problem: using hadd with e.g. two input files is fine; with e.g. six, hadd crashes with error messages like these (snippets; full stdout in /afs/cern.ch/user/v/vadler/public/RooT/hadd.stdout):

error -4 in deflateInit (zlib)
R__unzip: error in inflateInit (zlib)
*** Break *** segmentation violation
error -4 in deflateInit (zlib)
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Attaching to program: /proc/19897/exe, process 19897
(no debugging symbols found)…done.
[…]
(no debugging symbols found)…done.
[Thread debugging using libthread_db enabled]
[New Thread 1459541856 (LWP 19897)]
0xffffe410 in __kernel_vsyscall ()
#1 0x004a8f13 in __waitpid_nocancel () from /lib/tls/libc.so.6
#2 0x004527b9 in do_system () from /lib/tls/libc.so.6
#3 0x0059198d in system () from /lib/tls/libpthread.so.0
#4 0x5572ed4d in TUnixSystem::Exec ()
from /afs/cern.ch/cms/sw/slc4_ia32_gcc345/cms/cmssw/CMSSW_2_1_9/external/slc4_ia32_gcc345/lib/libCore.so
#5 0x557353d0 in TUnixSystem::StackTrace ()
from /afs/cern.ch/cms/sw/slc4_ia32_gcc345/cms/cmssw/CMSSW_2_1_9/external/slc4_ia32_gcc345/lib/libCore.so
#6 0x55731959 in TUnixSystem::DispatchSignals ()
from /afs/cern.ch/cms/sw/slc4_ia32_gcc345/cms/cmssw/CMSSW_2_1_9/external/slc4_ia32_gcc345/lib/libCore.so
#7 0x55731a03 in SigHandler ()
from /afs/cern.ch/cms/sw/slc4_ia32_gcc345/cms/cmssw/CMSSW_2_1_9/external/slc4_ia32_gcc345/lib/libCore.so
#8 0x55730aae in sighandler ()
from /afs/cern.ch/cms/sw/slc4_ia32_gcc345/cms/cmssw/CMSSW_2_1_9/external/slc4_ia32_gcc345/lib/libCore.so
#9
#10 0x0804a686 in MergeRootfile ()
[…]
#20 0x0804a2c2 in MergeRootfile ()
#21 0x0804bb59 in main ()
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
Abort

From your last posting I understand that the behavior should not depend on the number of files, but it does. (Is this possibly the fix in ROOT 5.20 you mentioned?)

I am currently using 5.18.00a-cms12 (CMSSW_2_1_9), but I also tried 5.21.02 without seeing any improvement. (In fact it crashes even faster; I suspect the CMSSW adaptation is missing there.)

Does this finally mean that I have to wait until a 5.2X version of ROOT is used within CMSSW?

Thank you in advance…

Ciao
Volker

Hi Volker,

To pursue this further I would need access to the input file in order to try to reproduce the problem.

Cheers,
Philippe.

Hi, Philippe!

The files are on CASTOR:
/castor/cern.ch/user/c/cctrack/DQM/R000057553

Ciao
Volker

Hi,

I do not have (easy?) access to castor.

Cheers,
Philippe.

Hi Volker,

Those files have a very large number of histograms, and hadd might be too generic a tool to handle this specific case. For example, closing the file is slow because hadd cannot make any assumptions about the organization of the file, while you can. I am guessing that those files are the same as, or similar to, those Philipp Schieferdecker has been working on, and he may already have a pre-packaged tool for you to use.

Cheers,
Philippe.

Hi Volker,

I ran your example successfully with 1, 2 and 3 files. However, the memory used goes pretty high (up to 1.9 GB) and increases slightly with the number of files. So even though there is no leak per se, hadd is not appropriate for your number of histograms if you have only 2 GB of RAM. [For performance reasons in the general case, hadd keeps in memory all the histograms that are being added; the alternative would be to write out (and then read back) each histogram as it is read.]

Cheers,
Philippe.

Hi, Philippe!

So it looks like I have to accept the status quo for the moment :frowning:
However, thanks a lot for your diagnosis…

Ciao
Volker