Upper limit on size of root file with hadd

Hi,

I am trying to hadd 15 ROOT files with a total size of ~18 GB. hadd starts merging the files but never finishes.

I created these 15 ROOT files from ~1500 different ROOT files. What do you think might be the issue? Is there any way I can find out what's going on?

with regards,
Ram


ROOT Version: ROOT 6.06/01
Platform: FNAL LPC
Compiler: Not Provided


Hi @ramkrishna,
I don’t know what could be going wrong (maybe @pcanal or @Axel have an idea), but you can check what the program is spending its time on. Use a ROOT build with debug symbols: on lxplus, for example, you can find /cvmfs/sft.cern.ch/lcg/views/LCG_95a/x86_64-centos7-gcc8-dbg and similar builds (note the “dbg” at the end). Run hadd through gdb (gdb --args hadd ...), and when you believe the program is stuck, press ctrl-C and type backtrace or thread apply all backtrace to see what each thread is doing. You can then type continue, let the program run for a bit, and press ctrl-C again. This way you can monitor exactly what’s going on. If you see that threads are spending time on useless work, or if you need help interpreting the stack traces, you can copy-paste them here.

Cheers,
Enrico
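
As a non-interactive variant of the procedure above, one could attach gdb to an already running hadd process and dump all thread backtraces in one shot. The sketch below only builds that command line. The helper name gdb_backtrace_command and the PID 12345 are placeholders, and attaching to a process may need extra permissions on some systems.

```python
import shlex


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to a running process,
    prints the backtrace of every thread, and then exits (detaching)."""
    return [
        "gdb",
        "-p", str(pid),                       # attach to the given process ID
        "-batch",                             # run the commands below, then quit
        "-ex", "thread apply all backtrace",  # same command as in the interactive session
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID; look up the real one with e.g. `pgrep hadd`.
    print(shlex.join(gdb_backtrace_command(12345)))
```

Running the printed command a few times, some seconds apart, gives the same kind of sampling as pressing ctrl-C repeatedly in an interactive gdb session.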

Can you share one such ROOT file? Do you hadd mostly histograms or trees?

If it’s histograms it might well be the TDirectory registration - @pcanal (who’s currently on vacation), what are your plans with that?

@eguiraud: I tried

$ gdb --args hadd /tmp/rasharma/output.root temp_*.root
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
(gdb) 

But it did not do anything.

@Axel: the ROOT files that I am trying to hadd each contain only one tree, which has many branches. You can find them here

/afs/cern.ch/user/r/rasharma/work/public/temp/root_files

You have to give the run command to start the program, I forgot to mention it :sweat_smile:

I strongly suggest looking up a gdb tutorial if you are going to do a lot of C++ coding. Knowing how to use a debugger is a huge quality-of-life improvement for a coder scientist.

Thanks @eguiraud for the suggestions. I will learn gdb.

I looked at the gdb output, but I am currently unable to understand it, so I am pasting it here:

$ gdb --args hadd /tmp/rasharma/output.root temp_*.root
(gdb) run
Starting program: /usr/bin/hadd /tmp/rasharma/output.root temp_0_temp.root temp_10_temp.root temp_11_temp.root temp_12_temp.root temp_13_temp.root temp_14_temp.root temp_1_temp.root temp_2_temp.root temp_3_temp.root temp_4_temp.root temp_5_temp.root temp_6_temp.root temp_7_temp.root temp_8_temp.root temp_9_temp.root
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 21825.
hadd Target file: /tmp/rasharma/output.root
hadd Source file 1: temp_0_temp.root
hadd Source file 2: temp_10_temp.root
hadd Source file 3: temp_11_temp.root
hadd Source file 4: temp_12_temp.root
hadd Source file 5: temp_13_temp.root
hadd Source file 6: temp_14_temp.root
hadd Source file 7: temp_1_temp.root
hadd Source file 8: temp_2_temp.root
hadd Source file 9: temp_3_temp.root
hadd Source file 10: temp_4_temp.root
hadd Source file 11: temp_5_temp.root
hadd Source file 12: temp_6_temp.root
hadd Source file 13: temp_7_temp.root
hadd Source file 14: temp_8_temp.root
hadd Source file 15: temp_9_temp.root
hadd Target path: /tmp/rasharma/output.root:/



^C
Program received signal SIGINT, Interrupt.
0x00007ffff64a5b45 in memcpy () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007ffff64a5b45 in memcpy () from /lib64/libc.so.6
#1  0x00007ffff70c142e in TStorage::ReAllocChar(char*, unsigned long, unsigned long) ()
   from /usr/lib64/root/libCore.so.5.34
#2  0x00007ffff70807e1 in TBuffer::Expand(int, bool) () from /usr/lib64/root/libCore.so.5.34
#3  0x00007ffff793e2fb in TBufferFile::WriteUInt(unsigned int) () from /usr/lib64/root/libRIO.so.5.34
#4  0x00007ffff793cdd2 in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#5  0x00007ffff7938e35 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#6  0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#7  0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#8  0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#9  0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#10 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#11 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#12 0x00007ffff359d088 in TBranch::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#13 0x00007ffff793cd3e in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#14 0x00007ffff7938d94 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#15 0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#16 0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#17 0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#18 0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#19 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#20 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#21 0x00007ffff35f204c in TTree::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#22 0x00007ffff79a8421 in TKey::TKey(TObject const*, char const*, int, TDirectory*) ()
   from /usr/lib64/root/libRIO.so.5.34
#23 0x00007ffff797957a in TFile::CreateKey(TDirectory*, TObject const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#24 0x00007ffff7971a17 in TDirectoryFile::WriteTObject(TObject const*, char const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#25 0x00007ffff70a388e in TObject::Write(char const*, int, int) const () from /usr/lib64/root/libCore.so.5.34
#26 0x00007ffff798fc33 in TFileMerger::MergeRecursive(TDirectory*, TList*, int) () from /usr/lib64/root/libRIO.so.5.34
#27 0x00007ffff798d9ec in TFileMerger::PartialMerge(int) () from /usr/lib64/root/libRIO.so.5.34
#28 0x0000000000402e2a in main ()
(gdb) 




(gdb) continue
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff64a5b45 in memcpy () from /lib64/libc.so.6
(gdb) backtrace
#0  0x00007ffff64a5b45 in memcpy () from /lib64/libc.so.6
#1  0x00007ffff70c142e in TStorage::ReAllocChar(char*, unsigned long, unsigned long) ()
   from /usr/lib64/root/libCore.so.5.34
#2  0x00007ffff70807e1 in TBuffer::Expand(int, bool) () from /usr/lib64/root/libCore.so.5.34
#3  0x00007ffff793e2fb in TBufferFile::WriteUInt(unsigned int) () from /usr/lib64/root/libRIO.so.5.34
#4  0x00007ffff793cdd2 in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#5  0x00007ffff7938e35 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#6  0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#7  0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#8  0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#9  0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#10 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#11 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#12 0x00007ffff359d088 in TBranch::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#13 0x00007ffff793cd3e in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#14 0x00007ffff7938d94 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#15 0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#16 0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#17 0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#18 0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#19 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#20 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#21 0x00007ffff35f204c in TTree::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#22 0x00007ffff79a8421 in TKey::TKey(TObject const*, char const*, int, TDirectory*) ()
   from /usr/lib64/root/libRIO.so.5.34
#23 0x00007ffff797957a in TFile::CreateKey(TDirectory*, TObject const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#24 0x00007ffff7971a17 in TDirectoryFile::WriteTObject(TObject const*, char const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#25 0x00007ffff70a388e in TObject::Write(char const*, int, int) const () from /usr/lib64/root/libCore.so.5.34
#26 0x00007ffff798fc33 in TFileMerger::MergeRecursive(TDirectory*, TList*, int) () from /usr/lib64/root/libRIO.so.5.34
#27 0x00007ffff798d9ec in TFileMerger::PartialMerge(int) () from /usr/lib64/root/libRIO.so.5.34
#28 0x0000000000402e2a in main ()



(gdb) thread apply all backtrace

Thread 1 (Thread 0x7ffff7fc40e0 (LWP 21820)):
#0  0x00007ffff64a5b45 in memcpy () from /lib64/libc.so.6
#1  0x00007ffff70c142e in TStorage::ReAllocChar(char*, unsigned long, unsigned long) ()
   from /usr/lib64/root/libCore.so.5.34
#2  0x00007ffff70807e1 in TBuffer::Expand(int, bool) () from /usr/lib64/root/libCore.so.5.34
#3  0x00007ffff793e2fb in TBufferFile::WriteUInt(unsigned int) () from /usr/lib64/root/libRIO.so.5.34
#4  0x00007ffff793cdd2 in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#5  0x00007ffff7938e35 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#6  0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#7  0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#8  0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#9  0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#10 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#11 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#12 0x00007ffff359d088 in TBranch::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#13 0x00007ffff793cd3e in TBufferFile::WriteObjectClass(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#14 0x00007ffff7938d94 in TBufferFile::WriteObjectAny(void const*, TClass const*) ()
   from /usr/lib64/root/libRIO.so.5.34
#15 0x00007ffff71080db in TObjArray::Streamer(TBuffer&) () from /usr/lib64/root/libCore.so.5.34
#16 0x00007ffff793b8a9 in TBufferFile::WriteFastArray(void*, TClass const*, int, TMemberStreamer*) ()
   from /usr/lib64/root/libRIO.so.5.34
#17 0x00007ffff7ac4f08 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /usr/lib64/root/libRIO.so.5.34
#18 0x00007ffff79ce83d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /usr/lib64/root/libRIO.so.5.34
#19 0x00007ffff79374d5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) ()
   from /usr/lib64/root/libRIO.so.5.34
#20 0x00007ffff793d857 in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /usr/lib64/root/libRIO.so.5.34
#21 0x00007ffff35f204c in TTree::Streamer(TBuffer&) () from /usr/lib64/root/libTree.so.5.34
#22 0x00007ffff79a8421 in TKey::TKey(TObject const*, char const*, int, TDirectory*) ()
   from /usr/lib64/root/libRIO.so.5.34
#23 0x00007ffff797957a in TFile::CreateKey(TDirectory*, TObject const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#24 0x00007ffff7971a17 in TDirectoryFile::WriteTObject(TObject const*, char const*, char const*, int) ()
   from /usr/lib64/root/libRIO.so.5.34
#25 0x00007ffff70a388e in TObject::Write(char const*, int, int) const () from /usr/lib64/root/libCore.so.5.34
#26 0x00007ffff798fc33 in TFileMerger::MergeRecursive(TDirectory*, TList*, int) () from /usr/lib64/root/libRIO.so.5.34
#27 0x00007ffff798d9ec in TFileMerger::PartialMerge(int) () from /usr/lib64/root/libRIO.so.5.34
#28 0x0000000000402e2a in main ()
(gdb) 

Hi,
yes one needs a bit of knowledge of ROOT internals to interpret these stacktraces – as far as I can tell, they say that hadd is busy doing useful work!

@Axel, I think we might need your help here…

P.S.

I wanted to check whether the merge was slow for me too, but although I can see the files in the directory, I can’t access the files themselves.

Another question is…how slow is slow? What’s the expected runtime for a task like this? I’m not sure…

@eguiraud & @Axel

I would like to add one more piece of information. When I tried to add the files in chunks of 4, they were hadd-ed together without any issue. But I am again unable to hadd these new ROOT files into a single ROOT file.
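
The staged merge described here can be sketched as a small helper that prints the hadd commands for each stage. This is only an illustration: the function names and the partial_ prefix are made up, and it relies on nothing beyond the basic form hadd target source1 source2 .... It reproduces the staging and does not by itself avoid the failure of the final step reported above.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def staged_hadd_commands(sources, target, chunk_size, prefix="partial_"):
    """Return the hadd command lines for a two-stage merge:
    first one partial file per chunk, then one merge of all partial files."""
    commands = []
    partials = []
    for index, chunk in enumerate(chunked(sources, chunk_size)):
        partial = f"{prefix}{index}.root"  # hypothetical intermediate file name
        partials.append(partial)
        commands.append(["hadd", partial, *chunk])
    commands.append(["hadd", target, *partials])
    return commands


if __name__ == "__main__":
    # File names follow the temp_N_temp.root pattern seen in the gdb log.
    inputs = [f"temp_{i}_temp.root" for i in range(15)]
    for command in staged_hadd_commands(inputs, "output.root", 4):
        print(" ".join(command))
```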

Can you give us access to these files? fs setacl -acl system:anyuser read -dir /afs/cern.ch/user/r/rasharma/work/public/temp/root_files should work.

Now, you can access the files.

I can reproduce that it takes ages. It’s 19 GB in total, and AFS is really slow, so it might just be an AFS issue. Indeed, running hadd -j 4 shows that the processes have a CPU load of only a couple of percent! You should probably move the files to /tmp (if there’s enough space) and merge there, or do the merge on your local desktop or university computer, not on AFS.
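
Whether there is "enough space" can be checked before copying anything. The sketch below compares the total size of the input files with the free space in the scratch directory. The function names are made up, and the factor of 2 is an assumption: the scratch area has to hold the copied inputs plus a merged output of roughly the same size.

```python
import os
import shutil


def total_size(paths):
    """Sum of the on-disk sizes of the given files, in bytes."""
    return sum(os.path.getsize(path) for path in paths)


def fits_for_merge(paths, scratch_dir, safety_factor=2.0):
    """Return True if scratch_dir has room for the inputs and the merged output.

    safety_factor=2.0 is an assumption: copied inputs plus an output
    of about the same total size.
    """
    needed = total_size(paths) * safety_factor
    free = shutil.disk_usage(scratch_dir).free
    return needed <= free


if __name__ == "__main__":
    import glob

    # Placeholder pattern; adjust to where the input files actually live.
    inputs = glob.glob("temp_*_temp.root")
    print("inputs:", total_size(inputs), "bytes")
    print("fits in /tmp:", fits_for_merge(inputs, "/tmp"))
```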

@Axel: I already tried running hadd with the files in the /tmp/<username> area on lxplus, but the problem persists.

I have not tried it on my local PC yet. I will try and let you know what happens.

Hi @Axel: I tried on my local PC and it did not work.

I also tried something else on my local PC.

First I hadd-ed the files in groups of 5, which gave me 3 ROOT files in total. Then I tried to hadd these three files into one ROOT file. It ran, but with error messages:

(base) visitor-51206051:rootfilees ram$ hadd output___.root output_1.root output_2.root output_3.root 
hadd Target file: output___.root
hadd compression setting for all ouput: 1
hadd Source file 1: output_1.root
hadd Source file 2: output_2.root
hadd Source file 3: output_3.root
hadd Target path: output___.root:/
Error in <TBufferFile::WriteByteCount>: bytecount too large (more than 1073741822)
Error in <TBufferFile::WriteByteCount>: bytecount too large (more than 1073741822)

Also, when I tried to open the final created root file I got this message:

(base) visitor-51206051:rootfilees ram$ root -l output___.root 
.ls
root [0] 
Attaching file output___.root as _file0...
(TFile *) 0x7fb42fa00210
root [1] .ls
TFile**		output___.root	
 TFile*		output___.root	
  KEY: TTree	otree;1	otree
root [2] otree->Print()
Error in <TBufferFile::CheckByteCount>: object of class TObjArray read too many bytes: 1122432445 instead of 48690621
Warning in <TBufferFile::CheckByteCount>: TObjArray::Streamer() not in sync with data on file output___.root, fix Streamer()
Error in <TBufferFile::CheckByteCount>: object of class TTree read too few bytes: 48690858 instead of 48693857
******************************************************************************
*Tree    :otree     : otree                                                  *
*Entries : 191047414 : Total =   1463527708432 bytes  File  Size = 19219866309 *
*        :          : Tree compression factor =  77.16                       *
******************************************************************************
*Br    0 :LHEWeight : LHEWeight[1164]/F                                      *
*Entries :191047414 : Total  Size=892381278169 bytes  File Size  = 7132867852 *
*Baskets : 29531116 : Basket Size=      32000 bytes  Compression= 125.03     *
*............................................................................*

What do you say about this error?

The tree itself shouldn’t be the issue - it’s likely other data stored in the file: for those there’s a size limit of (IIRC) 1GB per key. As this seems to be part of the tree, somehow, it’s indeed very surprising. @pcanal ideas? Can we get access to the file?
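
The numbers in the error messages above can be checked against each other. The write-side limit printed by hadd, 1073741822, equals 2^30 - 2, and the two byte counts in the TObjArray read error differ by exactly 2^30. The sketch below verifies this arithmetic. Reading it as "only the low 30 bits of the byte count ended up on file" is an assumption, not something confirmed in this thread, and the names are illustrative.

```python
# Largest byte count hadd said it could write (copied from the WriteByteCount error).
MAX_BYTE_COUNT = 1073741822

# Assumption: byte counts are limited to 30 bits, since 1073741822 == 2**30 - 2.
COUNT_MODULUS = 2 ** 30


def stored_count(true_size):
    """Byte count that would be recorded if only the low 30 bits survive."""
    return true_size % COUNT_MODULUS


# Numbers copied from the CheckByteCount error for class TObjArray.
bytes_actually_read = 1122432445
bytes_recorded_on_file = 48690621

if __name__ == "__main__":
    print("limit is 2**30 - 2:", MAX_BYTE_COUNT == COUNT_MODULUS - 2)
    print("difference is 2**30:", bytes_actually_read - bytes_recorded_on_file == COUNT_MODULUS)
    print("wrapped count matches:", stored_count(bytes_actually_read) == bytes_recorded_on_file)
    print("object exceeds limit:", bytes_actually_read > MAX_BYTE_COUNT)
```

Under that assumption, the TObjArray written as part of the tree was about 1.12 GB, just over the limit, which would match both the write errors and the read mismatch.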

The crash you see is probably fixed by root-project/root pull request #14627, "[io] fix crash due to overflow in buffer length variable" (by ferdymercury).

The limitation itself is not yet addressed; see root-project/root issue #6734, "Overcome 1GB size limit for IO buffers".
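
While the limit is in place, a rough estimate of how much per-basket bookkeeping a tree carries can hint at whether a merge is likely to run into it. The sketch below uses the LHEWeight figures from the otree->Print() output earlier in the thread. The 20 bytes per basket is an assumption (one 4-byte length plus two 8-byte offsets per basket in the branch's index arrays), so treat the result as an order-of-magnitude figure only.

```python
# Limit copied from the WriteByteCount error message earlier in the thread.
MAX_BYTE_COUNT = 1073741822

# Assumption: per basket, a branch keeps one 4-byte length and two 8-byte
# offsets (first entry number and position in the file).
ASSUMED_INDEX_BYTES_PER_BASKET = 4 + 8 + 8


def basket_index_bytes(n_baskets, bytes_per_basket=ASSUMED_INDEX_BYTES_PER_BASKET):
    """Rough size of the per-basket bookkeeping of one branch, in bytes."""
    return n_baskets * bytes_per_basket


def entries_per_basket(n_entries, n_baskets):
    """Average number of entries stored in each basket."""
    return n_entries / n_baskets


# Figures for branch LHEWeight, copied from the otree->Print() output.
lheweight_entries = 191047414
lheweight_baskets = 29531116

if __name__ == "__main__":
    index_bytes = basket_index_bytes(lheweight_baskets)
    print("entries per basket:", round(entries_per_basket(lheweight_entries, lheweight_baskets), 2))
    print("estimated bookkeeping bytes:", index_bytes)
    print("fraction of the limit:", round(index_bytes / MAX_BYTE_COUNT, 2))
```

With only about six entries per 32000-byte basket, this one branch has tens of millions of baskets, so under the stated assumption its bookkeeping alone already takes up more than half of the limit.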