Hadd - corrupted entries with different target path and large files


I’ve encountered a problem with hadd, but I could not open a bug report, so I’m trying here.

The problem shows the same behaviour on:
@lxplus (SLC 6.8), standard lxplus ROOT version 5.34/36 and gcc

@lxplus7 (CentOS 7.2.1511), ROOT 6.08.02 with gcc 4.8:
source /cvmfs/sft.cern.ch/lcg/contrib/gcc/4.8/x86_64-centos7-gcc48-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.08.02/x86_64-centos7-gcc48-opt/root/bin/thisroot.sh

I noticed that in some cases I have corrupted entries in root files above a certain size, never with smaller files. With these larger files GetEntry delivers:

This message is why I suspect the file size, although I have been able to read files of up to 3 GB without problems. It is always only one branch of a tree that is corrupted, but which branch appears to be random, and it is not always the same tree.

I have many small ROOT files (on AFS) coming from lxbatch jobs, and they are merged automatically with hadd, again onto AFS. I can trace the corrupted entry in the large file back to its entry number in the original small ROOT file, where that entry is not corrupted. Conclusion: the small files that come out of my routines are not corrupted!

I’m aware that there used to be a file size limit of about 2 GB, which may be where the message above comes from, but that is not the cause either, because:
The original directory (= source) of the small files and the target directory (= final) for the hadd command are not the same. By ‘not the same’ I mean that they are on different partitions/project spaces. I’m sorry if I’m lacking the correct term for it, but both directories are 100 GB in size, if that helps to establish the correct context. To clarify:

hadd final/finalrootfile.root source/*.root --> corrupted

hadd source/finalrootfile.root source/*.root --> NOT corrupted

If I move finalrootfile.root to final/ afterwards, it is fine. To me this is a clear indication that the problem must come from hadd or the system. hadd shows no error messages when creating the files.
I emphasize again that the source and final directories are on different project spaces/partitions, or whatever the correct term is.

Additional information:
If I hadd five “good”, normal-sized ROOT files (approx. 2 GB each), which are themselves already final ROOT files in the sense above, from EOS to AFS, reading the resulting 10 GB file works without a problem. So AFS to AFS seems to be the problem.

Just to put it on the table, I’m aware of bus errors from HTCondor to EOS/AFS at the moment but not from AFS to AFS itself.

I’m aware that this may be very hard to reproduce but any ideas?


PS: the mid-term fix is obviously to merge in the source directory rather than in the final one, and then move the result. The long-term fix is to switch to EOS anyway, but it would still be interesting to understand the issue.
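
For reference, a minimal sketch of how the "merge on source, then move" workaround could be scripted so that the move itself is checked. The helper name verified_move is hypothetical (it is not part of ROOT or hadd), and it assumes a POSIX shell with cp, cmp and rm available. It only compares bytes as read back by the client, which on AFS may be served from the local cache, so it is a sanity check rather than a proof that the file on the server is intact.

```shell
#!/bin/sh
# verified_move SRC DST  (hypothetical helper, not part of ROOT or hadd)
# Copy SRC to the file path DST, compare both byte for byte, and delete
# SRC only if the copy is identical. DST must be a file path, not a directory.
verified_move() {
    src=$1
    dst=$2
    # Copy first; if the copy fails, the original is left untouched.
    cp -- "$src" "$dst" || return 1
    # cmp -s exits with 0 only if both files have identical contents.
    if cmp -s -- "$src" "$dst"; then
        rm -- "$src"
    else
        echo "verified_move: $dst differs from $src, keeping the original" >&2
        return 1
    fi
}
```

With this in place, the workaround would be to run `hadd source/finalrootfile.root source/*.root` first and then `verified_move source/finalrootfile.root final/finalrootfile.root`.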

Dear Michi,

Do you still have a set of small files reproducing the problem?
Without a reproducer it will be basically impossible to debug.

G Ganis