PROOF sub-merging bug?

Hi Rooters (PROOFers?),

Here at SLAC we have a small PROOF cluster (8 machines, 98 cores). When the number of histograms gets large, I have been seeing semi-random worker crashes, as well as crashes during a very slow merging process. On the theory that the slow merging might be part of the issue, Shuwei from BNL suggested enabling sub-merging. Sub-merging seems to work fine with histograms/objects in the top-level directory of the ROOT file, but as soon as I add a TDirectoryFile object to the output file (I am using TProofOutputFile for output), merging with sub-merging enabled seg-faults with this type of error:

===========================================================
#5 0x00002ab15eb6fa43 in TDirectoryFile::Get(char const*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libRIO.so
#6 0x00002ab15eb6acf1 in TDirectoryFile::GetDirectory(char const*, bool, char const*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libRIO.so
#7 0x00002ab16053d8e4 in TFileMerger::MergeRecursive(TDirectory*, TList*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#8 0x00002ab16053e27f in TFileMerger::MergeRecursive(TDirectory*, TList*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#9 0x00002ab16053d042 in TFileMerger::Merge(bool) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#10 0x00002ab160569fd8 in TProofPlayerRemote::MergeOutputFiles() ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#11 0x00002ab16056f287 in TProofPlayerLite::Finalize(bool, bool) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#12 0x00002ab160570195 in TProofPlayerLite::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#13 0x00002ab15f6dc437 in TProofLite::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProof.so

Am I doing something wrong, or is this a bug? We are using ROOT 5.28a.

Thanks,

Bart

I should probably note that merging with TDirectoryFile objects works fine with sub-merging off. It also does not matter whether the TDirectoryFile is empty or not: with sub-merging on, it always crashes.

Hi,

No, it is not a known problem, but sub-merging is still somewhat experimental.
Unfortunately I am not currently at work, but I will have a look as soon as I am
back in the middle of next week.

G. Ganis

Hi,

Update:
I think I managed to reproduce the problem, but I have not yet understood what goes wrong.
I hope to have more news soon.

G. Ganis

Excellent, thanks for looking into this.

Hello,

Any updates on this?

-Bart

Hi,

I am sorry for the delay; I had some other things to follow up on and could not work much on this last week.
The solution is a bit tricky. It has to do with the fact that sub-mergers need a temporary set of intermediate files, and this was not handled correctly (even in cases where there was no crash).
I think I am close to having the fix, and I am confident I will be able to commit it later this afternoon (CET).

G. Ganis
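
For readers unfamiliar with sub-merging, the general idea of a two-stage merge with intermediate results can be sketched as below. This is only a toy illustration in plain C++, not ROOT's TFileMerger or PROOF code: the `Output` type, `mergeOutputs` and `mergeWithSubMergers` are hypothetical names, and a map from object path to a count stands in for a worker's output file. Keys such as "dir/h_pt" mimic objects stored in a subdirectory.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one worker's output file: object path -> entry count.
// (Hypothetical type for illustration; real outputs are ROOT files.)
using Output = std::map<std::string, long>;

// Merge several outputs by summing the entries that share the same path.
Output mergeOutputs(const std::vector<Output> &inputs)
{
    Output merged;
    for (const auto &in : inputs)
        for (const auto &kv : in)
            merged[kv.first] += kv.second;
    return merged;
}

// Two-stage merge: distribute the worker outputs over nSub groups, merge
// each group into an intermediate result, then merge the intermediates.
Output mergeWithSubMergers(const std::vector<Output> &workers, std::size_t nSub)
{
    if (nSub == 0) nSub = 1;  // fall back to a single merging stage

    std::vector<std::vector<Output>> groups(nSub);
    for (std::size_t i = 0; i < workers.size(); ++i)
        groups[i % nSub].push_back(workers[i]);

    std::vector<Output> intermediates;  // the "temporary intermediate files"
    for (const auto &group : groups)
        intermediates.push_back(mergeOutputs(group));

    return mergeOutputs(intermediates);
}
```

The final result should be the same as merging all worker outputs in one pass, for top-level objects and for objects in subdirectories alike; the intermediate results exist only to spread the work over several sub-mergers.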

Awesome. Is there any potential to have a 5.28a patch for this or would we just have to use the trunk/wait for the next tag?

Hi,

I have just uploaded a fix into the trunk and 5-28-00-patches. I hope you will be able to try it.

Since it is in 5-28-00-patches it will appear in the next tag on the branch, i.e. 5-28-00e, which will probably appear in the coming weeks. But we do not modify existing tags.

G. Ganis

Understood, that is great. Thanks a lot, and we will certainly try out 5.28e.

The problem is certainly resolved for 5-30, thanks.

Looking at the code, when the option “ML” is used for merging with sub-mergers, no check is performed to see whether the files are already local, right? I’m thinking of situations where multiple worker cores share a disk array. In those cases one would want to skip the useless local copy of a file that is already local, but still copy the partially merged files to the master for final merging. Maybe such a check could be added?
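
To make the suggestion concrete, a locality check of this kind might look roughly like the sketch below. It is plain C++ and purely illustrative: `hostOf` and `needsLocalCopy` are hypothetical helpers, not existing ROOT functions, and the sketch assumes that file locations are given either as plain paths or as URLs of the form "protocol://host[:port]//path", and that comparing host names is enough to decide locality.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ROOT): extract the host from a URL such
// as "root://user@host.domain:1094//path/file.root". Returns an empty
// string for plain paths and for "file:///path" URLs, which carry no host.
std::string hostOf(const std::string &url)
{
    const std::string sep = "://";
    const std::size_t p = url.find(sep);
    if (p == std::string::npos) return "";  // plain path, no host part

    const std::size_t start = p + sep.size();
    const std::size_t end = url.find('/', start);
    std::string authority = url.substr(
        start, end == std::string::npos ? std::string::npos : end - start);

    // Drop an optional "user@" prefix and an optional ":port" suffix.
    const std::size_t at = authority.find('@');
    if (at != std::string::npos) authority.erase(0, at + 1);
    const std::size_t colon = authority.find(':');
    if (colon != std::string::npos) authority.erase(colon);
    return authority;
}

// Hypothetical check: a copy is needed only when the file lives on a host
// other than the one the sub-merger runs on (assumption: host names are
// directly comparable, e.g. both fully qualified).
bool needsLocalCopy(const std::string &url, const std::string &localHost)
{
    const std::string host = hostOf(url);
    return !host.empty() && host != "localhost" && host != localHost;
}
```

With such a check, a sub-merger running on wn01.example.org (a made-up host name) would skip the copy for "root://wn01.example.org//data/out.root" and for "/data/out.root", but would still copy "root://wn02.example.org//data/out.root". A real implementation would probably also need to handle host aliases and shared file systems mounted under different paths.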