PROOF merge crashes with more workers but is OK with fewer workers

yesw2000 · September 28, 2010, 2:32am

Hi,

We at BNL have run into a strange problem: a PROOF job crashes during the merging stage with more workers (say, 50) but runs fine with fewer (say, 25). The number of workers at which the job crashes varies from run to run. I monitored the memory consumption on the master node and found that the job reached 2GB just before the crash. The same job ran fine on PROOF-Lite with any number of workers, even though the memory used by PROOF-Lite went beyond 2.6GB. I just ran PROOF in Valgrind mode and got the following messages:

==1151== Conditional jump or move depends on uninitialised value(s)
==1151== at 0x72286D0: (within /usr/lib64/libz.so.1.2.3)
==1151== by 0x7229C4A: (within /usr/lib64/libz.so.1.2.3)
==1151== by 0x7228C6F: deflate (in /usr/lib64/libz.so.1.2.3)
==1151== by 0x50B5216: R__zip (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x8C5A46B: TKey::TKey(TObject const*, char const*, int, TDirectory*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
==1151== by 0x8C414C8: TFile::CreateKey(TDirectory*, TObject const*, char const*, int) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
==1151== by 0x8C380F7: TDirectoryFile::WriteTObject(TObject const*, char const*, char const*, int) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
==1151== by 0x4FFE5E7: TObject::Write(char const*, int, int) const (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0xC08A1D3: TFileMerger::MergeRecursive(TDirectory*, TList*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so)
==1151== by 0xC088F61: TFileMerger::Merge(bool) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so)
==1151== by 0xC0B0A17: TProofPlayerRemote::MergeOutputFiles() (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so)
==1151== by 0xC0B1101: TProofPlayerRemote::Finalize(bool, bool) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libProofPlayer.so)
**1151** new/new[] failed and should throw an exception, but Valgrind
cannot throw exceptions and so is aborting instead. Sorry.
==1151== at 0x4C20AE3: VALGRIND_PRINTF_BACKTRACE (valgrind.h:2366)
==1151== by 0x4C20CF6: operator new(unsigned long) (vg_replace_malloc.c:199)
==1151== by 0x50512C2: TArrayF::Set(int) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x5050E9E: TArrayF::Streamer(TBuffer&) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x53BFBF4: TArrayF::StreamerNVirtual(TBuffer&) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x53A3DA5: _ZL18G__G__Cont_98_0_28P8G__valuePKcP8G__parami (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x58E148B: Cint::G__CallFunc::Execute(void*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCint.so)
==1151== by 0x5070AEB: TCint::CallFunc_Exec(void*, void*) const (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x5098AD4: TMethodCall::Execute(void*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x50A2A03: TStreamerBase::ReadBuffer(TBuffer&, char*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x8CD37D8: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, int, int, int, int) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
==1151== by 0x8C2BDC6: TBufferFile::ReadClassBuffer(TClass const*, void*, int, unsigned, unsigned, TClass const*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
--1151-- Discarding syms at 0xA84A000-0xAA55000 in /lib64/libnss_files-2.5.so due to munmap()
==1151==
==1151== ERROR SUMMARY: 4553 errors from 8 contexts (suppressed: 5 from 1)
==1151==
==1151== 559 errors in context 1 of 8:
==1151== Conditional jump or move depends on uninitialised value(s)
==1151== at 0x72286C2: (within /usr/lib64/libz.so.1.2.3)
==1151== by 0x7229C4A: (within /usr/lib64/libz.so.1.2.3)
==1151== by 0x7228C6F: deflate (in /usr/lib64/libz.so.1.2.3)
==1151== by 0x50B5216: R__zip (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libCore.so)
==1151== by 0x8C5A46B: TKey::TKey(TObject const*, char const*, int, TDirectory*) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)
==1151== by 0x8C414C8: TFile::CreateKey(TDirectory*, TObject const*, char const*, int) (in /afs/usatlas.bnl.gov/cernsw/lcg/app/releases/ROOT/5.27.04/x86_64-slc5-gcc43-opt/root/lib/libRIO.so)

According to the error message, it looks like a new object could not be created. Any ideas? Many users at BNL are experiencing the same problem, and your kind assistance in resolving it would be greatly appreciated.

The ROOT version on the TProof server in use is 5.27.04, and the OS is SL5.

–Shuwei

ganis · September 30, 2010, 11:05am

Dear Shuwei,

Your Valgrind output comes from an optimized build of ROOT, so the only really meaningful information is the fact that ‘new’ fails. This points to a memory issue.
What objects are you merging? Do you expect large memory usage from the objects that you merge?
Depending on the object type, TFileMerger may load everything in memory (histograms, for example), so you have to multiply the expected output size by N_wrks …
If I understand correctly, it is the master that crashes, and you seem to have a 2GB memory limit on the master machine (a ulimit setting?).
How many workers did you run in PROOF-Lite?

Gerri

yesw2000 · September 30, 2010, 1:19pm

Hi Gerri,

Thanks for your reply.

There are many 2D histograms to merge via TProofOutputFile. Currently we have 96 workers, and the job starts to crash at about 40 workers. But this number varies from run to run: sometimes it crashes at 50 workers, sometimes at 30.

In PROOF-Lite mode on the master node, I tried “workers=96”, the same number as on PROOF, and the job still ran well, even though the memory consumption of root.exe reached 2.6GB, beyond 2GB.

The fact that the merge runs fine on PROOF-Lite but crashes on PROOF puzzles me. What is the difference in file merging between PROOF-Lite and PROOF? How can I investigate this problem further?

–Shuwei

ganis · September 30, 2010, 4:17pm

Hi Shuwei,

My interpretation is that the master inherits a memory limit which is not active in the PROOF-Lite session. Perhaps I am wrong, but we do not yet have all the elements to rule that out.

So, in my opinion, the first thing to do is to take the size of the output that you get when it works and do a simple calculation to estimate the transient memory needed as a function of the number of workers. This is to see if you can roughly understand the number of workers (about 40) at which it crashes.

Then you should check the xrootd or xproofd startup scripts for any ulimit setting. Do you have access to those scripts? Do you have access to the xrootd or xproofd configuration file? If yes, could you post these files?

Gerri

yesw2000 · September 30, 2010, 5:15pm

Hi Gerri

The proofd configuration is:

And the xrootd configuration is:

The final merged (and good) output root file is only 640 KB, but the compression factor is about 200.

–Shuwei
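
For readers following along, here is a rough sketch of the estimate Gerri asked for, using the two numbers quoted above (640 KB merged file, compression factor of about 200). It rests on assumptions that the thread does not confirm: that each worker's partial output expands to about the same uncompressed size as the final merged file, and that all of them sit in the master's memory at once. The function names are made up for this illustration.

```python
def estimate_merge_memory(merged_file_bytes, compression_factor, n_workers):
    """Crude upper bound on the master's transient memory during the merge.

    Assumption (not confirmed in the thread): every worker's partial output
    expands to the same uncompressed size as the final merged file, and all
    of them are held in memory at the same time.
    """
    per_worker_bytes = merged_file_bytes * compression_factor
    return per_worker_bytes * n_workers


def max_workers_under_limit(merged_file_bytes, compression_factor, limit_bytes):
    """Largest worker count whose estimated footprint still fits under the limit."""
    per_worker_bytes = merged_file_bytes * compression_factor
    return limit_bytes // per_worker_bytes


KIB = 1024
GIB = 1024 ** 3

merged_size = 640 * KIB   # final merged file size, as reported above
compression = 200         # approximate compression factor, as reported above
limit = 2 * GIB           # memory level at which the master was seen to crash

for n in (25, 40, 50, 96):
    need = estimate_merge_memory(merged_size, compression, n)
    print(f"{n:3d} workers -> about {need / GIB:.1f} GiB")

print("workers fitting under the limit:",
      max_workers_under_limit(merged_size, compression, limit))
```

Under these assumptions one worker's share is about 125 MiB, and the bound is reached at 16 workers. That is more pessimistic than the 30 to 50 workers actually observed, which is plausible if not every object is resident at the same moment. Even so, the estimate suggests that a few dozen workers can push the master past 2GB despite the small output file.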

yesw2000 · October 11, 2010, 8:48pm

Hi,

It finally turned out that a 2GB limit on virtual memory and a 1GB limit on resident memory were set in /etc/rc.d/init.d/xproofd. After removing those limits, the crash during merging is gone.

–Shuwei
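
For anyone hitting the same symptom, one way to see which memory limits a process has actually inherited is to query them from inside that process. The sketch below uses Python's standard `resource` module. It assumes a Linux host, and that the limits set in the init script are the usual address-space and resident-set limits (the ones `ulimit -v` and `ulimit -m` control). It only reports on the process it runs in, so it would need to be started from the same environment as the daemon to be meaningful.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def current_memory_limits():
    """Return the (soft, hard) memory limits inherited by this process."""
    return {
        "virtual memory (RLIMIT_AS)": resource.getrlimit(resource.RLIMIT_AS),
        "resident memory (RLIMIT_RSS)": resource.getrlimit(resource.RLIMIT_RSS),
    }


for name, (soft, hard) in current_memory_limits().items():
    print(f"{name}: soft={describe_limit(soft)}, hard={describe_limit(hard)}")
```

A finite soft limit on virtual memory close to 2GB would match the behaviour described in this thread, where allocations on the master started failing at about that level.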