0: caught exception triggered by signal '1' while merging ob

Hi,

What does the message in the subject really mean?

Could it signal a memory problem? It appears only in the merging phase, and only past a given number of events. I mean it works for, let’s say, 10 datasets, but not for 11.

How can I debug this ?

Thanks,

Yes, this may (is likely to) indicate a memory problem.
It depends on what your output is.

The best option would be to run the master under valgrind, but you need a debug build for that. See root.cern.ch/drupal/content/runn … y-valgrind .
Otherwise, you can check the size of your output when it works (for 5 and 10 datasets, for example) to understand your needs when working with larger datasets. That may give a hint of what is going on.

Gerri

Hi Gerri,

I realize the subject of my post was truncated, it’s :

Error in TXProofServ::HandleException: caugth exception triggered by signal ‘1’ while merging object ‘PROOF_TOutputListSelectorDataMap_object’

My output is only histograms. I also tried without any output, and I get the same issue… I will try to see if I can run valgrind, though.

Regards

Hi Gerri,

Unfortunately the AF(s) I have access to are not using root in debug mode, so I’m afraid I cannot use valgrind.

What is the PROOF_TOutputListSelectorDataMap_object? Can it optionally be deactivated, or is it absolutely vital to have it?

Thanks,

It is not vital, but it cannot be deactivated.
However, it would be very strange if it created memory problems.
How many histograms do you have in your output?

Gerri

Hi Gerri,

For the tests I’m doing now, I have nothing in the output. I run only an empty task (AliAnalysisTaskBaseLine to name it).

Well,
There is something wrong with the AliAnalysis framework.
An empty, non-framework PROOF task on CAF does not have this problem. I ran one again yesterday.
I think we absolutely need a full-debug version of AliRoot and ROOT on the AFs, and to do some thorough valgrind runs under various conditions.

Cheers, Gerri

Hi Gerri,

May I ask how you tested it on CAF ? Was it on Alice data ?

I tried with a “simple” selector with as few Alice specifics as possible (I still need AliAODEvent to be able to read AOD data, though), and I still get the issue.

I’m attaching my selector here in case (quite probable :wink: ) I’m doing something obviously stupid.
EventNumberChecks.cxx (3.58 KB)
EventNumberChecks.h (788 Bytes)

No, it was a non-data selector with many histograms.

Ok, I’ll try this one.
Btw, from a first look I see that you set ClassDef(…,4): why 4? TSelectors are not streamable classes, so the version number must be 0.

Gerri

Quick question: which AOD datasets are you using? Are they all equivalent?

Gerri

Hi Gerri,

I’m using (muon) AODs : /alice/data/LHC11h__p1_muon_AOD_AliMuons (which are lighter than std aods, and hence would allow, in principle, to analyse the full PbPb period in one go)

Concerning the ClassDef, I thought the TSelector was streamed in some way, that’s why :wink:

BTW, we now have a root debug version available on CAF and SAF. On SAF we have the valgrind version that comes with SL5.3 (valgrind 3.2.1), i.e. pretty old. Must we get a more recent one ?

Hi,

I wanted to restart debugging this (yes, I know, quite a long time after the fact…) and discovered the problem has actually disappeared. I don’t know if PROOF improved or the Alice analysis framework did (or both?), but I’m now a happy camper.

Regards,

Hi,

With which version did you retry?

There have been some recent fixes (thanks to B. Butler) which could affect your issues, but for the moment they are only in the trunk; I plan to port them to the relevant patch branches in the coming days.

Gerri

Hi Ganis,

I realize I’ve never answered this one… Which is kind of unfortunate because it seems the problem reappeared somehow. It’s been working for a while (with root 5.33) and we just switched to root 5.34/02 and we’re back to the issue… (tried only on SAF so far, will try on CAF as well asap)

Any idea what could be the reason ?

Thanks

Hello,

Looks like I’m facing the same problem.

PROOF-Lite workers randomly crash during TProofOutputFile merging.

~/.proof/newbean-bean-workdir/last-lite-session/worker-0.2.log:

22:16:53 28123 Wrk-0.2 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 448736 virtual 111476 resident event 1000
22:16:54 28123 Wrk-0.2 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 448736 virtual 111492 resident event 1000
22:16:54 28123 Wrk-0.2 | *** Break ***: segmentation violation



===========================================================
There was a crash (kSigSegmentationViolation).
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f850eb6745e in __libc_waitpid (pid=<value optimized out>, stat_loc=0x7fff990e7c2c, options=<value optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:32
#1  0x00007f850eafca99 in do_system (line=<value optimized out>) at ../sysdeps/posix/system.c:149
#2  0x00007f850fc6fbc6 in TUnixSystem::Exec (this=0x168ef60, shellcmd=0x34ec000 "/opt/root_trunk/etc/gdb-backtrace.sh 28123 1>&2") at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:2067
#3  0x00007f850fc704b6 in TUnixSystem::StackTrace (this=0x168ef60) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:2315
#4  0x00007f850fc6dde5 in TUnixSystem::DispatchSignals (this=0x168ef60, sig=kSigSegmentationViolation) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:1198
#5  0x00007f850fc6bb89 in SigHandler (sig=kSigSegmentationViolation) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:356
#6  0x00007f850fc73bac in sighandler (sig=11) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:3510
#7  <signal handler called>
#8  0x00007f850f2d9248 in __dynamic_cast () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f850c36c79b in TFile::Close (this=0x1f71be0, option=0x7f8507ed74a0 "") at /opt/root_trunk/io/io/src/TFile.cxx:885
#10 0x00007f8507ebf148 in ReadDst::SlaveTerminate (this=0x1f86880) at /home/boger/newbean/bean/BeanCore/ReadDst.cxx:346
#11 0x00007f8507bfc0b9 in TProofPlayer::Process (this=0x1f6fac0, dset=0x1f1b260, selector_file=0x7fff990eaed9 "ReadDst", option=0x7fff990eaeb9 "", nentries=-1, first=-1) at /opt/root_trunk/proof/proofplayer/src/TProofPlayer.cxx:1381
#12 0x00007f850b0e0292 in TProofServ::HandleProcess (this=0x1bf6550, mess=0x1cf8580, slb=0x0) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:3974
#13 0x00007f850b0d2b1a in TProofServ::HandleSocketInput (this=0x1bf6550, mess=0x1cf8580, all=true) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:1629
#14 0x00007f850b0d11de in TProofServ::HandleSocketInput (this=0x1bf6550) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:1352
#15 0x00007f850b0f778b in TProofServLiteInputHandler::Notify (this=0x1bf9990) at /opt/root_trunk/proof/proof/src/TProofServLite.cxx:163
#16 0x00007f850b0fa7e5 in TProofServLiteInputHandler::ReadNotify (this=0x1bf9990) at /opt/root_trunk/proof/proof/src/TProofServLite.cxx:155
#17 0x00007f850fc6e160 in TUnixSystem::CheckDescriptors (this=0x168ef60) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:1293
#18 0x00007f850fc6d3d4 in TUnixSystem::DispatchOneEvent (this=0x168ef60, pendingOnly=false) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:1007
#19 0x00007f850fbc3af1 in TSystem::InnerLoop (this=0x168ef60) at /opt/root_trunk/core/base/src/TSystem.cxx:408
#20 0x00007f850fbc3872 in TSystem::Run (this=0x168ef60) at /opt/root_trunk/core/base/src/TSystem.cxx:358
#21 0x00007f850fb4969e in TApplication::Run (this=0x1bf6550, retrn=false) at /opt/root_trunk/core/base/src/TApplication.cxx:1044
#22 0x00007f850b0d724f in TProofServ::Run (this=0x1bf6550, retrn=false) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:2526
#23 0x00000000004027af in main (argc=6, argv=0x7fff990ec5b8) at /opt/root_trunk/main/src/pmain.cxx:325
===========================================================



The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#8  0x00007f850f2d9248 in __dynamic_cast () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f850c36c79b in TFile::Close (this=0x1f71be0, option=0x7f8507ed74a0 "") at /opt/root_trunk/io/io/src/TFile.cxx:885
#10 0x00007f8507ebf148 in ReadDst::SlaveTerminate (this=0x1f86880) at /home/boger/newbean/bean/BeanCore/ReadDst.cxx:346
#11 0x00007f8507bfc0b9 in TProofPlayer::Process (this=0x1f6fac0, dset=0x1f1b260, selector_file=0x7fff990eaed9 "ReadDst", option=0x7fff990eaeb9 "", nentries=-1, first=-1) at /opt/root_trunk/proof/proofplayer/src/TProofPlayer.cxx:1381
#12 0x00007f850b0e0292 in TProofServ::HandleProcess (this=0x1bf6550, mess=0x1cf8580, slb=0x0) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:3974
#13 0x00007f850b0d2b1a in TProofServ::HandleSocketInput (this=0x1bf6550, mess=0x1cf8580, all=true) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:1629
#14 0x00007f850b0d11de in TProofServ::HandleSocketInput (this=0x1bf6550) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:1352
#15 0x00007f850b0f778b in TProofServLiteInputHandler::Notify (this=0x1bf9990) at /opt/root_trunk/proof/proof/src/TProofServLite.cxx:163
#16 0x00007f850b0fa7e5 in TProofServLiteInputHandler::ReadNotify (this=0x1bf9990) at /opt/root_trunk/proof/proof/src/TProofServLite.cxx:155
#17 0x00007f850fc6e160 in TUnixSystem::CheckDescriptors (this=0x168ef60) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:1293
#18 0x00007f850fc6d3d4 in TUnixSystem::DispatchOneEvent (this=0x168ef60, pendingOnly=false) at /opt/root_trunk/core/unix/src/TUnixSystem.cxx:1007
#19 0x00007f850fbc3af1 in TSystem::InnerLoop (this=0x168ef60) at /opt/root_trunk/core/base/src/TSystem.cxx:408
#20 0x00007f850fbc3872 in TSystem::Run (this=0x168ef60) at /opt/root_trunk/core/base/src/TSystem.cxx:358
#21 0x00007f850fb4969e in TApplication::Run (this=0x1bf6550, retrn=false) at /opt/root_trunk/core/base/src/TApplication.cxx:1044
#22 0x00007f850b0d724f in TProofServ::Run (this=0x1bf6550, retrn=false) at /opt/root_trunk/proof/proof/src/TProofServ.cxx:2526
#23 0x00000000004027af in main (argc=6, argv=0x7fff990ec5b8) at /opt/root_trunk/main/src/pmain.cxx:325
===========================================================


22:16:55 28123 Wrk-0.2 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' while processing dset:'TDSet:Event', file:'/home/boger/data/mc/662/gen/alld_inc/alld_inc_9_10_4.dst' - check logs for possible stacktrace - last event: 999

Exactly the same code works flawlessly with ROOT v532 but started to crash with both ROOT v534-3 and trunk (47040). If no TProofOutputFile is used, everything works fine.

The exact stack trace is somewhat random, sometimes the segfault happens in the analysis code, but sometimes also in TCollection::GarbageCollect, TFile::Close() or even in ~TTree().
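
To see whether memory grows before such a crash, the CheckMemUsage lines in the worker log above can be extracted and compared. The sketch below is one possible way to do that in plain C++; the struct and function names are hypothetical, and reading the line as "<number> virtual <number> resident" is an assumption based only on the two log lines shown here.

```cpp
#include <sstream>
#include <string>

// Values reported by one TProofPlayerSlave::CheckMemUsage log line.
struct MemSample {
   long virtualMem;   // number printed before the word "virtual"
   long residentMem;  // number printed before the word "resident"
   long event;        // number printed after the word "event"
};

// Parse a line such as
//   "... <TProofPlayerSlave::CheckMemUsage>: Memory 448736 virtual 111476 resident event 1000"
// Returns false (and leaves 'out' untouched) if the line does not match.
bool parseMemUsage(const std::string &line, MemSample &out)
{
   const std::string tag = "CheckMemUsage>: Memory ";
   const std::string::size_type pos = line.find(tag);
   if (pos == std::string::npos)
      return false;

   std::istringstream in(line.substr(pos + tag.size()));
   MemSample s = {0, 0, 0};
   std::string w1, w2, w3;
   // Expected token order: <virt> "virtual" <res> "resident" "event" <n>
   if (!(in >> s.virtualMem >> w1 >> s.residentMem >> w2 >> w3 >> s.event))
      return false;
   if (w1 != "virtual" || w2 != "resident" || w3 != "event")
      return false;

   out = s;
   return true;
}
```

Applied to the two samples in the log above, this gives the same first value (448736) and second values that differ by only 16 (111476 and 111492), which by itself does not suggest runaway memory growth right before the segmentation violation.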

Do you have any ideas?

Hi,

You seem to have a ROOT installation compiled in debug mode: can you run with valgrind?
See root.cern.ch/drupal/content/runn … y-valgrind .
Basically, start PROOF-Lite with

   TProof *p = TProof::Open("workers=2", "valgrind");

and check the .valgrind files in the log window.

Gerri

Sure.

Please find attached the worker log files with and without “PROOF_WRAPPERCMD=valgrind_opts:--leak-check=full” enabled.
bugreport.tar.gz (28 KB)

Hi,

The problem seems to be in the cache.
Are you explicitly enabling the cache?

Can you try by adding a call to TFile::SetCacheRead(0) before the call to Close() in ReadDst::SlaveTerminate?
E.g.

   file->SetCacheRead(0);
   file->Close();                    // <<<---- Current line 346 of ReadDst.cxx

(of course, use the relevant TFile* in place of ‘file’).

Gerri

Hi,

We’re explicitly enabling cache via

      chain.SetCacheSize(TREE_CACHE_SIZE);
      chain.AddBranchToCache("*",kTRUE);

Unfortunately, neither disabling this code nor adding TFile::SetCacheRead(0) seems to help.

Hello,

Do you have any ideas?