TProof crash sometimes occurring when using a compiled TSelector



ROOT Version: 6.22
Platform: lxplus
Compiler: Not Provided


Dear experts,

I am experiencing a crash using TProof (Lite) with a compiled TSelector running on lxplus (ROOT version 6.22).
It is due to a segmentation violation that seems random, in the sense that it may or may not occur.

When it occurs, it is always the same error: it happens after all events have been processed successfully, when the slave nodes should terminate and merging should start. It occurs after the destructor of my TSelector class (called Analyser) has been called; the destructor does nothing in my case.

I would say it occurs around 1 or 2 times out of 10 tries on average, running on exactly the same files.

I use a compiled TSelector; the code is attached to this post: tproof_bug_reproducer.zip (377.2 KB)

The rest of the time the TSelector/TChain process runs fine and there is no such error.

I checked the slave logs: the crash only occurs on one of the nodes.
But I have no idea what could cause this. I don’t think it comes from a memory leak, so I guess there is a TProof setting that I am missing.

If you could have a look, and if you have a hint as to where this problem could come from, it would be great.
Many thanks in advance

Below is the log of the problematic slave node.
The crash only occurs on one of the slave nodes; the other log files are fine.

23:30:19 30333 Wrk-0.0 | Info in <TProofServLite::Setup>: fWorkDir: /afs/cern.ch/user/b/bouquet/work/private/VHbb_branch/VHbb_test_tprocess/output/test_dir_20210321-233016/logs
23:30:19 30333 Wrk-0.0 | Info in <TProofServLite::SetupCommon>:  0 global package directories registered
23:30:20 30333 Wrk-0.0 | Info in <TProofServLite::HandleProcess>: selector obj for 'Analyser' found
23:30:20 30333 Wrk-0.0 | Info in <TProofServLite::HandleProcess>: calling fPlayer->Process() with selector object: Analyser
23:30:20 30333 Wrk-0.0 | Info in <TProofPlayerSlave::AssertSelector>: Processing via TSelector object
23:30:20 30333 Wrk-0.0 | Info in <TEventIter::TEventIter>: fPackets list 'ProcessedPackets_0.0' created
23:30:20 30333 Wrk-0.0 | Info in <TProofPlayerSlave::Process>: save partial results? 0  per-packet? 0

Calling SlaveBegin
m_outputdir_fullpath = ../output
m_analysisLepChannel = 1
23:30:20 30333 Wrk-0.0 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 472048 virtual 142288 resident event 0
23:30:20 30333 Wrk-0.0 | Info in <TEventIterTree::GetTrees>: the tree cache is in learning phase
Initializing branches
23:30:20 30333 Wrk-0.0 | Info in <TProofServLite::RestartComputeTime>: compute time restarted after 0.025289 secs (100 entries)
23:30:24 30333 Wrk-0.0 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 569676 virtual 206232 resident event 547416

Calling SlaveTerminate
Writing histograms
23:30:24 30333 Wrk-0.0 | *** Break ***: segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007fe0eccb14fc in waitpid () from /lib64/libc.so.6
#1  0x00007fe0ecc2efb2 in do_system () from /lib64/libc.so.6
#2  0x00007fe0edcac404 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#3  0x00007fe0edcae09a in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#4  <signal handler called>
#5  0x00007fe0d8a4c0b2 in TTreeReader::~TTreeReader() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libTreePlayer.so.6.22
#6  0x00007fe0d4c497b3 in Analyser::~Analyser() () from /afs/cern.ch/work/b/bouquet/private/VHbb_branch/VHbb_test_tprocess/build/libAnalyser.so
#7  0x00007fe0d7807cf3 in TProofPlayer::~TProofPlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#8  0x00007fe0d7818011 in TProofPlayerSlave::~TProofPlayerSlave() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#9  0x00007fe0dcae15b2 in TProofServ::DeletePlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#10 0x00007fe0dcafa068 in TProofServ::HandleSocketInput(TMessage*, bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#11 0x00007fe0dcae8c8f in TProofServ::HandleSocketInput() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#12 0x00007fe0dcafdae1 in TProofServLiteInputHandler::Notify() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#13 0x00007fe0edcad5b5 in TUnixSystem::CheckDescriptors() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#14 0x00007fe0edcae6fa in TUnixSystem::DispatchOneEvent(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#15 0x00007fe0edbdb3b6 in TSystem::InnerLoop() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#16 0x00007fe0edbdc2a0 in TSystem::Run() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#17 0x00007fe0edb7dc5f in TApplication::Run(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#18 0x000000000040147e in main ()
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00007fe0d8a4c0b2 in TTreeReader::~TTreeReader() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libTreePlayer.so.6.22
#6  0x00007fe0d4c497b3 in Analyser::~Analyser() () from /afs/cern.ch/work/b/bouquet/private/VHbb_branch/VHbb_test_tprocess/build/libAnalyser.so
#7  0x00007fe0d7807cf3 in TProofPlayer::~TProofPlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#8  0x00007fe0d7818011 in TProofPlayerSlave::~TProofPlayerSlave() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#9  0x00007fe0dcae15b2 in TProofServ::DeletePlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#10 0x00007fe0dcafa068 in TProofServ::HandleSocketInput(TMessage*, bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#11 0x00007fe0dcae8c8f in TProofServ::HandleSocketInput() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#12 0x00007fe0dcafdae1 in TProofServLiteInputHandler::Notify() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#13 0x00007fe0edcad5b5 in TUnixSystem::CheckDescriptors() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#14 0x00007fe0edcae6fa in TUnixSystem::DispatchOneEvent(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#15 0x00007fe0edbdb3b6 in TSystem::InnerLoop() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#16 0x00007fe0edbdc2a0 in TSystem::Run() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#17 0x00007fe0edb7dc5f in TApplication::Run(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#18 0x000000000040147e in main ()
===========================================================


23:30:25 30333 Wrk-0.0 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' while processing dset:'TDSet:Nominal', file:'/eos/home-b/bouquet/VHbbcc_results/VHbb_1L_overlap_v1//Reader_1L_33-05_e_InclusiveMerge_D_D/fetch/data-MVATree//qqWlvHbbJ_PwPy8MINLO-6.root' - check logs for possible stacktrace - last event: 195859

I think @ganis can help you.

Just to add that I made another test: I ran with the TProof debug level set to 4, and here is a comparison of the log files of a node for a working run and a non-working run:

not_working_debug_level4.txt (382.0 KB) working_debug_level4.txt (404.3 KB)

In the non-working run it is node 0.9 that failed; here is the end of its log file:
log_node_0.9.txt (8.8 KB)

As you can see, the segmentation violation occurs after the output results are sent to the master node:

10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::Process>: Call Process(235030)
10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::Process>: Call Process(235031)
TProofProgressStatus:0.9: Ents:(234413,11192), Bytes:77971118, Calls:251, Learn:0 s, Proc:(11.1,0.577) s, CPU:4.92 s
TProofProgressStatus:: Ents:(234413,11192), Bytes:77971118, Calls:251, Learn:0 s, Proc:(11.7,0.675) s, CPU:5.24 s
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::GetNextPacket>: cacheSize: 30000000, learnent: 100
10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::SavePartialResults>: partial result saving disabled
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::GetNextPacket>: Done
10:51:59 13684 Wrk-0.9 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 574488 virtual 210376 resident event 234413
10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::Process>: 234413 events processed
10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::SavePartialResults>: partial result saving disabled
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fTree'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fCurrent'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fPrevious'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fDirector'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `m_branchReader'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fInput'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: considering data member `fOutput'
10:51:59 13684 Wrk-0.9 | Info in <TOutputListSelectorDataMap::Init()>: Found 7 data members.
10:51:59 13684 Wrk-0.9 | Info in <TProofPlayerSlave::Process>: Call SlaveTerminate()

Calling SlaveTerminate
Writing histograms
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::TProofServ::Handleprocess>: worker 0.9 has finished processing with 3 objects in output list
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::HandleProcess>: controlled mode: worker 0.9 has finished, sizes sent to master
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::HandleProcess>: done
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::HandleSocketInput>: got type 1058 from 'unix'
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::HandleSocketInput>: processing message type 1058 from 'unix'
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::HandleSocketInput:kPROOF_SENDOUTPUT>: worker was asked to send output to master
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::SendResults>: enter
10:51:59 13684 Wrk-0.9 | Info in <TProofServLite::SendResults>: done
10:51:59 13684 Wrk-0.9 | *** Break ***: segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f55306874fc in waitpid () from /lib64/libc.so.6
#1  0x00007f5530604fb2 in do_system () from /lib64/libc.so.6
#2  0x00007f5531682404 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#3  0x00007f553168409a in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#4  <signal handler called>
#5  0x00007f551c4220b2 in TTreeReader::~TTreeReader() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libTreePlayer.so.6.22
#6  0x00007f551861f7b3 in Analyser::~Analyser() () from /afs/cern.ch/work/b/bouquet/private/VHbb_branch/VHbb_test_tprocess/build/libAnalyser.so
#7  0x00007f551b1ddcf3 in TProofPlayer::~TProofPlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#8  0x00007f551b1ee011 in TProofPlayerSlave::~TProofPlayerSlave() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#9  0x00007f55204b75b2 in TProofServ::DeletePlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#10 0x00007f55204d0068 in TProofServ::HandleSocketInput(TMessage*, bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#11 0x00007f55204bec8f in TProofServ::HandleSocketInput() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#12 0x00007f55204d3ae1 in TProofServLiteInputHandler::Notify() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#13 0x00007f55316835b5 in TUnixSystem::CheckDescriptors() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#14 0x00007f55316846fa in TUnixSystem::DispatchOneEvent(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#15 0x00007f55315b13b6 in TSystem::InnerLoop() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#16 0x00007f55315b22a0 in TSystem::Run() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#17 0x00007f5531553c5f in TApplication::Run(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#18 0x000000000040147e in main ()
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00007f551c4220b2 in TTreeReader::~TTreeReader() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libTreePlayer.so.6.22
#6  0x00007f551861f7b3 in Analyser::~Analyser() () from /afs/cern.ch/work/b/bouquet/private/VHbb_branch/VHbb_test_tprocess/build/libAnalyser.so
#7  0x00007f551b1ddcf3 in TProofPlayer::~TProofPlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#8  0x00007f551b1ee011 in TProofPlayerSlave::~TProofPlayerSlave() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProofPlayer.so.6.22
#9  0x00007f55204b75b2 in TProofServ::DeletePlayer() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#10 0x00007f55204d0068 in TProofServ::HandleSocketInput(TMessage*, bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#11 0x00007f55204bec8f in TProofServ::HandleSocketInput() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#12 0x00007f55204d3ae1 in TProofServLiteInputHandler::Notify() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libProof.so.6.22.08
#13 0x00007f55316835b5 in TUnixSystem::CheckDescriptors() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#14 0x00007f55316846fa in TUnixSystem::DispatchOneEvent(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#15 0x00007f55315b13b6 in TSystem::InnerLoop() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#16 0x00007f55315b22a0 in TSystem::Run() () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#17 0x00007f5531553c5f in TApplication::Run(bool) () from /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.08/x86_64-centos7-gcc48-opt/lib/libCore.so.6.22
#18 0x000000000040147e in main ()
===========================================================


10:52:01 13684 Wrk-0.9 | Error in <TProofServLite::HandleException>: caugth exception triggered by signal '1' while processing dset:'TDSet:Nominal', file:'/eos/home-b/bouquet/VHbbcc_results/VHbb_1L_overlap_v1//Reader_1L_33-05_e_InclusiveMerge_D_D/fetch/data-MVATree//qqWlvHbbJ_PwPy8MINLO-4.root' - check logs for possible stacktrace - last event: 235031
10:52:01 13684 Wrk-0.9 | Info in <TProofServLite::SendAsynMessage>: 0.9: caught exception triggered by signal '1' while processing dset:'TDSet:Nominal', file:'/eos/home-b/bouquet/VHbbcc_results/VHbb_1L_overlap_v1//Reader_1L_33-05_e_InclusiveMerge_D_D/fetch/data-MVATree//qqWlvHbbJ_PwPy8MINLO-4.root' - check logs for possible stacktrace - last event: 235031

I don’t understand, though, why it mentions m_branchReader, which is a member of my Analyser class: I don’t want it to be passed to the output, and I don’t pass it to the output.

Hi,
Difficult to say, but it looks related to the deletion of the TTreeReader.
Can you try, if not done already, commenting out delete m_branchReader in SlaveTerminate?
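
For illustration, here is a minimal sketch of the kind of double deletion that commenting out the delete would rule out. It is plain C++ with hypothetical stand-in types (FakeReader, SafeOwner, RawOwner), not the actual Analyser code or any ROOT class, and it assumes the reader is released in SlaveTerminate and then destroyed again when the selector itself is deleted, which may not be what happens here.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the branch reader; counts live instances.
struct FakeReader {
    static int& alive() { static int n = 0; return n; }
    FakeReader() { ++alive(); }
    ~FakeReader() { --alive(); }
};

// Hypothetical selector-like class owning its reader through a unique_ptr.
// reset() destroys the reader and nulls the pointer, so the implicit
// destructor cannot delete the same object a second time.
class SafeOwner {
public:
    SafeOwner() : m_reader(new FakeReader) {}
    void SlaveTerminate() { m_reader.reset(); }
private:
    std::unique_ptr<FakeReader> m_reader;
};

// Same idea with a raw pointer: the delete must be paired with nulling the
// pointer, otherwise the delete in the destructor is undefined behaviour.
class RawOwner {
public:
    RawOwner() : m_reader(new FakeReader) {}
    ~RawOwner() { delete m_reader; }  // deleting a null pointer is a no-op
    RawOwner(const RawOwner&) = delete;
    RawOwner& operator=(const RawOwner&) = delete;
    void SlaveTerminate() { delete m_reader; m_reader = nullptr; }
private:
    FakeReader* m_reader;
};

// Runs both owners through "terminate, then destroy" and returns how many
// readers are still alive afterwards (0 if each was destroyed exactly once).
int LiveReadersAfterLifecycle() {
    {
        SafeOwner a;
        RawOwner b;
        assert(FakeReader::alive() == 2);
        a.SlaveTerminate();
        b.SlaveTerminate();
    }  // destructors run here
    return FakeReader::alive();
}
```

If the crash persists with the delete removed, this particular pattern is probably not the cause.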

G Ganis

Hi @ganis,

I already tried that, but I still get the crash.

One thing that is troubling me is that my
m_branchReader object, an instance of a class I implemented, ends up in the TOutputListSelectorDataMap
object. It seems to me that ROOT/Cling interprets it as something to merge and maybe delete on the master?
Maybe it tries to delete that object even though it is not defined on the master node.
Could this explain the segmentation violation at the merging step?

Can you run this without Proof? Does it crash then, too? If not, can you run it outside Proof but with valgrind?

Hi @Axel , @ganis,

I tried several times (around 30 times) without PROOF and I do not observe any crash.
With PROOF the crash occurs one or two times out of 10 tries.

So I think the crash only occurs when using PROOF.
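
As a rough cross-check of that conclusion, the sketch below computes the probability of seeing no crash at all in 30 runs if the job crashed at the same rate without PROOF as with it. It assumes independent runs and a constant per-run crash probability of 0.1 to 0.2 (taken from the 1 or 2 crashes out of 10 tries quoted above); the function name is only illustrative.

```cpp
#include <cassert>
#include <cmath>

// Probability of observing no crash in `runs` independent runs,
// given an (assumed constant) per-run crash probability `p`.
double ProbNoCrash(double p, int runs) {
    assert(p >= 0.0 && p <= 1.0 && runs >= 0);
    return std::pow(1.0 - p, runs);
}
```

Under these assumptions, ProbNoCrash(0.1, 30) is about 0.042 and ProbNoCrash(0.2, 30) is about 0.0012, so 30 clean runs would be unlikely if the crash rate were the same without PROOF.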

I also tried with only one slave node with PROOF, and the crash still occurs.
When the crash occurs (whether using one or several workers), the message printed on the terminal is

Lite-0: merging output objects ... \ (1 workers still sending)    

And then it crashes.
Could it be that there is a communication problem sometimes occurring between the master and slave nodes?

For example, a status report from the slave node that is sometimes missed by the master node,
so that the crash happens just because the master node missed that status report?

I ran valgrind on the test without PROOF; here is the log
(it is quite a large file, so I put it on my CERNBox; I think you will need to download it:
CERNBox)

Here are the heap and leak summary lines.
I am not familiar with valgrind,
but the number of bytes still in use at exit does not seem huge to me (considering that I am running over more than 2 million events, reading 40 branches per event).

==10523== HEAP SUMMARY:
==10523==     in use at exit: 51,720,624 bytes in 111,004 blocks
==10523==   total heap usage: 674,328 allocs, 563,324 frees, 1,995,478,219 bytes allocated
==10523== 
==10523== Searching for pointers to 111,004 not-freed blocks
==10523== Checked 59,577,112 bytes
==10523== LEAK SUMMARY:
==10523==    definitely lost: 7,040 bytes in 44 blocks
==10523==    indirectly lost: 0 bytes in 0 blocks
==10523==      possibly lost: 345,968 bytes in 5,830 blocks
==10523==    still reachable: 51,367,616 bytes in 105,130 blocks
==10523==                       of which reachable via heuristic:
==10523==                         stdstring          : 1,268,902 bytes in 24,173 blocks
==10523==                         newarray           : 24,408 bytes in 41 blocks
==10523==         suppressed: 0 bytes in 0 blocks
==10523== 
==10523== ERROR SUMMARY: 16557 errors from 378 contexts (suppressed: 0 from 0)
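
To put these numbers in proportion, the following sketch recomputes the totals from the summary above. The struct, function and constant names are hypothetical; only the byte counts come from the log. Leaks by themselves do not usually cause a segmentation violation, so any invalid reads or writes reported earlier in the valgrind log are likely more relevant than these totals.

```cpp
#include <cassert>
#include <cstdint>

// Byte counts per category, as printed in a valgrind LEAK SUMMARY.
struct LeakSummary {
    std::int64_t definitelyLost;
    std::int64_t indirectlyLost;
    std::int64_t possiblyLost;
    std::int64_t stillReachable;
};

// Sum of all categories; should match the "in use at exit" figure.
std::int64_t TotalInUse(const LeakSummary& s) {
    return s.definitelyLost + s.indirectlyLost + s.possiblyLost + s.stillReachable;
}

// Fraction of the bytes in use at exit that are classified as definitely lost.
double DefinitelyLostFraction(const LeakSummary& s) {
    const std::int64_t total = TotalInUse(s);
    assert(total > 0);
    return static_cast<double>(s.definitelyLost) / static_cast<double>(total);
}

// Values copied from the summary of process 10523 above.
const LeakSummary kRun10523{7040, 0, 345968, 51367616};
```

TotalInUse(kRun10523) gives 51,720,624 bytes, matching the HEAP SUMMARY line, and the definitely lost fraction is roughly 0.014 percent; almost everything else is still reachable at exit.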

I used the following command options for valgrind:

valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         --verbose \
         --log-file=valgrind-out.txt 

Dear Romain,
The problem may be in PROOF, which has been in legacy mode for a while and is therefore tested only through version changes, or in the ‘wrapper’ around TTreeReader that you have implemented.
This is not easy to debug, especially without a debug build (running valgrind on an optimized build does not tell you much).

PROOF-Lite has been replaced by RDataFrame. I wonder if it would not be better to invest the time to move to it instead of debugging an old technology.
Your code would also probably look much simpler.
The documentation in Data frames - ROOT and the example ROOT: tutorials/dataframe/df101_h1Analysis.C File Reference (and others in the same directory) can show how to translate an analysis running on a TChain and TSelector into RDataFrame.

G Ganis

Hi @ganis,
Alright, thanks for your advice.
I am not sure it can be done with RDataFrame, because the selection that I want to do is not super simple,
but I will have a look.

(just as a side-note, RDataFrame supports arbitrarily complex selections: you can use any and all C++ code in Filters and Defines)

As in: you can have a whole C++ library doing the selections :) We just use strings to demonstrate simple use cases. But e.g. tutorials/dataframe/df103_NanoAODHiggsAnalysis.C shows the full power of how you can stitch complex procedures together.
