How to debug RDataFrame event Loop segfault

Dear all,
I would like to understand if there is a standard approach to understand why ( after bookkeping many operations on a RDataFrame , several Take, Filter, Define,Histo1D,Histo2D for O(4500) ) and mapping the results to nested std::map<RResultPtr<double,bool,TH1D>> or RNode, i see crashes of the code with messages like the one posted at the end of the thread.
I tried reducing the complexity of the tasks ( bookkeping 100 for example) and it was working, i tried changing from EnableImplicitMT(10) to (4) on lxplus for the complex case and i still get the error.
The segfault happens inside the event loop when i call
*df.Count()
and i wonder if there are any tools or switch which can help me to understand what is going wrong.

I want to investigate why and which bookkeped operation is the source of the problem or if calling many Take blows up memory and trigger the segfault for example.
Any suggestion how to understand how to debug such problem would be helpful.

Thanks in advance
Renato

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Thread 6 (Thread 0x7f7f7d3fe700 (LWP 32133)):
#0  0x00007f7f88f41ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f7f8090260d in XrdSys::LinuxSemaphore::Wait (this=0x3b3aaa0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x3b8c548) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x3b8c530) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f7f80902939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f7f89d99e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7f88f4788d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7f7c079700 (LWP 32132)):
#0  0x00007f7f88f41ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f7f8090260d in XrdSys::LinuxSemaphore::Wait (this=0x3b3aaa0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x3b8c548) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x3b8c530) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f7f80902939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f7f89d99e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7f88f4788d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7f77878700 (LWP 32131)):
#0  0x00007f7f88f41ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f7f8090260d in XrdSys::LinuxSemaphore::Wait (this=0x3b3aaa0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x3b8c548) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x3b8c530) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f7f80902939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f7f89d99e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7f88f4788d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f7f73077700 (LWP 32130)):
#0  0x00007f7f89da0e5d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f7f80e18d4d in XrdSysTimer::Wait (mills=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdSys/XrdSysTimer.cc:239
#2  0x00007f7f808a76a8 in XrdCl::TaskManager::RunTasks (this=0x3a8b210) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClTaskManager.cc:244
#3  0x00007f7f808a7789 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClTaskManager.cc:37
#4  0x00007f7f89d99e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7f88f4788d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f7f72876700 (LWP 32129)):
#0  0x00007f7f88f47e63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f7f80e13087 in XrdSys::IOEvents::PollE::Begin (this=0x3b7f980, syncsem=<optimized out>, retcode=<optimized out>, eTxt=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysIOEventsPollE.icc:213
#2  0x00007f7f80e0f885 in XrdSys::IOEvents::BootStrap::Start (parg=0x7fff79026330) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdSys/XrdSysIOEvents.cc:131
#3  0x00007f7f80e18538 in XrdSysThread_Xeq (myargs=0x3ba84b0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdSys/XrdSysPthread.cc:86
#4  0x00007f7f89d99e65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7f88f4788d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f7f95dfa5c0 (LWP 31968)):
#0  0x00007f7f88f0e469 in waitpid () from /lib64/libc.so.6
#1  0x00007f7f88e8bf12 in do_system () from /lib64/libc.so.6
#2  0x00007f7f88e8c2c1 in system () from /lib64/libc.so.6
#3  0x00007f7f95200533 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-gcc8-opt/lib/libCore.so
#4  0x00007f7f95202d84 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-gcc8-opt/lib/libCore.so
#5  <signal handler called>
#6  0x00007f7e7750647c in ?? ()
#7  0x00007fff78d81000 in ?? ()
#8  0x0000000000000110 in ?? ()
#9  0x00007fff78d81020 in ?? ()
#10 0x00007f7e77505c0c in ?? ()
#11 0x00007fff78d81020 in ?? ()
#12 0x00007fff78d81120 in ?? ()
#13 0x00007fff78d81108 in ?? ()
#14 0x0000000000000110 in ?? ()
#15 0x00007fff78d81040 in ?? ()
#16 0x00007f7e77505333 in ?? ()
#17 0x0000000005ce0520 in ?? ()
#18 0x0000000000000000 in ?? ()
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  0x00007f7e7750647c in ?? ()
#7  0x00007fff78d81000 in ?? ()
#8  0x0000000000000110 in ?? ()
#9  0x00007fff78d81020 in ?? ()
#10 0x00007f7e77505c0c in ?? ()
#11 0x00007fff78d81020 in ?? ()
#12 0x00007fff78d81120 in ?? ()
#13 0x00007fff78d81108 in ?? ()
#14 0x0000000000000110 in ?? ()
#15 0x00007fff78d81040 in ?? ()
#16 0x00007f7e77505333 in ?? ()
#17 0x0000000005ce0520 in ?? ()
#18 0x0000000000000000 in ?? ()
===========================================================

_ROOT Version: 6.18/04
Platform: x86_64-centos7-gcc8-opt
Compiler: Not Provided

The code crashes also when i use DisableIMplicitMT() before running it but i still get

Thread 6 (Thread 0x7f1e0b93a700 (LWP 3364)):
#0  0x00007f1e16f49ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f1e0eee160d in XrdSys::LinuxSemaphore::Wait (this=0x383cae0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x299c378) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x299c360) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f1e0eee1939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f1e17da1e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f1e16f4f88d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f1e06658700 (LWP 3363)):
#0  0x00007f1e16f49ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f1e0eee160d in XrdSys::LinuxSemaphore::Wait (this=0x383cae0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x299c378) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x299c360) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f1e0eee1939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f1e17da1e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f1e16f4f88d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1e01e57700 (LWP 3362)):
#0  0x00007f1e16f49ba9 in syscall () from /lib64/libc.so.6
#1  0x00007f1e0eee160d in XrdSys::LinuxSemaphore::Wait (this=0x383cae0) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdSys/XrdSysLinuxSemaphore.hh:161
#2  XrdCl::SyncQueue<XrdCl::JobManager::JobHelper>::Get (this=0x299c378) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/./XrdCl/XrdClSyncQueue.hh:67
#3  XrdCl::JobManager::RunJobs (this=0x299c360) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:146
#4  0x00007f1e0eee1939 in RunRunnerThread (arg=<optimized out>) at /mnt/build/jenkins/workspace/lcg_release_latest/BUILDTYPE/Release/COMPILER/gcc8binutils/LABEL/centos7/build/externals/xrootd-4.10.0/src/xrootd/4.10.0/src/XrdCl/XrdClJobManager.cc:33
#5  0x00007f1e17da1e65 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f1e16f4f88d in clone () from /lib64/libc.so.6

I added a progress bar and the code breaks before starting it so i think the crash happens in the compilation of the expressions i am passing around.
I don’t know how to understand what is going on, i tried reducing the amoung of histograms the code has to produce from O(4000) to O(100) and it works. I don’t know if I am hitting a liti case where the RDataFrame just crash because i use too much memory.

I probably have solved the issue.
It was somehow related to have Too many Filters around and “reusing” expressions.
After refactoring the code to avoid stuff like :

auto df_current = df.Define( "fullSelection" , mySelectionFormua).Filter("fullSelection")
df_current.Take("fullSelection")

which was done Many many times, the code is now working.

Hi,
happy to hear you solved the problem.
When no jitting is involved, you can debug an RDF program just like any other C++ program. Just-in-time compiled expressions, however, are not accessible to the debugger, so more rudimentary “black box” approaches like helper printouts are often needed. What I typically do is try to find a minimal reproducer and then remove all jitting so I can simply use a debugger.

Cheers,
Enrico

Hi @eguiraud, if a given selection uses saying >40 obserables, how one can remove the jitting and run it in compiled mode?

Also, is there any plans to allow easier debugging of those black boxes ? or what is the suggested approach to avoid headaches of debugging black boxes coming from this?

I mean,

If my selection is :

selection = "B_PT > 1000 && B_ETA> 2 && E1_PT> 400 && E2_PT >400 && TMath::Min( E1_P, E2_P) > 3000 && ....."

which has many many single branches used around, i am not 100% sure how one can avoid in-time compilation of the expression.

if a given selection uses saying >40 obserables, how one can remove the jitting and run it in compiled mode?

That would need to become a function with 40 parameters.

The only feature I have in mind that might help in these situations is a GetFilterCode or GetDefineCode method that takes the expression as string and prints out the corresponding C++ code that would be just-in-time compiled. It might come at some point this year. We are open to suggestions though!

Cheers,
Enrico

1 Like

Hi,
If I understood correctly m.b. for your case would be nice use lambda instead strings:

void Analysis() {

// Conditions for events
auto Calc(ROOT::VecOps::RVec<Float_t> &E2_PT) {
          //all of yours branches)   
  if (E2_PT < 0)
    return -1.;
}


  TFile *f = new TFile("file.Data()");

  ROOT::RDataFrame rd("events", fd);

  ROOT::RDF::TH1DModel histModel(col1Name.Data(), col1Name.Data(), bins, xLow,
                                 xUp);

  auto hist = rd.Define("fullSelection", Calc,
                {"E2_PT"}) //should be the same list that in lambda function
          .Filter("fullSelection>0").Data())
          .Histo1D(histModel,"fullSelection");
  
    hist->DrawClone(drawOption.Data());
  }

In case of

ROOT::DisableImplicitMT()

you can debug it like usual.

Hope it’ll help.

Good luck.

Boris

@bdrum, yes to debug one can minimize the amount of branches to use, But what if one has O(100) branches used in the selection ?
Write a lambda with 100 arguments seems to be an overkill, and i guess this is the typical use case of RDataFrame. I agree that to minimize the problem one can re-write the selection to be way simpler, but in my analysis framwork the selection is “generated” by some code , so i need to modify my analysis selection to effectively debug it.

Just a tough, it might be quite complex to implement but in principle in a lengthy expression one has almost always parenthesis around and operators && , || .
I don’t know how helpful can be but already “decouple a 100 branches” dependent selection with O(30) operators into small pieces could help to automatize and avoid in-time compilation, just separating selection bits around , defining intermediate columns which are cutted later. Don’t know if that’s a stupid idea.

The problem with that is, users can write any valid C++ in filter/define expression strings. And sometimes users do more complex things than concatenating a bunch of simple operations. So in general we would need to keep the just-in-time compilation mechanism (and as soon as strings are just-in-time compiled, debuggers are blind). We could add an additional mechanism to manually parse the simpler expressions, break them down in simple operations and call the right functions by hand, but that would probably result in extremely inefficient event loops.

If your framework is generating code anyway, you could generate full C++ functions instead of just the selection expression and then pass that (plus a vector of 100 branch names, also dynamically generated by your framework) to RDF.

Note that you are already making an advanced usage of RDF, so I am proposing to get even slightly more advanced in order to reap large benefits: debuggers will work, and compiling with optimizations will get you a very large speed-up (I have seen up to 4x), because jitted code is not compiler-optimized.

Cheers,
Enrico

@eguiraud which CMake options (add_compiler_option) should i need to enable to get those 4x speed ups?
-O3 ? or what else?

In cmake it’s enough to specify -DCMAKE_BUILD_TYPE=Release, which I think translates to -O3 (-O2 would be sufficient too). Note that the 4x is for RDF code that spends most of its time within just-in-time compiled code vs RDF code that spend most of its time within code compiled with optimizations. See e.g. here. As per the RDF docs, expression strings are handy but they have a performance penalty.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.