Segmentation violation when snapshotting a large RDataFrame

When trying to make a snapshot of a very large RDataFrame, ROOT segfaults. I am using ROOT 6.16/00 on slc6 as provided by the LCG. Specifically, it crashes when trying to merge /DoubleMuon/Run2017B-31Mar2018-v1/NANOAOD or /SingleMuon/Run2017B-31Mar2018-v1/NANOAOD into a single file (usually, of course, we would be applying an event selection; this is just for an MWE).

MWE:

#include <ROOT/RDataFrame.hxx>

#include <string>
#include <vector>

int main()
{
    // ROOT::EnableImplicitMT();  // makes no difference
    const std::vector<std::string> files{"path/to/NANOAOD/*.root"};
    ROOT::RDataFrame d{"Events", files};
    // Write every branch of the input tree to a single output file.
    d.Snapshot("Events", "skim.root");
}

When trying this, ROOT crashes with:

 *** Break *** segmentation violation
    __boot()
    import os

Is it possible I am hitting some tree/file size limit that I must increase? The crash seems to occur once the output file has reached ~1GB in both cases.


ROOT Version: 6.16/00 LCG_95 x86_64-slc6-gcc8-opt
Platform: slc6
Compiler: gcc 8.2


What happens if you don’t use { } but ( ) to initialize the RDataFrame, and use a different name for the output tree, like "Events_out"? Does ROOT still crash? You can also try to create a TChain beforehand and use that to create the RDataFrame. Since we don’t have the files, would you be able to post a full stack trace of the crash?

I have tried using (), a TChain, and a different name for the TTree with no luck. Here is a backtrace from gdb:

#0  0x00007ffff5701e88 in ROOT::Internal::TTreeReaderValueBase::EReadStatus ROOT::Internal::TTreeReaderValueBase::ProxyReadTemplate<&ROOT::Detail::TBranchProxy::ReadNoParentNoBranchCountNoCollection>() ()
   from /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/lib/libTreePlayer.so
#1  0x00007ffff56fe2ad in ROOT::Internal::TTreeReaderValueBase::GetAddress() () from /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/lib/libTreePlayer.so
#2  0x0000000000561eea in TTreeReaderValue<bool>::Get (this=0x9a1e270) at /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/include/TTreeReaderValue.h:152
#3  0x0000000000560b7a in ROOT::Internal::RDF::RColumnValue<bool>::Get<bool, 0> (this=0x98e43d0, entry=1613076) at /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/include/ROOT/RDF/RColumnValue.hxx:140
#4  0x00007fffe6f22d26 in ?? ()
#5  0x00007fffffff40d0 in ?? ()
...

Hi,
we collaborate with CMS users that run RDF on NanoAODs routinely with no issues, so the problem must be specific to your setup.

Given the minimal code reproducer that you posted in your first message, if you could also share (privately) the dataset that causes the issue it would let us debug the crash properly.

Otherwise, we need to find another way to either reproduce your issue locally or get more information about your crash (e.g. you could use lcg/views/LCG_95/x86_64-slc6-gcc8-dbg instead of *-opt – a debug build of ROOT).

Cheers,
Enrico

Hi, thanks for the interest in my issue. I will try to reproduce the error on our CentOS 7 system and generate some ROOT files which can be shared with the same problem.

For now, here is the stack trace from the debug build. Let me know if any other information would be useful.

#0  ROOT::Detail::TBranchProxy::ReadNoParentNoBranchCountNoCollection (this=0xadad8f0)
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT-6.16.00-build/include/TBranchProxy.h:297                                                                                              
#1  0x00007ffff5128410 in ROOT::Internal::TTreeReaderValueBase::ProxyReadTemplate<&ROOT::Detail::TBranchProxy::ReadNoParentNoBranchCountNoCollection> (this=0x7b9d1f0)                                                                                                          
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT/6.16.00/tree/treeplayer/src/TTreeReaderValue.cxx:143                                                                                  
#2  0x00007ffff51245cf in ROOT::Internal::TTreeReaderValueBase::ProxyReadDefaultImpl (this=0x7b9d1f0)
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT/6.16.00/tree/treeplayer/src/TTreeReaderValue.cxx:197                                                                                  
#3  0x00007ffff5115014 in ROOT::Internal::TTreeReaderValueBase::ProxyRead (this=0x7b9d1f0)
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT-6.16.00-build/include/TTreeReaderValue.h:62                                                                                           
#4  0x00007ffff51247e1 in ROOT::Internal::TTreeReaderValueBase::GetAddress (this=0x7b9d1f0)
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT/6.16.00/tree/treeplayer/src/TTreeReaderValue.cxx:252                                                                                  
#5  0x00007fffe5d5629d in ?? ()
#6  0x0000000007b9d1f0 in ?? ()
#7  0x00000000080f0c88 in ?? ()
#8  0x0000000007b9d1f0 in ?? ()
#9  0x00007fffe5fbb56c in ?? ()
#10 0x0000000007b9d1f0 in ?? ()
#11 0x00000000080f0c88 in ?? ()
#12 0x00007fffffff4420 in ?? ()
#13 0x00007fffe5d7d263 in ?? ()
#14 0x00007fffffff4470 in ?? ()
#15 0x00007fffe5fbc93c in ?? ()
#16 0x00007fffffff4420 in ?? ()
#17 0x00007fffffff4470 in ?? ()
#18 0x00000000080f0c80 in ?? ()
#19 0x0000000000188711 in ?? ()
#20 0x00000000080f0c80 in ?? ()
#21 0x00007fffffff4470 in ?? ()
#22 0x00007fffffff4490 in ?? ()
#23 0x00007fffe5f88636 in ?? ()
#24 0x00007fffffff4460 in ?? ()
#25 0x00007ffff508fb1c in TNotifyLink<ROOT::Detail::TBranchProxy>::Notify (this=0x7fffe5fbb56c)
    at /mnt/build/jenkins/workspace/lcg_release_tar/BUILDTYPE/Debug/COMPILER/gcc8binutils/LABEL/slc6/build/projects/ROOT-6.16.00/src/ROOT-6.16.00-build/include/TNotifyLink.h:101                                                                                               
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Hi,
from the stack trace it would seem that TTreeReader (internally used by RDF to read TTree branches) has trouble with one of your branches. To find out which one, you can run your snippet within gdb and inspect the relevant TTreeReaderValueBase data member at the point of the crash.

Are there no warnings or errors printed to the screen before the crash?
Is this a proper nanoAOD file that you are running on, or a custom root file you produced?

The quickest would definitely be to have a data sample that we can run on to reproduce the problem.

Cheers,
Enrico

Could you provide more instructions on how to find the branch name? I know how to generate a stack trace from gdb but not much more than that :pensive:

There were some warnings generated at startup:

TClass::Init:0: RuntimeWarning: no dictionary for class edm::ParameterSetBlob is available
TClass::Init:0: RuntimeWarning: no dictionary for class edm::ProcessHistory is available
TClass::Init:0: RuntimeWarning: no dictionary for class edm::ProcessConfiguration is available
TClass::Init:0: RuntimeWarning: no dictionary for class pair<edm::Hash<1>,edm::ParameterSetBlob> is available

but as far as I am aware no instances of these classes are stored in the Events tree. It is also possible to save the tree (all branches) once a stricter event selection has been performed so it does not seem to be a specific branch.

I am running over unmodified nanoAOD files.

Hi Corin,
my guess seems to be wrong if a stricter selection is enough to make the crash go away.
Those warnings are indeed harmless.

The simplest way to move forward would be for you to share the problematic file (you could just put it on cernbox, for example, and share it with me).

Otherwise, another simple test that you could perform is to run your program with ROOT master instead of 6.16. You can find a nightly build at /cvmfs/sft.cern.ch/lcg/views/ROOT-latest.

Cheers,
Enrico

Hi,
thank you for sharing a simple runnable reproducer.

gdb shows the crash happens when reading the HLT_HcalIsolatedbunch branch. Checking your files, it seems that a couple (9EB58CB8-1C47-E811-9382-FA163E67A014.root and 324D0DE2-BC44-E811-BA69-FA163EFC9F83.root) are missing that branch. That would be enough to explain the crash: Snapshot sees that branch in the first TTree it processes, sets up reading and writing of that branch, but at some point there is no input branch anymore.
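A quick way to spot this kind of mismatch before running Snapshot is to compare the branch names of every input file. The following is only a minimal sketch of that comparison, not part of ROOT: the names BranchesPerFile and FindMissingBranches are hypothetical, and it assumes the per-file branch names have already been collected (for example by looping over TTree::GetListOfBranches() for the Events tree of each file).

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper type: maps each input file name to the set of
// branch names found in its tree. Filling it from ROOT is left to the caller.
using BranchesPerFile = std::map<std::string, std::set<std::string>>;

// For every branch seen in at least one file, list the files that lack it.
// Branches present in all files do not appear in the result.
std::map<std::string, std::vector<std::string>>
FindMissingBranches(const BranchesPerFile& branchesPerFile)
{
    // Union of all branch names across the inputs.
    std::set<std::string> allBranches;
    for (const auto& entry : branchesPerFile)
        allBranches.insert(entry.second.begin(), entry.second.end());

    // Record each (branch, file) pair where the file has no such branch.
    std::map<std::string, std::vector<std::string>> missing;
    for (const auto& branch : allBranches)
        for (const auto& entry : branchesPerFile)
            if (entry.second.count(branch) == 0)
                missing[branch].push_back(entry.first);
    return missing;
}
```

An empty result means all files expose the same branches. Any entry names a branch together with the files that lack it, which is the information the segfault does not give.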

Ah, very strange that some files would be without it; I will have to look into why. Will it be possible for root to handle this more gracefully in future (e.g. throw a runtime error with the relevant branch and file)?

Could you share your gdb workflow so I can identify any other branches with similar issues myself?

Will it be possible for root to handle this more gracefully in future (e.g. throw a runtime error with the relevant branch and file)?

Ideally yes, I opened a jira ticket about this.

Could you share your gdb workflow so I can identify any other branches with similar issues myself?

I used backtrace to check which frame corresponded to a TTreeReaderValue or TTreeReaderValueBase method call, switched to it with frame 3, then printed the contents of the TTreeReaderValue object with print *this. That output is ugly, but if you squint you can see a TString data member fBranchName somewhere in there. I had the program crash 2 or 3 times and the branch name was the same every time, so I checked whether the branch existed in all files.

Cheers,
Enrico
