SegBreak when Filter() on certain element for variable-size vector branch

Hello, ROOT experts,

I have a tree with a branch lepton_pt of type std::vector that has 4 elements for some events but 0 for others. The cut I want to apply is > 10 on the first element.

With TTree::Draw() and the condition "lepton_pt[0] > 10", the code runs fine.
But if I use RDataFrame::Define("ptlep1", "lepton_pt[0]") and then RDataFrame::Filter("ptlep1 > 10"), it will seg break.

I think I can use a lambda function to check the size of the vector first, store it as nlep, and then use Filter("nlep>0").Filter("ptlep1>10") to avoid the seg break. But first, I'm not sure how well this chained filter performs. Second, is it intended that TTree and RDataFrame behave differently, which steepens the learning curve?

Furthermore, the condition string in Filter("") doesn't seem to be evaluated sequentially. Usually, if the first condition fails, the second is not checked. But if I use Filter("nlep>0 && ptlep1 > 10") it will still seg break. Maybe that's just a feature of some parallel processing of cuts?

-RK

environment:
source /cvmfs/sft.cern.ch/lcg/views/LCG_latest/x86_64-slc6-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/releases/ROOT/6.18.00-2fcb1/x86_64-slc6-gcc8-opt/bin/thisroot.sh


ROOT Version: 6.18.00
Platform: slc6
Compiler: gcc8


hi,

a very concrete suggestion: why not define a ptlep0 which is 0 if the length of the collection is zero? Alternatively, you could use the string "lepton_pt.size() > 0 && lepton_pt[0] > 10" if lepton_pt is a C array or a vector (if not, you can invoke the appropriate method corresponding to size). It would be nice and constructive to report to the ROOT team any performance penalty associated with either approach.

hope this helps,

P

Hello,

Thanks for the quick response! It was quite helpful! I did another check, and it's actually the Define("ptlep0", "lepton_pt[0]") that causes the seg break, even if "lepton_pt.size()>0" is put in the same filtering condition. If I call Filter("lepton_pt[0] > 10") directly, there's actually no issue.

At first look, Filter().Filter() doesn't cost significantly more time either.

-RK

Hi,
I just want to add some background about what’s going on: RDataFrame is fundamentally different from TTree::Draw, by design: the former only uses pure C++ as Filter conditions and Define expressions – the latter employs a domain-specific-language that performs several under-the-hood transformations on behalf of the user.

When you write "lepton_pt[0] > 10" in TTree::Draw, you're not writing C++: that's a TTree::Draw condition that is parsed and translated into code that also adds a check for lepton_pt.size() > 0 for you.

When you write the same string in a RDF Filter, that's exactly the C++ that is executed. In fact, as per the users guide, df.Filter("lepton_pt[0] > 0") is functionally equivalent to df.Filter([](const RVec<float> &lepton_pt) { return lepton_pt[0] > 0; }, {"lepton_pt"}), and if lepton_pt does not have a 0-th entry, that will typically result in a segfault.

@Pnine suggested two ways to make the code safe: add a Filter so that the computation only proceeds if lepton_pt.size() > 0, or take the full array but guard the element access so that non-existing elements are never read: lepton_pt.size() > 0 ? lepton_pt[0] : 0 (if 0 makes sense as a fallback value). The latter should be slightly faster than the former, as there is one fewer Filter to invoke, but the difference should be minimal.
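To make the two options concrete, here is a minimal sketch in plain C++. It uses std::vector<float> as a stand-in for the RVec<float> that RDataFrame would hand over, and the names PtVec, leadingPtAbove and leadingPtOr are made up for illustration; they are not part of ROOT.

```cpp
#include <vector>

// Stand-in for the column type (assumption: in RDataFrame this would be an RVec<float>).
using PtVec = std::vector<float>;

// Guarded filter condition: && short-circuits, so pt[0] is only
// evaluated when the vector is non-empty.
bool leadingPtAbove(const PtVec &pt, float threshold)
{
   return !pt.empty() && pt[0] > threshold;
}

// Guarded value for a Define: events without leptons get the fallback
// instead of an out-of-bounds read.
float leadingPtOr(const PtVec &pt, float fallback)
{
   return pt.empty() ? fallback : pt[0];
}
```

As expression strings, these correspond to Filter("lepton_pt.size() > 0 && lepton_pt[0] > 10") and Define("ptlep0", "lepton_pt.size() > 0 ? lepton_pt[0] : 0"). The fallback only makes sense if it cannot be mistaken for a real value by later cuts.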

Cheers,
Enrico

thanks for the great answer!!

Thanks a lot, Enrico!

This is indeed very complete.
Although I did find something different from your description:

df.Filter("lepton_pt[0] > 0") doesn't actually cause a seg break. So there might already be some size check for that? This kind of check is not done for df.Define("ptlep0", "lepton_pt[0]"), though.

Checking the size when defining a new variable feels like the better approach to me, as the definition is more transferable to other scripts than an additional filter string. Thanks again to @Pnine, too!

-RK

There is no check at all (I wrote the code! :smiley:). I think you are just being unlucky: out-of-bounds access in C++ is not required to error out or segfault; it's undefined behavior.
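For debugging this kind of problem, one option is to make the failure deterministic with bounds-checked access. The sketch below uses std::vector::at(), which throws std::out_of_range instead of invoking undefined behavior; the helper name checkedAccessThrows is hypothetical, and whether at() is available on RVec in a given ROOT version is an assumption to verify against the reference guide.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if reading element `index` through at() throws std::out_of_range.
// Unlike v[index], at() is required to check the bounds.
bool checkedAccessThrows(const std::vector<float> &v, std::size_t index)
{
   try {
      (void)v.at(index); // bounds-checked read; the value itself is not needed
   } catch (const std::out_of_range &) {
      return true;
   }
   return false;
}
```

With v[index] on an empty vector, the same read may crash, return garbage, or appear to work, which would be consistent with the Filter case seeming fine while the Define case crashes.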

Cheers,
Enrico
