RDataFrame: Is it possible to use TCuts in Filter?

Dear all,

I am looking to optimise some code which uses
CopyTree(selection.GetTitle()), where selection is a TCut.

I’ve heard RDataFrame is a much faster alternative to CopyTree, but I have been unable to find a way of applying a TCut selection to the dataframe. Is there a way to use Filter() with a TCut string containing && and ||?

Thanks,
Park

Hi Park,
unfortunately RDataFrame does not understand TCuts: one of the reasons it is faster than the alternatives is that Filters use actual (possibly just-in-time-compiled) C++ code.

However, you can often take the string that you were passing to your TCut and convert it to a C++ expression with little effort: e.g. if you have a branch called “x”, you can just use df.Filter("x > 0") and that will select the entries for which the condition is true.

I hope this clarifies/helps a bit.
Cheers,
Enrico

Thanks Enrico, that does help. Not sure that RDataFrame is the best in this case as the cut string is quite complex.

Hi @parkfield, I might be wrong, but isn’t it enough to do

TCut cut(bloodystring);
ROOT::RDataFrame df(ttree);
df.Filter(cut.GetTitle()).Snapshot(treename, filename);

i.e. create a local copy of the TTree in a temporary TFile and just reopen it?
The gain I observed with this is that the data reduction stage is way faster and more modular (you can also snapshot just the columns you need for the actual work later, plus the Snapshot can run multi-threaded). @eguiraud correct me if I am wrong, but I heavily rely on a Filter followed by a Snapshot to perform a CopyTree with a cut, and it’s noticeably faster than the raw CopyTree function on lxplus when loading files from /eos.

That’s fine if cut.GetTitle() returns an expression that is valid C++ and corresponds to the logic you expect. Since GetTitle() is not meant to produce C++ expressions (although logical expressions on branches often look like valid C++ expressions), this should work most of the time, will produce just-in-time compilation errors if GetTitle() returns invalid C++, and might silently do something unexpected if TCut and a C++ compiler happen to parse a cut expression differently (I can’t produce an example of this case; it might never happen in practice, but it’s important to mention it).

For example, a cut.GetTitle() might return branch.subbranch[0] > 3 && otherbranch / 2 > 0, which is a perfectly fine RDataFrame filter string (branch names as variable names are fine).

@eguiraud what I also do sometimes, to make multiple filters and snapshots in parallel, is to define the cut expression directly as a column and filter on it afterwards.
In practice I define it as

df.Define("cutslice1", "CutBloodyExpression == 0 ? -1 : 1")

and add other Defines for the other cuts.
At the end I do a lazy set of snapshots with

df.Filter("cutslice1 > 0").Snapshot(...)

etc.
I don’t get how this can silently fail. I mean, either the TCut expression is wrong or what else? If it’s wrong it will fail for CopyTree as well, I guess. Or are you saying that the just-in-time C++ compilation might misunderstand the expression itself?

I don’t get how this can silently fail.

  • Filter and Define expect a C++ expression as a string, with branch names as variable names.
  • TCut::GetTitle returns the cut string, which is written in the same syntax as TTree::Draw selections if I’m not mistaken.

Not all TTree::Draw selections are valid C++ (it’s a domain-specific language), but most simple selections will have the exact same meaning in C++. I don’t know whether it’s possible to construct a TCut that is valid C++ but has a different behavior when parsed as C++ rather than as TCut.
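As an illustration of the "not valid C++" case, one could run a crude pre-check on the cut string before passing it to Filter. The helper below is hypothetical (it is not a ROOT function) and only looks for the characters '$' and '@', which TTree::Draw uses in specials such as Entry$, Sum$ or @size; it cannot prove that a string is valid C++.

```cpp
#include <string>

// Hypothetical pre-check, not part of ROOT: returns true if `cut` contains
// characters used by TTree::Draw-only syntax ('$' as in Entry$ or Sum$,
// '@' as in @size). It only catches these obvious cases.
bool has_draw_only_syntax(const std::string &cut)
{
   return cut.find_first_of("$@") != std::string::npos;
}
```

A cut like "Entry$ < 100" would be flagged, while "x > 0 && y < 3" would pass through to Filter unchanged.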
@pcanal might be able to comment with more authority.

Cheers,
Enrico

It should not be possible to do so.

To be more specific … “almost not possible” …

Besides the fact that it sometimes won’t compile (Entry$, Sum$, @size, or most uses of array notation), there is one case where the behavior can differ:
if an array is passed to a function, and that function has both an overload taking a numerical argument (which will be used by TTree::Draw) and an overload taking a collection argument (which will be used by the compiled code … maybe). I.e. for MyFunction(array), TTree::Draw will call the function once for each element.
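The difference can be sketched in plain C++, without ROOT. Everything here is an assumption made for illustration: the two MyFunction overloads, their bodies, and the use of std::vector in place of the collection type RDataFrame would actually hand to the function.

```cpp
#include <cmath>
#include <vector>

// Hypothetical function with a scalar and a collection overload.
double MyFunction(double v) { return std::abs(v); }

double MyFunction(const std::vector<double> &vs)
{
   double sum = 0.;
   for (double v : vs) sum += v;
   return std::abs(sum);
}

// Mimics the TTree::Draw behavior described above: one call per element,
// so the scalar overload is used.
std::vector<double> per_element(const std::vector<double> &array)
{
   std::vector<double> out;
   for (double v : array) out.push_back(MyFunction(v));
   return out;
}

// What a C++ compiler does with MyFunction(array): overload resolution
// picks the collection overload and calls it once.
double as_compiled(const std::vector<double> &array)
{
   return MyFunction(array);
}
```

For the array {-1, 2, -3}, the per-element evaluation gives 1, 2 and 3, while the single compiled call gives 2, so the same expression text produces different selections.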

Hi @pcanal, so for non-array branches it will always work properly, if I understand correctly. Out of curiosity, I wonder how the just-in-time compilation of TMath::Sqrt, TMath::Sq, TMath::Min, TMath::Max in a TCut expression works: how is RDataFrame able to load those functions into the just-in-time-compiled C++ code? Is the RDataFrame expression compiled against all the ROOT libraries, so that you are allowed to use any C++ expression defined in any ROOT namespace?

Yes, RDataFrame uses cling, ROOT’s interpreter, so you can expect all ROOT libraries to be auto-loaded as needed, just like when using the ROOT prompt.
