Using advanced filters with RDataFrame in Python

I am working on a cutflow using RDataFrame in Python and am unable to use more advanced constructs in a filter, such as a lambda expression. Even when written as a string, anything beyond a simple expression gives an error.

For example, this DOES work,

rdf.Filter( "particle_pt.size() > 1" )

This DOES NOT work,

 rdf.Filter( "bool([](){return true;})" )

For my purposes, I need access to more advanced control flow statements, such as for-loops, in order to build a more comprehensive and adaptive cutflow. Without this, I may not be able to continue with RDataFrame/Python.

The error is,

input_line_80:1:39: warning: expression result unused [-Wunused-value]
namespace __tdf_0{ auto tdf_f = []() {bool([](){return true;})
                                      ^    ~~~~~~~~~~~~~~~~~~
input_line_84:3:25: error: expected ';' after expression
bool([](){return true;})

As a check, I evaluated the same expression at the ROOT prompt to see what it returns,

 root [0] bool([](){return true;})
 (bool) true

Perhaps Filter is not meant to handle something like this. If not, is there another way to implement more advanced logic here?

Dale Abbott

ROOT Version: 6.14
Python: 3.6
Platform: Not Provided
Compiler: Not Provided

That code does not do what you think it does: try replacing true with false in the body of your lambda :smile:

In any case, what you can put in string filters and defines is the body of a C++ function/lambda, possibly with variable names replaced by branch/column names. For example you can write something like this:

df.Filter("std::cout << \"I'm at entry \" << tdfentry_ << std::endl; return sqrt(var1 + var2) < 2;");

(in ROOT master you can also use rdfentry_ instead of tdfentry_ to indicate the current entry number)
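
As a rough illustration of the "body of a lambda" idea, the sketch below writes out by hand a plain C++ lambda whose body contains a variable declaration, a for-loop and an explicit return. It is not the code ROOT actually generates. The column name particle_pt is taken from the question, while the name pass_cut and the 20.0 threshold are made up for the example. In a string filter, only the lines between the two marker comments would go inside the string.

```cpp
#include <cstddef>
#include <vector>

// Hand-written stand-in for the lambda that a string filter is compiled into.
// `pass_cut` and the 20.0 threshold are hypothetical; `particle_pt` plays the
// role of a column holding one value per particle.
auto pass_cut = [](const std::vector<float>& particle_pt) {
    // --- everything below could be the content of the filter string ---
    std::size_t n_hard = 0;
    for (float pt : particle_pt) {
        if (pt > 20.0f) {
            ++n_hard;
        }
    }
    return n_hard > 1;  // explicit return, as in the multi-statement example above
    // --- end of the part that would go in the string ---
};
```

Because the body is ordinary C++, anything that is valid inside a function (loops, early returns, local variables) should be usable the same way.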

Hope this helps!

Treating it as the body of a lambda was helpful. I was able to use variable declarations, control statements and returns, so I believe you have answered my question.

Thank you,

Good, marking as solved then!

Some extra notes: in practice we take the C++ code you write in those strings and just-in-time compile a C++ lambda with that body. The only preprocessing we do is substituting column names with unique, valid placeholder variable names.

What bool([](){return true;}) does is to convert a lambda temporary to a function pointer, and then convert that function pointer to the boolean true. The expression always evaluates to true, independently of the body of the lambda.
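
This conversion chain can be checked in plain C++ without ROOT. In the sketch below the function names are invented for the example. Both lambdas are captureless, so each converts to a non-null function pointer, and that pointer converts to true whatever the lambda body would return. Only actually calling the lambda evaluates its body.

```cpp
// Converting a captureless lambda to bool never calls the lambda:
// lambda -> function pointer (non-null) -> true.
inline bool converted_true_body()  { return bool([]() { return true; }); }
inline bool converted_false_body() { return bool([]() { return false; }); }

// Calling the lambda (note the trailing parentheses) is what evaluates its body.
inline bool called_false_body() { return []() { return false; }(); }
```

Some compilers warn that such a pointer-to-bool conversion always evaluates to true, which is a useful hint that the expression is not doing what was intended.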

Nevertheless, it is a valid C++ expression, so it's a bit odd that we throw an error in that case. The reason is that the string contains the keyword return, which makes our parser assume that you are explicitly returning something from your code, so it does not add a return keyword to the lambda that we just-in-time compile. That assumption is wrong in this case.


I remember a post somewhere here showing a way to load a C++ header (or perhaps source?) when using RDataFrame from PyROOT. I can’t find it anymore, do you know of such a method?

I guess you are looking for ROOT.gInterpreter.Declare or ROOT.gInterpreter.ProcessLine.

(please open a new thread rather than necrobumping :smile: )

It came up in a conversation with @dabbott offline, which is why I posted in this thread.
I think Declare() is what I was thinking of, thanks.
