Access RDataFrame column in function without passing argument

lcorcodilos · September 1, 2019, 8:33pm

Hi all,

I know that when using RDataFrame from Python, I can write a block of C++ code that defines a function, declare it to the gInterpreter, and then call it in Define or Filter with column names as the arguments.

What I’d like to do is remove the argument from the function (say the column name will always be the same no matter the implementation) but still have the column value accessible in the function. For example, instead of doing

rdf.Define("newvar", myFunc(var))

I’d like to do

rdf.Define("newvar",myFunc())

where var is a column in rdf. What I can’t figure out is how to access var in myFunc() without passing it as an argument. Any ideas?

Thanks!

couet · September 2, 2019, 6:18am

I think @eguiraud can help you.

eguiraud · September 2, 2019, 10:15am

Hi,
If the variable you want to use in the C++ function comes from Python, some trickery is required; see the example code here: RDataFrame: Defining new column evaluated as a function of external values in Python

Basically, you want to let the C++ interpreter know the value of the Python expression by evaluating it with TPython::Eval.

We know this is not optimal. Experimental PyROOT allows passing python functions to RDF, but it’s only worth it performance-wise for simple python lambdas.
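
When the external value is a plain number that is fixed before the event loop starts, a simpler route than TPython::Eval may be to format the value into the C++ source before declaring it. The sketch below only builds the source string; the names make_threshold_code, kThreshold and analyzer::passes are hypothetical, and the hand-off to ROOT is left as a comment so the snippet runs without ROOT installed.

```python
def make_threshold_code(threshold: float) -> str:
    """Return C++ source with a Python-side value baked in as a literal."""
    # repr() keeps the full precision of the Python float.
    # Assumption: threshold is a finite number (inf/nan would not be valid C++).
    return (
        "namespace analyzer {\n"
        f"    const float kThreshold = {threshold!r};\n"
        "    bool passes(float pt) { return pt > kThreshold; }\n"
        "}\n"
    )

code = make_threshold_code(200.0)
print(code)

# With ROOT available, the string would then be declared once and used, e.g.:
#   ROOT.gInterpreter.Declare(code)
#   rdf.Filter("analyzer::passes(FatJet_pt[0])")
```

This only works for values that stay constant for the whole event loop; anything that changes per event still has to come in as a column argument.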

Cheers,
Enrico

lcorcodilos · September 5, 2019, 7:15pm

Hi Enrico,

Thanks for the reply. I don’t think I communicated my question effectively because of the way I formatted the Define lines in the original post, so let me try again with a fuller example and correct formatting (I’m using RDataFrame with NanoAOD in the example, just to give some context).

Say I have a multi-line string that is my C++ code like this:

namespace analyzer {
    // Returns the pt it is given (float, so the value is not truncated to bool)
    float myFunc(float pt) {
        return pt;
    }
}

and I pass it to the gInterpreter and call it in a Define command like:

rdf = RDataFrame(...)
new_rdf = rdf.Define("newvar", "analyzer::myFunc(FatJet_pt[0])")

But since I only ever want to return the pt from the FatJet collection, I’d like to simplify things and bake the collection name into myFunc. So instead I want something like,

namespace analyzer {
    // Intended usage only: FatJet_pt is not declared here, so this does not compile
    float myFunc(int index) {
        return FatJet_pt[index];
    }
}

and

rdf = RDataFrame(...)
new_rdf = rdf.Define("newvar", "analyzer::myFunc(0)")

The question is - how do I tell the myFunc code about the vector stored in the FatJet_pt column of the data frame? Because if I try to call it as I’ve written it, it can’t compile because FatJet_pt is undefined.

Thanks!

eguiraud · September 6, 2019, 8:47am

Hi,
I don’t think I understand what you want to do, sorry… FatJet_pt is a dataset column, right? Then how can myFunc use it just like that, if it doesn’t even exist before the event loop starts and it’s a different object for every event? It would seem more natural (but I’m probably misunderstanding) for myFunc to take FatJet_pt as an additional argument, which will take different values for different events.
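
One way to keep the call site short while still passing the column as an argument is to hide the column name in a small Python helper that builds the expression string. This is a minimal sketch, assuming myFunc is changed to take the whole collection plus an index (for example float myFunc(const ROOT::VecOps::RVec<float> &pts, int index)); the helper name fatjet_pt is hypothetical.

```python
FATJET_PT = "FatJet_pt"  # column name fixed in one place


def fatjet_pt(index: int) -> str:
    """Build the Define expression; the column is still passed as an argument."""
    return f"analyzer::myFunc({FATJET_PT}, {index})"


expr = fatjet_pt(0)
print(expr)  # analyzer::myFunc(FatJet_pt, 0)

# With ROOT available, the expression would be used as:
#   new_rdf = rdf.Define("newvar", expr)
```

The C++ function still receives a fresh FatJet_pt for every event; only the typing of the column name is moved out of the call site.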

lcorcodilos · September 6, 2019, 11:59am

Hi Enrico,

You actually understand the issue exactly.

I’d like to be able to do this sort of thing for the case of accessing 10 or more column values in a function at a time. It seems silly to have a function with more than 10 arguments if those arguments are always going to be the same.

So I guess the follow-up is: is it possible to create a placeholder to get myFunc to compile and then overwrite it with the column value as the rows are looped over? The column value has to be somewhere in memory, but is there any way to access it that doesn’t involve passing it as an argument? Maybe the answer is “no”, but I wanted to ask since you’re the expert!

eguiraud · September 6, 2019, 12:31pm

But isn’t the whole point of the thread precisely that they are not going to be the same and you would need to update the values during the event loop?

I don’t know…it’s possible to write something like this (C++, but something similar is available for PyROOT, see the other thread I linked above), maybe you can adapt it to fit your needs:

int some_quantity = 4;
// lambda which doesn't take any argument
auto cpp_lambda_that_depends_on_some_quantity = [some_quantity]() { return some_quantity; };
...
auto df2 = df.Define("newvar", cpp_lambda_that_depends_on_some_quantity);
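
The capture in the lambda above is copied once, when the lambda is created, so it cannot follow a column that changes every event. A plain-Python analogy (no ROOT involved; make_constant, with_arg and the toy events list are made up for illustration):

```python
def make_constant(some_quantity):
    # The value is bound when the closure is created, like a by-value capture.
    return lambda: some_quantity


# Toy stand-in for two events of a dataset with a single column.
events = [{"FatJet_pt": 310.0}, {"FatJet_pt": 250.0}]

no_arg = make_constant(4)   # takes no argument, sees only the captured value
with_arg = lambda pt: pt    # receives the column value for each event

constant_values = [no_arg() for _ in events]
per_event_values = [with_arg(event["FatJet_pt"]) for event in events]

print(constant_values)   # the same captured value for every event
print(per_event_values)  # follows the column, one value per event
```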

But the issue remains that the only things you can do during the event loop managed by RDataFrame are those you specified in Filter/Define/Foreach.
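
For the case of ten or more columns raised earlier in the thread, the argument list can be kept in one place and the call string generated from it, so the function still receives every column per event but the names are typed only once. A sketch under assumptions: call_expr is a hypothetical helper, and the column tuple is an illustrative subset of NanoAOD FatJet branches.

```python
# Illustrative column list; in practice this would hold the ten or more names.
JET_COLUMNS = ("FatJet_pt", "FatJet_eta", "FatJet_phi", "FatJet_mass")


def call_expr(func, columns, *extra):
    """Return 'func(col1, col2, ..., extra1, ...)' as a Define/Filter expression."""
    args = list(columns) + [str(value) for value in extra]
    return f"{func}({', '.join(args)})"


expr = call_expr("analyzer::myFunc", JET_COLUMNS, 0)
print(expr)

# With ROOT available:
#   new_rdf = rdf.Define("newvar", expr)
```

The C++ signature of analyzer::myFunc has to list the same parameters in the same order, so the long argument list does not disappear; it just lives in two well-defined places instead of at every call site.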

Hope this helps!
Enrico
