I know that when using RDataFrame from Python, I can write a block of C++ code that defines a function, pass it to gInterpreter, and then use it in Define or Filter with column names as the arguments.
What I’d like to do is remove the argument from the function (say the column name will always be the same no matter the implementation) but still have the column value accessible in the function. For example, instead of doing
rdf.Define("newvar", myFunc(var))
I’d like to do
rdf.Define("newvar",myFunc())
where var is a column in rdf. What I can’t figure out is how to access var in myFunc() without passing it as an argument. Any ideas?
Basically, you want to let the C++ interpreter know the value of the Python function with TPython::Eval.
We know this is not optimal. Experimental PyROOT allows passing Python functions to RDF, but it’s only worth it performance-wise for simple Python lambdas.
Thanks for the reply. I don’t think I communicated my question effectively because of the way I formatted the Define lines in the original post, so let me try again with a fuller example and correct formatting (I’m using RDataFrame with NanoAOD in the example, just to give some context).
Say I have a multi-line string that is my C++ code like this:
But since I only ever want to return the pt from the FatJet collection, I’d like to simplify things and bake the collection name into myFunc. So instead I want something like,
The question is - how do I tell the myFunc code about the vector stored in the FatJet_pt column of the data frame? Because if I try to call it as I’ve written it, it can’t compile because FatJet_pt is undefined.
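The snippets from this post did not survive, so here is a rough sketch of the situation rather than the original code. It assumes `myFunc` takes the pt vector as an argument and returns the leading entry; the body of `myFunc`, the column name `newvar`, and the helper `add_newvar` are placeholders chosen for illustration.

```python
# Hypothetical reconstruction: the body of myFunc is an assumption,
# not the code from the original post.
CPP_CODE = """
#include "ROOT/RVec.hxx"
// Returns the first pt in the collection, or -1 if the collection is empty.
float myFunc(const ROOT::VecOps::RVec<float> &pt) {
    return pt.empty() ? -1.f : pt[0];
}
"""

# The column is named inside the expression string. RDataFrame JIT-compiles
# the string and supplies the per-event value of FatJet_pt as the argument.
EXPR_WITH_ARG = "myFunc(FatJet_pt)"


def add_newvar(rdf):
    """Declare myFunc to the interpreter and define a new column from it."""
    import ROOT  # imported lazily so the strings above can be inspected without ROOT
    ROOT.gInterpreter.Declare(CPP_CODE)
    return rdf.Define("newvar", EXPR_WITH_ARG)
```

Writing the expression as `"myFunc()"` instead would leave the C++ function with no way to see `FatJet_pt`, because the column only exists as a value that RDataFrame hands over for each event.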
Hi,
I don’t think I understand what you want to do, sorry. FatJet_pt is a dataset column, right? Then how can myFunc use it just like that, if it doesn’t even exist before the event loop starts and it’s a different object for every event? It would seem more natural (but I’m probably misunderstanding) for myFunc to take FatJet_pt as an additional argument, which will take different values for different events.
I’d like to be able to do this sort of thing for the case of accessing 10 or more column values in a function at a time. It seems silly to have a function with more than 10 arguments if those arguments are always going to be the same.
So I guess the follow-up is: is it possible to create a placeholder to get myFunc to compile and then overwrite it with the column value as the rows are looped over? The column value has to be somewhere in memory, but is there any way to access it that doesn’t involve passing it as an argument? Maybe the answer is “no”, but I wanted to ask since you’re the expert!
But isn’t the whole point of the thread precisely that they are not going to be the same and you would need to update the values during the event loop?
I don’t know… It’s possible to write something like this (C++, but something similar is available for PyROOT, see the other thread I linked above); maybe you can adapt it to fit your needs:
int some_quantity = 4;
// lambda which doesn't take any argument
auto cpp_lambda_that_depends_on_some_quantity = [some_quantity]() { return some_quantity; };
...
// Define takes the name of the new column first; "some_column" is a placeholder
auto df2 = df.Define("some_column", cpp_lambda_that_depends_on_some_quantity);
But the issue remains that the only things you can do during the event loop managed by RDataFrame are those you specified in Filter/Define/Foreach.
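One way to get close to the call-site syntax asked for earlier in the thread, while still passing the columns as arguments under the hood, is to build the expression string on the Python side. This is a minimal sketch, assuming a C++ `myFunc` with matching arguments has already been declared to the interpreter; every column name other than `FatJet_pt`, and the helper names `call_expr` and `my_func`, are made up for illustration.

```python
# Columns that myFunc always reads. Every name except FatJet_pt is a placeholder.
MYFUNC_COLUMNS = ["FatJet_pt", "FatJet_eta", "FatJet_phi", "FatJet_mass"]


def call_expr(func_name, columns):
    """Build a JIT expression string such as 'f(a, b)' from a column list."""
    return "{}({})".format(func_name, ", ".join(columns))


def my_func():
    """Return the expression string with the column names baked in."""
    return call_expr("myFunc", MYFUNC_COLUMNS)


# With a real RDataFrame `rdf`, the call site then needs no explicit arguments:
#   rdf = rdf.Define("newvar", my_func())
print(my_func())
```

The C++ function still receives each column as an argument for every event, so nothing changes in how RDataFrame runs the loop; the long argument list is simply written once in Python instead of at every Define or Filter call.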