Reviving: Access RDataFrame column in function without passing argument

lcorcodilos · October 2, 2020, 10:31pm

Dear experts,

I’m reviving a thread [1] from about a year ago about accessing RDataFrame column values inside a function without passing the branch as an argument. My goals are roughly the same, but I think I’ve learned enough since then to state my question more clearly.

To summarize the objective of the original post, I work with the NanoAOD format on CMS and in particular, I’m working on a more generic tool to handle NanoAOD with RDataFrame that tracks the actions on the RDF without being a full proxy/wrapper to RDataFrame.

I also plan on providing common C++ functions that can handle standard algorithms for analyzers (scale factor look-ups, generator particle matching, etc). The idea is to have a standard library of scripts so analyzers aren’t all reproducing the same (coding) work. This means that there are a lot of branch names that are always predictable as well as functions that will (almost) always take the same input NanoAOD branches.

Taking the example of matching generator particles to a jet, this requires a function with more than 10 arguments if I have to pass in all of the needed branch names. This puts the onus on the end user to write out the full C++ function call with all 10+ arguments (in order) for their Define or Filter call in python. Since the point of python (and the tool I’m building) is to make life simpler, this is undesirable.

What I would like to do is something like the following (simple example):

rdf = ROOT.RDataFrame(...)
ROOT.gInterpreter.Declare('#include "custom.cc"')
rdf = rdf.Define('myVar', 'myFunc()')

where myFunc() is defined in a custom.cc. The custom.cc file is:

float myFunc() {
    return FatJet_pt[0];
}

where FatJet_pt is an RVec<float> in the RDataFrame. The question is: how can I declare FatJet_pt before myFunc() so that the code compiles and can also access the value in the RDataFrame? It’s fine to use FatJet_pt in the argument to Define, so it must be booked somewhere in memory. How can I point to it inside of myFunc() so that the value updates once RDataFrame moves on to the next row/event?

I’ve tried doing the following before passing custom.cc to gInterpreter:

for cname in BaseDataFrame.GetColumnNames():
    ROOT.gInterpreter.Declare('%s %s;' % (BaseDataFrame.GetColumnType(cname), cname))

With these declarations custom.cc compiles, but the job eventually segfaults. I also tried a variation where I prepend extern to each declaration, but I get linking errors (this was a shot in the dark based on some skimming of StackOverflow). I’ve also added these declarations (with and without extern) to a columns.h and included it in custom.cc, with similar results.

Any input would be greatly appreciated.

Thanks!
Lucas

[1] - Access RDataFrame column in function without passing argument


ROOT Version: v6.20.04
Platform: Linux(Ubuntu)
Compiler: cling

eguiraud · October 5, 2020, 8:51am

Hi Lucas,

> how can I define FatJet_pt before myFunc() so that it compiles but so that it can also access the value in the RDataFrame?
>
> How can I point to it inside of myFunc() so that the value updates once RDataFrame moves onto the next row/event?

That’s just not possible, I’m afraid. You would need RDF to expose pointers to the column values that are automatically updated during the event loop – those are internals that RDF does not expose by design.

I can think of two alternative approaches.

1. You could generate the correct invocations for users on the fly:

rdf.Define('myVar', Lcorcoframework.MakeInvMass("muon"))

where MakeInvMass("muon") returns a string like InvMass(muon_pt, muon_eta, muon_phi).

2. You could provide helpers that take and return dataframes:

rdf = Lcorcoframework.AddInvMass("muon", rdf)

where AddInvMass does something like return rdf.Define("muon_invmass", "InvMass(muon_pt, muon_eta, muon_phi)").

Examples are imprecise but I hope you get what I mean with them.
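
To make the two suggestions above a bit more concrete, here is a minimal Python sketch. The names make_inv_mass and add_inv_mass are hypothetical stand-ins for the MakeInvMass and AddInvMass helpers mentioned above, and InvMass is assumed to be a C++ function that has already been declared to the interpreter and takes the pt, eta and phi columns. The only RDataFrame feature relied on is that Define accepts a column name and a C++ expression string.

```python
def make_inv_mass(prefix):
    """Return the C++ call string for a hypothetical InvMass helper.

    The column names are assumed to follow the '<prefix>_pt/eta/phi' pattern.
    """
    columns = ["%s_%s" % (prefix, suffix) for suffix in ("pt", "eta", "phi")]
    return "InvMass(%s)" % ", ".join(columns)


def add_inv_mass(prefix, rdf):
    """Book a '<prefix>_invmass' column on rdf and return the resulting node.

    rdf can be any object exposing Define(name, expression), such as an
    RDataFrame node.
    """
    return rdf.Define("%s_invmass" % prefix, make_inv_mass(prefix))
```

With the first helper the user still writes the Define call, for example rdf.Define('myVar', make_inv_mass('muon')); with the second, the column name and the expression are both chosen by the framework.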

By the way, you might also be interested in bamboo, a pythonic framework for analyzing NanoAODs based on RDataFrame.

Cheers,
Enrico

lcorcodilos · October 6, 2020, 3:15am

Hi Enrico,

Thanks for the reply.

> That’s just not possible, I’m afraid. You would need RDF to expose pointers to the column values that are automatically updated during the event loop – those are internals that RDF does not expose by design.

Okay, so that settles the question for me (I had been wondering for the past year while making workarounds, but never got around to asking again with clearer wording).

I’m partial to the first suggestion over the second since I prefer to have the end user interface with the RDF in as direct a way as possible (fewer black boxes and more responsibility on the user). The first suggestion is something I’ve already set up, although with a slightly different method. I name the C++ function arguments in the definition after the branches that the function needs. Carrying through my example, it would be:

float myFunc(RVec<float> FatJet_pt) {
    return FatJet_pt[0];
}

Then I use clang in python to parse the C++, grab the function definition, and use the argument names in the definition as the arguments in the call to the function (provided those argument names are actually branch names, of course).
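
As a rough illustration of that idea, the sketch below builds the call string from a C++ signature. It deliberately swaps the clang-based parsing for a simple regular expression, so it is only a stand-in: it assumes every parameter is named, that no parameter type contains a comma or parentheses, and that the first textual match of the function name is the definition. The function name build_call and its arguments are hypothetical.

```python
import re


def build_call(cpp_source, func_name, available_columns):
    """Build 'func(arg1, arg2, ...)' from the parameter names in a C++ signature.

    Raises ValueError if the function is not found or if a parameter name
    is not among the available column names.
    """
    match = re.search(r"\b%s\s*\(([^)]*)\)" % re.escape(func_name), cpp_source)
    if match is None:
        raise ValueError("function %r not found" % func_name)

    # Split the parameter list and keep the last identifier of each
    # declaration, which is the parameter name.
    params = [p.strip() for p in match.group(1).split(",") if p.strip()]
    names = [re.search(r"(\w+)\s*$", p).group(1) for p in params]

    missing = [n for n in names if n not in available_columns]
    if missing:
        raise ValueError("not column names: %s" % ", ".join(missing))

    return "%s(%s)" % (func_name, ", ".join(names))
```

For the myFunc definition above and a column list containing FatJet_pt, this returns the string myFunc(FatJet_pt), which can then be passed to Define or Filter.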

Thanks for sending bamboo. At first glance, there are some things I like and some that I don’t, but I’ll definitely be taking a closer look at it to compare against my own framework (TIMBER, for the record, which I find sort of funny relative to “bamboo”). It certainly has more mature documentation!

Thanks,
Lucas

system · October 20, 2020, 3:16am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.