I am enjoying using RDataFrame from Python, but I am trying to get my head around a particular problem. When processing a number of files, I need to calculate a normalisation factor that differs between samples. My naive approach is to calculate this once and store it in a Python dictionary, totalEventsWeighted = {"dataset identifier": floating value, …}
Ideally I would then create a total-weight column with Define that combines a number of columns and also multiplies in the value from this map, using the dataset identifier column of the event being processed to look the number up in my dictionary.
My issue is that I do not know the correct way to handle this. It seems that I cannot use this Python dictionary directly in a Define expression, and I think I probably need to write a custom C++ function instead, but I have no idea how the Python dictionary should be represented on the C++ side.
A. Build a C++ std::unordered_map from a C++ helper function instead of a Python dictionary. You can use C++ objects in RDataFrame code.
import ROOT

ROOT.gInterpreter.Declare("""
#include <unordered_map>

auto SomeFunction() {
    // you'll have to put your logic here
    return std::unordered_map<ULong64_t, int>{{0, 42}, {1, 8}};
}

// global map, filled once when this code is declared
std::unordered_map<ULong64_t, int> myMap = SomeFunction();
""")

# look up the value for each entry number (rdfentry_) in the C++ map
ROOT.RDataFrame(2).Define("colId", "myMap.at(rdfentry_)")\
    .Display(["colId"])\
    .Print()
# prints
# +-----+-------+
# | Row | colId |
# +-----+-------+
# | 0 | 42 |
# +-----+-------+
# | 1 | 8 |
# +-----+-------+
B. With a recent-enough ROOT version, you can use numba to jit a Python function that retrieves the values from the Python dictionary. The numba-jitted function can be used in RDataFrame:
Thanks, this is very helpful. I have been working through this, and I think the first approach is the best one for me. I have a C++ function that I can load, for instance with ROOT.gInterpreter.ProcessLine(…), and then call from PyROOT as ROOT.CalculateNormalisation(…), and I have checked that it works.
The issue I have is that I would like to pass a list of files to this function first (the list may vary between runs). However, from what I can see, the code passed to Declare is treated as static, so I cannot pass in a variable. Is there any way to create an object through the Python interface that would be visible on the C++ side, the way myMap becomes usable? I guess some kind of decorator like the Numba one would work (except that this will not work in my case, as I need to use std::string in my map as well).
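One possible way around the "static" Declare string is to build that string in Python at run time, since Declare only ever sees the final text. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names make_map_declaration and declare_weights, the variable name normMap and the sample values are all made up for illustration, and the keys are assumed to be plain ASCII strings with finite floating-point values.

```python
import json


def make_map_declaration(weights, var_name="normMap"):
    """Return C++ source defining a global std::unordered_map<std::string, double>.

    `weights` is a plain Python dict {dataset identifier: float}.
    `var_name` is a hypothetical name chosen for this sketch.
    """
    # json.dumps gives a double-quoted, escaped string literal, which is also a
    # valid C++ literal for ASCII keys; repr(float) keeps full precision.
    # Assumption: values are finite (inf/nan would not be valid C++ literals).
    entries = ", ".join(
        "{%s, %s}" % (json.dumps(key), repr(float(value)))
        for key, value in weights.items()
    )
    return (
        "#include <string>\n"
        "#include <unordered_map>\n"
        "const std::unordered_map<std::string, double> %s = {%s};\n"
        % (var_name, entries)
    )


def declare_weights(weights, var_name="normMap"):
    """Hand the generated source to ROOT's interpreter (needs PyROOT)."""
    import ROOT  # imported lazily so the string builder runs without ROOT

    return ROOT.gInterpreter.Declare(make_map_declaration(weights, var_name))


# Made-up values standing in for the dictionary computed from the file list.
totalEventsWeighted = {"sampleA": 1250.5, "sampleB": 980.0}
cpp_source = make_map_declaration(totalEventsWeighted)
print(cpp_source)
```

After calling declare_weights(totalEventsWeighted), the map should be usable in a Define expression in the same way as myMap above, for example "genWeight / normMap.at(datasetId)", where genWeight and datasetId are hypothetical column names and datasetId is assumed to be a std::string column. A name can only be declared once per session, so a second map with different values would need a different var_name.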
Thanks a lot for this follow-up. It is extremely useful, and I now have something workable. It feels like it might be very slow in my implementation, but that will be something to follow up on later.