Using a python dictionary in RDataFrame Define using a column as the key

iconnell · November 23, 2021, 11:34am

Hi

I am enjoying using RDataFrame from Python, but I am trying to get my head around a particular problem. When processing a number of files, I need to calculate a normalisation factor that differs between samples. My naive approach is to calculate this once and store it in a Python dictionary, totalEventsWeighted = {"dataset identifier": floating value, …}

Ideally I would then create a total weight variable with Define that combines a number of columns and also multiplies in the value from this map, using the dataset identifier column of the event being processed to look the number up in my dictionary.

My issue is that I do not know the correct way to handle this. It seems that I cannot use the Python dictionary directly in a Define expression, and I think I probably need to write a custom C++ function instead, but I have no idea how the Python dictionary should be represented on the C++ side.

Would someone be able to help out?
Thanks, Ian

eguiraud · November 23, 2021, 12:51pm

Hi @iconnell ,
two solutions I can think of:

A. Build a C++ std::unordered_map from a C++ helper function instead of a Python dictionary. You can use C++ objects in RDataFrame code.

import ROOT

ROOT.gInterpreter.Declare("""
    auto SomeFunction() {
        // you'll have to put your logic here
        return std::unordered_map<ULong64_t, int>{{0, 42}, {1, 8}};
    }

    std::unordered_map<ULong64_t, int> myMap = SomeFunction();
""")

ROOT.RDataFrame(2).Define("colId", "myMap.at(rdfentry_)")\
                  .Display(["colId"])\
                  .Print()

# prints
# +-----+-------+
# | Row | colId |
# +-----+-------+
# | 0   | 42    |
# +-----+-------+
# | 1   | 8     |
# +-----+-------+

B. With a recent-enough ROOT version, you can use numba to jit a Python function that retrieves the values from the Python dictionary. The numba-jitted function can be used in RDataFrame:

import ROOT
from typing import Dict


@ROOT.Numba.Declare(["int"], "int")
def getDictionaryValue(col_id: int):
    dictionary : Dict[int, int] = {0: 42, 1: 8}
    return dictionary[col_id]

ROOT.RDataFrame(2).Define("colId", "Numba::getDictionaryValue(rdfentry_)")\
                  .Display(["colId"])\
                  .Print()

# prints:
# +-----+-------+
# | Row | colId |
# +-----+-------+
# | 0   | 42    |
# +-----+-------+
# | 1   | 8     |
# +-----+-------+

I hope this helps!
Enrico

iconnell · November 23, 2021, 2:30pm

Hi Enrico

Thanks, this is very helpful. I have been working through this, and I think the best approach is the first one. I have a C++ function that I can load, for instance with ROOT.gInterpreter.ProcessLine(…), and then call from PyROOT as ROOT.CalculateNormalisation(…), and I have checked that it works.

The issue I have is that I would like to pass a list of files to this function first (the list may vary between runs). However, from what I can see, the code passed to Declare is treated as static, so I cannot pass in a variable. Is there any way to create an object through the Python interface that is visible on the C++ side, in the way that myMap becomes usable? I guess some kind of decorator like @ROOT.Numba.Declare would work, except that it will not work in my case because I also need to use std::string in my map.

Thanks, Ian

eguiraud · November 23, 2021, 2:53pm

Something like this should be flexible enough:

import ROOT

ROOT.gInterpreter.Declare("""
   auto &GetTheMap() {
      static std::unordered_map<int, int> theGlobalMap;
      return theGlobalMap;
   }
""")

m = ROOT.GetTheMap()
m[0] = 42
m[1] = 8

ROOT.RDataFrame(2)\
    .Define("colId", "GetTheMap()[rdfentry_]")\
    .Display()\
    .Print()

Up to you whether you fill the map from Python or from another C++ function then. You can define a C++ function that fills the map as e.g.

ROOT.gInterpreter.Declare("""
   void FillTheMap(const std::vector<std::string> &files) {
      auto &m = GetTheMap();
      // fill m
   }
""")

and you can call it from Python as ROOT.FillTheMap(["f1", "f2"]).
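As a rough illustration of what the "fill m" step might do when the map is filled from Python, the sketch below sums weighted event counts per dataset identifier and copies them into a map-like object. The function names, the sample names and the (dataset identifier, sum of weights) input format are assumptions made for illustration; reading those numbers from the actual files is not shown.

```python
from collections import defaultdict


def sum_weights_per_sample(per_file_totals):
    """Add up the weighted event counts of all files sharing a dataset identifier."""
    totals = defaultdict(float)
    for dataset_id, weighted_events in per_file_totals:
        totals[dataset_id] += weighted_events
    return dict(totals)


def fill_map(target, totals):
    """Copy the totals into any object that supports item assignment."""
    for dataset_id, total in totals.items():
        target[dataset_id] = total
    return target


# Hypothetical input: one (dataset identifier, sum of weights) pair per file.
per_file_totals = [("sampleA", 100.0), ("sampleA", 50.0), ("sampleB", 20.0)]
totals = sum_weights_per_sample(per_file_totals)
print(totals)  # {'sampleA': 150.0, 'sampleB': 20.0}

# A plain dict stands in for the C++ map here.
filled = fill_map({}, totals)
```

With ROOT, and assuming theGlobalMap were declared as std::unordered_map<std::string, double> rather than std::unordered_map<int, int>, the same helper could presumably be called as fill_map(ROOT.GetTheMap(), totals), since item assignment on the returned reference is what the m[0] = 42 lines above rely on.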

Cheers,
Enrico

iconnell · November 23, 2021, 3:59pm

Hi Enrico

Thanks a lot for this follow-up. It is extremely useful, and I now have something workable. It feels like it might be very slow in my implementation, but that will be something to follow up on later.

Thanks, Ian
