Using a Python dictionary in an RDataFrame Define with a column as the key

Hi

I am enjoying using RDataFrame from Python, but I am trying to get my head around a particular problem. When processing a number of files, I need to calculate a normalisation factor which differs between samples. My naive approach is to calculate this once and store it in a Python dictionary, totalEventsWeighted = {"dataset identifier" : floating value, … }

Ideally I would then want to create a total weight variable with Define where I combine a number of columns, but also multiply in the value from this map, where I use the dataset identifier column from the event being processed to pull this number from my dictionary.

My issue is that I do not know the correct way to handle this. It seems that I cannot use this python dictionary directly in a Define function, and I think I probably need to write a custom C++ function to use, but I have no idea how the python dictionary should be represented.

Would someone be able to help out?
Thanks, Ian

Hi @iconnell ,
two solutions I can think of:

A. Build a C++ std::unordered_map from a C++ helper function instead of a Python dictionary. You can use C++ objects in RDataFrame code.

import ROOT

ROOT.gInterpreter.Declare("""
    auto SomeFunction() {
        // you'll have to put your logic here
        return std::unordered_map<ULong64_t, int>{{0, 42}, {1, 8}};
    }

    std::unordered_map<ULong64_t, int> myMap = SomeFunction();
""")

ROOT.RDataFrame(2).Define("colId", "myMap.at(rdfentry_)")\
                  .Display(["colId"])\
                  .Print()

# prints
# +-----+-------+
# | Row | colId |
# +-----+-------+
# | 0   | 42    |
# +-----+-------+
# | 1   | 8     |
# +-----+-------+
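For the case in the question (string dataset identifiers mapped to floating-point weights), the text passed to Declare is an ordinary Python string, so it can be generated from an existing dictionary at run time. The following is only a sketch under assumptions: the helper make_map_declaration, the map name totalEventsWeightedMap, the sample names and the column name datasetId are all hypothetical, and the identifiers are assumed to be plain ASCII strings with finite weights.

```python
import json


def make_map_declaration(name, weights):
    """Return C++ source declaring a global std::unordered_map<std::string, double>.

    Hypothetical helper: assumes plain ASCII keys and finite float values.
    """
    entries = ", ".join(
        # json.dumps gives a double-quoted, escaped literal for simple strings
        "{" + json.dumps(str(key)) + ", " + repr(float(value)) + "}"
        for key, value in weights.items()
    )
    return (
        "#include <string>\n"
        "#include <unordered_map>\n"
        "const std::unordered_map<std::string, double> " + name + "{" + entries + "};\n"
    )


# Placeholder values standing in for the real normalisation factors
totalEventsWeighted = {"sampleA": 1234.5, "sampleB": 987.0}
declaration = make_map_declaration("totalEventsWeightedMap", totalEventsWeighted)

try:
    import ROOT
except ImportError:
    ROOT = None  # the string generation above does not need ROOT

if ROOT is not None:
    ROOT.gInterpreter.Declare(declaration)
    # The map is then visible in RDataFrame expressions, e.g. (assuming a
    # std::string column called datasetId):
    #   df.Define("norm", "totalEventsWeightedMap.at(datasetId)")
```

Using .at() in the Define expression throws for a missing identifier rather than silently inserting a default entry, which is usually what you want for a lookup table.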

B. With a recent-enough ROOT version, you can use numba to jit a Python function that retrieves the values from the Python dictionary. The numba-jitted function can be used in RDataFrame:

import ROOT
from typing import Dict


@ROOT.Numba.Declare(["int"], "int")
def getDictionaryValue(col_id: int):
    dictionary : Dict[int, int] = {0: 42, 1: 8}
    return dictionary[col_id]

ROOT.RDataFrame(2).Define("colId", "Numba::getDictionaryValue(rdfentry_)")\
                  .Display(["colId"])\
                  .Print()

# prints:
# +-----+-------+
# | Row | colId |
# +-----+-------+
# | 0   | 42    |
# +-----+-------+
# | 1   | 8     |
# +-----+-------+

I hope this helps!
Enrico

Hi Enrico

Thanks, this is very helpful. I have been working through this, and I think the best approach is the first one. I have a C++ function which I can call, for instance with ROOT.gInterpreter.ProcessLine(…), and I can then call it from PyROOT as ROOT.CalculateNormalisation(…); I have checked that it works.

The issue I have is that I would like to be able to pass a list of files to this function first (the list may vary between runs). However, from what I can see, the code passed to Declare is treated as static, so I cannot pass in a variable? Is there any way I can create an object through the Python interface which would be visible on the C++ side, in the way that myMap becomes usable? I guess some kind of decorator like the @ROOT.Numba.Declare one might work (except that it will not work in my case, as I also need to use std::string in my map).

Thanks, Ian

Something like this should be flexible enough:

import ROOT

ROOT.gInterpreter.Declare("""
   auto &GetTheMap() {
      static std::unordered_map<int, int> theGlobalMap;
      return theGlobalMap;
   }
""")

m = ROOT.GetTheMap()
m[0] = 42
m[1] = 8

ROOT.RDataFrame(2)\
    .Define("colId", "GetTheMap()[rdfentry_]")\
    .Display()\
    .Print()

Up to you whether you fill the map from Python or from another C++ function then. You can define a C++ function that fills the map as e.g.

ROOT.gInterpreter.Declare("""
   void FillTheMap(const std::vector<std::string> &files) {
      auto &m = GetTheMap();
      // fill m
   }
""")

and you can call it from python as ROOT.FillTheMap(["f1", "f2"]).
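Filling the map from Python with string keys and floating-point values, as in the original question, could look roughly like the sketch below. It is not a definitive implementation: fill_weight_map, GetTheWeightMap, the sample names and the column name datasetId are hypothetical, and it assumes that PyROOT converts Python str and float to std::string and double when assigning map entries, in the same way the integer example above assigns m[0] = 42.

```python
def fill_weight_map(cpp_map, weights):
    """Copy {dataset identifier: weight} into any map-like object (hypothetical helper)."""
    for dataset_id, weight in weights.items():
        cpp_map[str(dataset_id)] = float(weight)
    return cpp_map


# Placeholder values standing in for the real normalisation factors
totalEventsWeighted = {"sampleA": 1234.5, "sampleB": 987.0}

# A plain dict works as a stand-in, so the helper can be checked without ROOT
stand_in = fill_weight_map({}, totalEventsWeighted)

try:
    import ROOT
except ImportError:
    ROOT = None

if ROOT is not None:
    ROOT.gInterpreter.Declare("""
    #include <string>
    #include <unordered_map>
    std::unordered_map<std::string, double> &GetTheWeightMap() {
        static std::unordered_map<std::string, double> theWeightMap;
        return theWeightMap;
    }
    """)
    fill_weight_map(ROOT.GetTheWeightMap(), totalEventsWeighted)
    # Lookup inside RDataFrame, assuming a std::string column called datasetId:
    #   df.Define("norm", "GetTheWeightMap().at(datasetId)")
```

The lookup uses .at() rather than operator[] because operator[] inserts a default entry for a missing key and therefore modifies the map, which is worth avoiding if the event loop runs on several threads.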

Cheers,
Enrico

Hi Enrico

Thanks a lot for this follow-up. It is extremely useful, and I now have something workable. It feels like it might be very slow in my implementation, but that will be something to follow up on later.

Thanks, Ian
