Vectorise operation of defining new variables in RDataFrames

Dear experts, following this thread (https://root-forum.cern.ch/t/make-rdataframes-interoperable-with-other-python-tools), I was wondering whether the snippet below

import ROOT
import numpy as np

def add_to_df(df):
    prediction = np.arange(10, dtype=np.float64)  # np.float was removed in NumPy 1.24

    @ROOT.Numba.Declare(["int"], "float")
    def get_prediction(index):
        return prediction[index]

    df = df.Define("x", "Numba::get_prediction(rdfentry_)")
    return df

df = ROOT.RDataFrame(10)
df = add_to_df(df)
print(df.AsNumpy())

could be vectorised in some way. That is, when I have to define several variables (like "x" in the above script), can I add them all in one step without looping over them, avoiding something of this kind:

df = ROOT.RDataFrame(10)
for prediction in predictions:
    df = add_to_df(df)

(I’m aware the above script won’t work; it is just a draft to explain myself better.)
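
For reference, a version of that loop that should actually run could look like the sketch below. It still books one Define per column, so it does not vectorise anything. The helper names (getter_name, define_expression, add_column, add_columns, demo) and the column names x0, x1, ... are made up for illustration. It also assumes that ROOT.Numba.Declare accepts a name= argument to set the C++ function name, which should be checked against the installed ROOT version.

```python
import numpy as np

try:
    import ROOT  # PyROOT; ROOT.Numba.Declare also needs the numba package
except ImportError:
    ROOT = None  # lets the pure-Python helpers below be imported without ROOT


def getter_name(column):
    """Name of the jitted C++ function serving `column` (hypothetical scheme)."""
    return f"get_{column}"


def define_expression(column):
    """Expression string passed to Define for `column`."""
    return f"Numba::{getter_name(column)}(rdfentry_)"


def add_column(df, column, values):
    """Book a single Define that reads entry i of `values` for event i."""
    values = np.asarray(values, dtype=np.float64)

    # Assumption: Declare accepts `name=` to set the C++ function name.
    # Without a distinct name per column, every getter would be registered
    # under the same name.
    @ROOT.Numba.Declare(["int"], "float", name=getter_name(column))
    def getter(index):
        return values[index]  # `values` is captured per call of add_column

    return df.Define(column, define_expression(column))


def add_columns(df, predictions):
    """Book one Define per entry of `predictions` (dict: column name -> array)."""
    for column, values in predictions.items():
        df = add_column(df, column, values)
    return df


def demo():
    """Needs ROOT and numba; not called automatically."""
    predictions = {f"x{i}": np.arange(10) * (i + 1) for i in range(3)}
    df = add_columns(ROOT.RDataFrame(10), predictions)
    print(df.AsNumpy(list(predictions)))
```

Calling demo() in an environment with ROOT and numba should print the three new columns. Each call to add_column creates its own closure, so every jitted getter captures its own array.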

Thanks a lot!
Davide


ROOT Version: 6.30 and above
Platform: Not Provided
Compiler: Not Provided


Hi Davide,

Thanks for the interesting post.

I am perhaps a bit confused, sorry about that: what would be the advantage of vectorisation when setting up the computation graph, i.e. when using the nice function you wrote to add Defines?
Are you perhaps looking for a syntax more elegant than the Python for loop?

Cheers,
Danilo

Dear Danilo,

Thanks for your swift reply, and sorry for my delayed one: I was out of office last week. The reason to avoid a for loop of Define calls is that adding many new variables (around 20 or more) this way is extremely slow. So I was wondering whether the arguments of Define could be vectorised, so that the new variables are added without looping over them, to speed up the computation. Of course, any other solution that avoids looping over the variables, even without vectorisation, would work as well.

I will provide a reproducible snippet to measure the speed of the for loop.

Cheers,
Davide

Hi Davide,

Thanks.

About the for loop of Define calls being extremely slow: I am not sure I get this. Defining a new column should be very fast, as it is just a matter of booking an operation in the computation graph; nothing is computed until the event loop runs. Can you provide evidence of such a slowdown? I don’t exclude it’s there, but it would be really unexpected (and unwanted).
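
One way to collect that evidence is to time the booking of the Defines separately from the event loop. Below is a minimal sketch, assuming df is any RDataFrame node and expressions is a hypothetical dict mapping new column names to expression strings. The helpers timed and profile_defines are illustrative names, not part of ROOT.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated wall-clock seconds


@contextmanager
def timed(label):
    """Accumulate the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


def profile_defines(df, expressions):
    """Time the booking of the Defines separately from the event loop.

    `df` is any RDataFrame node and `expressions` maps new column names to
    expression strings; both stand in for the real workflow.
    """
    with timed("book Defines"):
        for name, expr in expressions.items():
            df = df.Define(name, expr)
    with timed("event loop"):
        # AsNumpy is an instant action by default, so the event loop runs here
        columns = df.AsNumpy(list(expressions))
    return columns
```

After a call such as profile_defines(ROOT.RDataFrame(10), {"y": "rdfentry_ * 2."}), the timings dict holds one entry per phase. If "book Defines" is small compared to "event loop", the loop over Define is not the bottleneck.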

Cheers,
D

I’m sorry for having raised the problem without timing all the steps precisely: the Define was not the slowest operation. I was able to speed up, where possible, the other parts of the workflow that were slowing down the addition of new variables.


Thanks a lot for sharing the solution with the Community! Hopefully this will be of help for others.