Vectorise operation of defining new variables in RDataFrames

Davide_Lancierini · August 14, 2024, 2:33pm

Dear experts, following this thread: https://root-forum.cern.ch/t/make-rdataframes-interoperable-with-other-python-tools, I was wondering whether the snippet below

import ROOT
import numpy as np

def add_to_df(df):
    # np.float was removed in NumPy 1.24; use an explicit float64 dtype instead
    prediction = np.arange(10, dtype=np.float64)

    @ROOT.Numba.Declare(["int"], "float")
    def get_prediction(index):
        return prediction[index]

    df = df.Define("x", "Numba::get_prediction(rdfentry_)")
    return df

df = ROOT.RDataFrame(10)
df = add_to_df(df)
print(df.AsNumpy())

could be vectorised in some way, i.e. whether, when I have to define several variables ("x" in the script above), I can add them all in one step without looping over them, avoiding something like:

df = ROOT.RDataFrame(10)
for prediction in predictions:
    df = add_to_df(df)

(I’m aware the script above won’t work as written; it is just a draft to explain myself better.)
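For reference, a working version of such a loop could be sketched as follows. This is only a sketch under assumptions: `predictions` is taken to be a dict mapping a column name to a 1D float64 NumPy array with one value per entry, the helper names `define_expression` and `add_predictions` are hypothetical, and the `name` argument of `ROOT.Numba.Declare` is assumed to be available in the ROOT version in use. It still books one Define per column, so it is not a vectorised Define.

```python
def define_expression(column):
    """Return the Define expression that calls the Numba wrapper for `column`."""
    return f"Numba::get_{column}(rdfentry_)"


def make_getter(values):
    """Bind one array per wrapper, so that it does not depend on a loop variable."""
    def getter(index):
        return values[index]
    return getter


def add_predictions(df, predictions):
    """Book one Define per (column name, array) pair of the `predictions` dict."""
    import ROOT  # imported lazily, so the helpers above can be used without ROOT

    for column, values in predictions.items():
        # Assumption: Declare accepts `name`, giving each wrapper a distinct
        # C++ name (Numba::get_<column>) instead of the Python function name.
        declare = ROOT.Numba.Declare(["int"], "float", name=f"get_{column}")
        declare(make_getter(values))
        df = df.Define(column, define_expression(column))
    return df
```

With ROOT and Numba installed, this would be used as `df = add_predictions(ROOT.RDataFrame(10), {"x": np.arange(10, dtype=np.float64)})`.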

Thanks a lot!
Davide

ROOT Version: 6.30 and above
Platform: Not Provided
Compiler: Not Provided

Danilo · August 15, 2024, 5:52am

Hi Davide,

Thanks for the interesting post.

I am perhaps a bit confused, sorry about that: what would be the advantage of vectorisation when setting up the computation graph, compared with using the nice function you wrote to add Defines?
Are you perhaps looking for a syntax more elegant than the for loop in Python?

Cheers,
Danilo

Davide_Lancierini · August 20, 2024, 9:09am

Dear Danilo,

Thanks for your swift reply, and sorry for my delayed one: I was out of office last week. The reason I want to avoid a for loop of Defines is that it becomes extremely slow when many new variables are added (around 20 or more). So I was wondering whether the arguments of Define could be vectorised, so that the new variables are added without looping over them and the computation is faster. Of course, any other solution that avoids looping over the variables, with or without vectorisation, would work as well.

I will provide a reproducible snippet to test the speed of the for loop.

Cheers,
Davide

Danilo · August 24, 2024, 6:12am

Hi Davide,

Thanks.

About the slowdown when adding many variables: I am not sure I get this. Defining a new column should be very fast; it is just a matter of booking a feature in the computation graph. Can you provide evidence of such a slowdown? I don’t exclude that it’s there, but it would be really unexpected (and unwanted).

Cheers,
D

Davide_Lancierini · September 5, 2024, 3:43pm

I’m sorry for having raised the problem without timing all the steps precisely: the Define was not the slowest operation. I’ve been able to speed up, where possible, the other parts of the workflow that were slowing down the addition of new variables.
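For anyone wanting to do the same kind of check, a minimal way to time each step separately could look like the sketch below. It uses only the Python standard library; `prepare_inputs` and `book_columns` are hypothetical stand-ins for the real workflow steps, not part of ROOT.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated wall-clock seconds


@contextmanager
def timed(label):
    """Accumulate the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


# Hypothetical stand-ins for the real steps of the workflow.
def prepare_inputs():
    return [float(i) for i in range(1000)]


def book_columns(values):
    return {"x": values}


with timed("prepare inputs"):
    values = prepare_inputs()

with timed("book columns"):
    columns = book_columns(values)

# Print the steps from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {seconds:.6f} s")
```

Note that with RDataFrame the event loop only runs when an action such as AsNumpy is triggered, so that call should get its own `timed` block, separate from the Define calls.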

Danilo · September 10, 2024, 6:51pm

Thanks a lot for sharing the solution with the community! Hopefully this will be of help to others.
