Passing Templated Functions to RDataFrame

lost_soul_519 · January 20, 2025, 6:38pm

Hello all,

How do you pass a templated or overloaded function to RDataFrame's Define/Filter?
The alternatives would be to use strings, or lambdas chosen according to the column type, neither of which is convenient.

template <typename T>
RVec<T> square(RVec<T> vec)
{
    RVec<T> squared_vec;
    for (auto v : vec)
    {
        squared_vec.push_back(v * v);
    }
    return squared_vec;
}

int main()
{
    ROOT::RDataFrame df(10);
    auto df_defines = df.Define("x", get_random_vector)
                        .Define("y", get_random_int_vector)
                        .Define("x_squared", square, {"x"})
                        .Define("y_squared", square, {"y"});

    df_defines.Display({"x", "y"})->Print();


    return 0;
}

RVecD get_random_vector()
{
    RVecD vec;
    for (int i = 0; i < 10; i++)
    {
        vec.push_back(gRandom->Gaus());
    }
    return vec;
}

RVecI get_random_int_vector()
{
    RVecI vec;
    for (int i = 0; i < 10; i++)
    {
        vec.push_back(gRandom->Integer(10));
    }
    return vec;
}

Danilo · January 20, 2025, 9:37pm

Hi,

the type of the column should appear explicitly:

    auto df_defines = df.Define("x", get_random_vector)
                        .Define("y", get_random_int_vector)
                        .Define("x_squared", square<double>, {"x"})
                        .Define("y_squared", square<int>, {"y"});

Some additional suggestions:

Pass collections such as RVec by const reference (e.g. myfunc(const RVec<T>& v)).
The VecOps helpers can greatly simplify operations on RVecs, e.g. sqrt(myRvec) returns an RVec of the square roots, and myRvec*myRvec returns an RVec of the squares.

I hope this helps!

Cheers,
D

lost_soul_519 · January 21, 2025, 12:13pm

Hi @Danilo ,
Thanks for the quick response.

Thanks for the tip about the explicit template argument (<>) and for the suggestions.

I have a non-trivial modification that has to be applied to each column; the above was just a reproducer.
Passing the type explicitly is not ideal in this situation, but I suppose it has to work.
Just checking whether you have a better recommendation.

for (const auto& col : column_list)
{
    // Assumes rdf is declared as ROOT::RDF::RNode so that it can be reassigned.
    const auto type = rdf.GetColumnType(col);
    if (type.find("double") != std::string::npos)
        rdf = rdf.Define(col + "_mod", square<double>, {col});
    else if (type.find("int") != std::string::npos)
        rdf = rdf.Define(col + "_mod", square<int>, {col});
    ...
}

This has to be done for each unique column type, and the list of types is long: it includes unsigned int and various other type modifiers. Do let me know if there is anything better to do here.

Thanks again.

Danilo · January 21, 2025, 1:19pm

Hi,

The column type is something that can be known only at run time, while the template type is something that has to be known at compile time: reconciling these two is always going to require some work.
One option could be to write all of your functions as overloads rather than templates, JIT-compile them with cling, and then invoke them by name in the Define. Depending on your functions, the performance penalty may be small to negligible:

myFunctions.h

RVec<int> foo(const RVec<int>& x) {...};
RVec<double> foo(const RVec<double>& x) {...};
...

...
gInterpreter->Declare("#include \"myFunctions.h\"");
for (const auto& col : column_list) {
   rdf = rdf.Define(col + "_mod", "foo("+col+")"); // <-- this invokes the right overload
}
...

I hope this helps.

Cheers,
D

lost_soul_519 · January 21, 2025, 2:03pm

Yes, I suppose jitting is the way to go for this case.
Thanks a lot for your help.