RDataFrame columns from a user-defined function that uses another column

Hi,

I am trying to fill an RDataFrame using a user-defined function that takes its input from a dataframe column and computes 3 different values. I want to Define those 3 values as 3 separate columns of the same dataframe, but I don’t want to call the function 3 times per row (once per column), since that is very inefficient:

// User-defined function
void MyFunct(Double_t x, Double_t &a, Double_t &b, Double_t &c /*, ... more output parameters than just a, b, c */) {
  // needs x from a dataframe column and computes a, b, c
}
-------------------------------------
void genKM15(){
   
    Double_t a, b,c ; 

    ROOT::RDataFrame df(1000);
    auto df_1 = df.Define("x", []() { return gRandom->Uniform(1., 4); })                    
                    .Define("a", [&](double x) { MyFunct(x, a, b, c); return a; }, {"x"})  
                    .Define("b", [&](double x) { MyFunct(x, a, b, c); return b; }, {"x"})  
                    .Define("c", [&](double x) { MyFunct(x, a, b, c); return c; }, {"x"});         
}

Is there a better way to do this with RDataFrame, where I call MyFunct just once per row and save the values of a, b, c in the dataframe?

Thank you!


ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi @lilina,

IIUC your question, and as far as I know, RDataFrame does not provide any means to define multiple columns at once.

That said, the immediate solution that comes to mind is to split the computation of a, b, and c into separate functions, each referenced from its own Define() call.
I am not aware of anything more straightforward, but I will leave @vpadulan to reply here as well, as he might have another idea.

Cheers,
J.

Hi,

Yes, at the moment the way to do this is:

df.Define("abc", [] { /* call MyFunct */ return std::tie(a, b, c); })
  .Define("a", [] (const std::tuple<double, double, double>& abc) { return std::get<0>(abc); }, {"abc"})

(repeat last line for “b” and “c”). Or instead of a std::tuple you can create a simple struct with appropriate data members.

There is a long-standing feature request for “multi-defines” but we do not have an ETA for it.

As an aside, note that using “global” output parameters like that in MyFunct makes it impossible to run a multi-thread event loop successfully: different threads would call MyFunct concurrently and mess up the writes to the “global” a, b and c variables. If you want parallelism, MyFunct should instead return a tuple or a struct rather than using output parameters.
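
A minimal sketch of that last suggestion, in plain C++ with no ROOT dependency: the function returns its results by value in a small struct, so every call works on its own data. The names ABC and ComputeABC are hypothetical, and the arithmetic is only a placeholder for whatever MyFunct actually computes.

```cpp
#include <cassert>

// Hypothetical result type: one data member per computed value.
struct ABC {
    double a = 0.;
    double b = 0.;
    double c = 0.;
};

// Hypothetical stand-in for MyFunct: results are returned by value,
// so no state is shared between concurrent calls.
ABC ComputeABC(double x)
{
    ABC out;
    out.a = x;       // placeholder computations
    out.b = 2. * x;
    out.c = x * x;
    return out;
}
```

With RDataFrame this could then be used as .Define("abc", [](double x) { return ComputeABC(x); }, {"x"}), followed by .Define("a", [](const ABC &v) { return v.a; }, {"abc"}) and likewise for b and c.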

Cheers,
Enrico

Many thanks for your reply. When I tried the first line:

auto df_1 = df.Define("x", []() { return gRandom->Uniform(1., 4); })                    
              .Define("abc", [&](double x) { MyFunct(x, a, b, c); return std::tie(a, b, c); }, {"x"}) ;

I get the following error:

In module 'ROOTDataFrame':
/home/lily/opt/root_master/root_install/include/ROOT/RDF/RInterface.hxx:394:14: error: no matching member function for call to 'DefineImpl'
return DefineImpl<F, RDFDetail::CustomColExtraArgs::None>(name, std::move(expression), columns, "Define");

Hi @lilina ,

could you please post the full code and the full error message? It’s hard to tell from just what you posted :grimacing:

Cheers,
Enrico

Sorry about that, here is the full code. I get the error when I use std::tie(ReH, ReE, ReHt)

#include "/media/lily/Data/GPDs/DVCS/GPD_Models/TGPDModels.h"

void genKM15(){

    Double_t M = 0.938272; //Mass of the proton in GeV
    const Int_t nkinPts = 10;   // Number of kinematic points
    Double_t ee;
    Double_t  ReEt, ImH, ImHt; // CFFs that will not be saved on the tree
    Double_t  ReE, ReH, ReHt;

    // Compute tmin 
    auto tmin = [&](double QQ, double xB) { 
        ee = 4. * M * M * xB * xB / QQ;
        double tmin_l = -QQ * ( 2 * ( 1 - xB ) * ( 1. - sqrt( 1. + ee ) ) + ee ) / ( 4 * xB * ( 1. - xB ) + ee ); 
        return tmin_l;        
    };

    ROOT::RDataFrame df(nkinPts);
    auto df_KM15 = df.Define("k", []() { return 5.75; })
                     .Define("QQ", []() { return gRandom->Uniform(1., 4); })
                     .Define("xB", []() { return gRandom->Uniform( 0.1, 0.7 ); })
                     .Define("t", [&](double QQ, double xB) { return gRandom->Uniform( -1, tmin(QQ, xB)); }, {"QQ", "xB"})   
                     .Define("cffs", [&](double xB, double t) { ModKM15_CFFs(xB, t, ReH, ImH, ReE, ReHt, ImHt, ReEt); return std::tie(ReH, ReE, ReHt); }, {"xB", "t"}) ;

}

Here is the full errors output:

lily@calero-PC:/media/lily/Data/GPDs/TMVA_ANN-global/CFFs_Model_FromKM15/KM15_NoErrors$ root -l genKM15.C 
root [0] 
Processing genKM15.C...
In module 'ROOTDataFrame':
/home/lily/opt/root_master/root_install/include/ROOT/RDF/RInterface.hxx:394:14: error: no matching member function for call to 'DefineImpl'
      return DefineImpl<F, RDFDetail::CustomColExtraArgs::None>(name, std::move(expression), columns, "Define");
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/media/lily/Data/GPDs/TMVA_ANN-global/CFFs_Model_FromKM15/KM15_NoErrors/genKM15.C:44:23: note: in instantiation of function template specialization 'ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void>::Define<(lambda at /media/lily/Data/GPDs/TMVA_ANN-global/CFFs_Model_FromKM15/KM15_NoErrors/genKM15.C:44:37), 0>' requested here
                     .Define("ReH", [&](double xB, double t) { ModKM15_CFFs(xB, t, ReH, ImH, ReE, ReHt, ImHt, ReEt); return std::tie(ReH, ReE, ReHt); }, {"xB", "t"}) ;
                      ^
/home/lily/opt/root_master/root_install/include/ROOT/RDF/RInterface.hxx:3244:4: note: candidate template ignored: requirement 'std::is_default_constructible<std::tuple<double &, double &, double &> >::value' was not satisfied [with F = (lambda at /media/lily/Data/GPDs/TMVA_ANN-global/CFFs_Model_FromKM15/KM15_NoErrors/genKM15.C:44:37), DefineType = ROOT::Detail::RDF::CustomColExtraArgs::None, RetType = std::tuple<double &, double &, double &>]
   DefineImpl(std::string_view name, F &&expression, const ColumnNames_t &columns, const std::string &where)
   ^
/home/lily/opt/root_master/root_install/include/ROOT/RDF/RInterface.hxx:3295:4: note: candidate function template not viable: requires 3 arguments, but 4 were provided
   DefineImpl(std::string_view, F, const ColumnNames_t &)
   ^
root [1] 

I don’t understand why using return std::tie(a, b, c) in the Define produces those errors. I have also tried changing MyFunct (i.e. ModKM15_CFFs) to return a tuple, as you suggested, which also makes it thread-safe. That works: I can call MyFunct just once per row and then access the tuple values. Here is the simplified working example:

#include "/media/lily/Data/GPDs/DVCS/GPD_Models/TGPDModels_tuple.h"
void genKM15(){

    ROOT::EnableImplicitMT();      
  
    ROOT::RDataFrame df(100);   
    auto df_KM15 = df.Define("x", []() { return gRandom->Uniform(1., 4); })                   
                    .Define("abc", [](double x) { return MyFunct(x); }, {"x"}) 
                    .Define("a", [] (const std::tuple<double, double, double, double, double, double>& abc) { return std::get<0>(abc); }, {"abc"});
   // repeating this for all the other parameters (a, b, ...)
}
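
To avoid spelling out the six-element tuple type in every extraction lambda, one possible sketch (plain C++14, no ROOT dependency) is a type alias plus a small helper that returns a lambda for element I. The names Values and GetElement are hypothetical, not part of ROOT.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Alias for the six-value tuple used in the example above.
using Values = std::tuple<double, double, double, double, double, double>;

// Returns a lambda with a concrete (non-template) signature
// that extracts element I from a Values tuple.
template <std::size_t I>
auto GetElement()
{
    return [](const Values &v) { return std::get<I>(v); };
}
```

The repeated lines would then read .Define("a", GetElement<0>(), {"abc"}), .Define("b", GetElement<1>(), {"abc"}) and so on.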

Thanks a lot for your help!

Hi @lilina ,

My fault, I should have suggested std::make_tuple rather than std::tie. As the error message says, std::tie creates a tuple of references (tuple<double &, ...>), which Define does not accept as a return type because it’s not default-constructible. std::make_tuple creates a tuple of values (tuple<double, ...>) and would work.
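
The difference can be checked with the standard library alone. The following is a small sketch with no ROOT dependency; the function name MakeTupleCopies is made up for illustration.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>

// A tuple of references cannot be default-constructed; a tuple of values can.
static_assert(!std::is_default_constructible<std::tuple<double &, double &, double &>>::value,
              "tuple of references is not default-constructible");
static_assert(std::is_default_constructible<std::tuple<double, double, double>>::value,
              "tuple of values is default-constructible");

// Shows the practical difference: the std::tie result aliases the original
// variables, while the std::make_tuple result is an independent copy.
bool MakeTupleCopies()
{
    double a = 1., b = 2., c = 3.;
    auto refs = std::tie(a, b, c);        // std::tuple<double &, double &, double &>
    auto vals = std::make_tuple(a, b, c); // std::tuple<double, double, double>
    static_assert(std::is_same<decltype(refs), std::tuple<double &, double &, double &>>::value,
                  "std::tie yields references");
    static_assert(std::is_same<decltype(vals), std::tuple<double, double, double>>::value,
                  "std::make_tuple yields values");
    a = 10.; // visible through refs, not through vals
    return std::get<0>(refs) == 10. && std::get<0>(vals) == 1.;
}
```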

I’m happy you found how to make it work.

Cheers,
Enrico

Thank you! I appreciate all your help :slight_smile: