C++ Functor in RDataFrame::Define(...), with default arguments

I bumped into what I think is an interesting question. I was trying to write a slightly more complicated functor, which I would then pass to RDataFrame::Define(...) to add a new column.

At first, I defined a number of operator()(...) execution operators on my functor, all with slightly different arguments and return values. This completely confused PyROOT. :slightly_frowning_face: I was getting some very hard-to-understand crashes along the lines of:

xAOD::Init                INFO    Environment initialised for data access
ATE::MuonCalibrator      INFO    Initializing the muon calibrator object
In module 'ROOTDataFrame':
/cvmfs/atlas.cern.ch/repo/sw/software/24.2/AnalysisBaseExternals/24.2.36/InstallArea/x86_64-el9-gcc13-opt/include/ROOT/RDF/RInterface.hxx:331:14: error: cannot compile this scalar expression yet
      return DefineImpl<F, RDFDetail::ExtraArgsForDefine::None>(name, std::move(expression), columns, "Define");
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *** Break *** segmentation violation
...
Traceback (most recent call last):
  File "/home/krasznaa/ATLAS/tools/atlas-tool-example/build/x86_64-el9-gcc13-dbg/bin/AnalysisDemo_rdf.py", line 22, in <module>
    muon_pt_xaod = df.Define('muon_pt_calib', muCalib, ['Muons'])
  File "/cvmfs/atlas.cern.ch/repo/sw/software/24.2/AnalysisBaseExternals/24.2.36/InstallArea/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_rdf_pyz.py", line 381, in _PyDefine
    rdf_node = _handle_cpp_callables(func, rdf._OriginalDefine, col_name, func, cols)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/24.2/AnalysisBaseExternals/24.2.36/InstallArea/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_rdf_pyz.py", line 282, in _handle_cpp_callables
    return original_template[type(func)](*args)
cppyy.ll.SegmentationViolation: Could not instantiate Define<ATE::MuonCalibrator>:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, ATE::MuonCalibrator expression, const vector<string>& columns = {}) =>
    SegmentationViolation: segfault in C++; program state was reset

While trying to simplify the issue, I reduced the type in question to a single execution operator, which looks like:

    std::vector<float> operator()(const xAOD::MuonContainer& muons,
                                  const CP::SystematicSet& syst = {}) const;

The second, optional argument is not primarily there for RDF. I just want to see whether a single type can be written so that it is directly usable both by RDF and by hand-written code. :thinking:

But when I try to use this version of the code, I get:

xAOD::Init                INFO    Environment initialised for data access
ATE::MuonCalibrator      INFO    Initializing the muon calibrator object
Traceback (most recent call last):
  File "/home/krasznaa/ATLAS/tools/atlas-tool-example/build/x86_64-el9-gcc13-dbg/bin/AnalysisDemo_rdf.py", line 22, in <module>
    muon_pt_xaod = df.Define('muon_pt_calib', muCalib, ['Muons'])
  File "/cvmfs/atlas.cern.ch/repo/sw/software/24.2/AnalysisBaseExternals/24.2.36/InstallArea/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_rdf_pyz.py", line 381, in _PyDefine
    rdf_node = _handle_cpp_callables(func, rdf._OriginalDefine, col_name, func, cols)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/24.2/AnalysisBaseExternals/24.2.36/InstallArea/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_rdf_pyz.py", line 282, in _handle_cpp_callables
    return original_template[type(func)](*args)
cppyy.gbl.std.runtime_error: Could not instantiate Define<ATE::MuonCalibrator>:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, ATE::MuonCalibrator expression, const vector<string>& columns = {}) =>
    runtime_error: 2 column names are required but 1 was provided: "Muons".
xAOD::TFileAccessTracer   INFO    Sending file access statistics to http://rucio-lb-prod.cern.ch:18762/traces/

Here the error is at least clear: the code does not let me omit the optional second argument of my functor. I now believe this may have had something to do with the crashes I observed earlier, since every operator in that version of the code also had an optional last argument. But with multiple operators to choose from, I guess the JIT code couldn’t figure out what to do. :thinking:
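A note on why the default value does not help: in C++ a default argument is not part of a function's type, so any machinery that infers the signature from the type of operator() sees two parameters, and presumably asks for two column names as a result. The minimal sketch below checks this with standard C++ only. Calibrator and its scale parameter are hypothetical stand-ins for the real tool, and nothing here is ROOT's actual implementation.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical functor with the same shape as the operator in the post:
// one mandatory argument plus one defaulted argument.
struct Calibrator {
    std::vector<float> operator()(const std::vector<float>& pts,
                                  float scale = 1.0f) const
    {
        std::vector<float> out;
        out.reserve(pts.size());
        for (float pt : pts) {
            out.push_back(pt * scale);
        }
        return out;
    }
};

// The type that signature inference would see for the call operator.
using CallOperatorType = decltype(&Calibrator::operator());

// The same signature spelled out by hand: two parameters, no default.
using TwoParameterType =
    std::vector<float> (Calibrator::*)(const std::vector<float>&, float) const;

// The default value leaves no trace in the type.
static_assert(std::is_same<CallOperatorType, TwoParameterType>::value,
              "the defaulted parameter is still part of the signature");
```

Calling calib(pts) with one argument still works in hand-written code, because there the compiler fills in the default at the call site.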

So, at the end of this very long story: Should it not be possible to use such a setup? With an operator that has one or more default arguments?

Cheers,
Attila

ROOT Version: 6.28/10
Platform: RHEL 9
Compiler: GCC 13


Experimenting a bit more, I found that the crash is not caused by the default argument(s). It appears as soon as I define multiple “execution operators” on a single functor class, along the lines of:

struct MyFunctor {
   float              operator()(int foo) const;
   std::vector<float> operator()(int foo, float bar) const;
};

This is certainly a pity. :frowning_face: I thought I had a good code design going here, but apparently not…

Dear @krasznaa ,

The Define operation indeed accepts a function, lambda or callable of any sort, including a functor class with operator(). The singular there is quite important, as the instantiation of the Define call needs to infer the signature of the callable that will be used, at compile time. So far practically that has meant that there can only be one signature of the callable specified to Define. We could think about extending this, but the first immediate obstacle in our way would be the fact that the information about the columns held in the dataset (and thus the column types) is only available at runtime. Practically, I don’t see an immediate way to perform the overload resolution while preserving the rest of the machinery. I hope this clarifies the situation a bit more.
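The compile-time limitation can be reproduced without ROOT. The sketch below is a simplified stand-in, not RDataFrame's real code: it assumes the signature is read from &F::operator(), which is only a valid expression when the class has exactly one non-template call operator. The names has_single_call_operator, void_type and SingleSignature are made up for this illustration.

```cpp
#include <type_traits>
#include <vector>

// Small helper equivalent to C++17's std::void_t, spelled out so the
// sketch also builds with older language standards.
template <typename... Ts>
struct make_void { using type = void; };
template <typename... Ts>
using void_type = typename make_void<Ts...>::type;

// Primary template: assume the call operator cannot be named uniquely.
template <typename F, typename = void>
struct has_single_call_operator : std::false_type {};

// Chosen only if &F::operator() is well-formed, which requires a single,
// non-overloaded, non-template operator().
template <typename F>
struct has_single_call_operator<F, void_type<decltype(&F::operator())>>
    : std::true_type {};

// One signature: the call operator's type can be inspected.
struct SingleSignature {
    float operator()(int) const { return 33.3f; }
};

// Two signatures, as in the functor from the thread: &MyFunctor::operator()
// names an overload set, so there is no single type to infer.
struct MyFunctor {
    float operator()(int) const { return 33.3f; }
    std::vector<float> operator()(int, float) const
    {
        return std::vector<float>{1.1f, 2.2f, 3.3f};
    }
};

static_assert(has_single_call_operator<SingleSignature>::value,
              "a single operator() can be inspected");
static_assert(!has_single_call_operator<MyFunctor>::value,
              "an overloaded operator() cannot");
```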

Cheers,
Vincenzo

Just as a clarification, nothing prevents the user from helping the compiler a bit, if the column types are known a priori (as they probably are in general). One could use a simple lambda wrapper (if the functor holds data that the function needs) or just use free function overloads and specify which overload should be used by each specific Define call. Here is an example.

#include <ROOT/RDataFrame.hxx>
#include <RtypesCore.h>
#include <vector>

struct MyFunctor
{
    float operator()(int foo) const
    {
        return 33.3f;
    }
    std::vector<float> operator()(int foo, float bar) const
    {
        return std::vector<float>{1.1f, 2.2f, 3.3f};
    }
};

float fun(int foo) { return 33.3f; }

std::vector<float> fun(int foo, float bar)
{
    return std::vector<float>{1.1f, 2.2f, 3.3f};
}

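int main()
{
    // Empty-source data frame with 5 entries.
    ROOT::RDataFrame df{5};

    // Build an int column "a" and a float column "b" from the entry number.
    ROOT::RDF::RNode df_withcols = df.Define("a", [](ULong64_t entry)
                                             { return static_cast<int>(entry); },
                                             {"rdfentry_"})
                                       .Define("b", [](ULong64_t entry)
                                               { return static_cast<float>(entry); },
                                               {"rdfentry_"});

    // Option 1: wrap the functor in lambdas. Each lambda has exactly one
    // signature, so no template arguments are needed. The functor is captured
    // by reference, so it must outlive the event loop.
    MyFunctor f{};
    df_withcols = df_withcols.Define("c", [&f](int a)
                                     { return f(a); },
                                     {"a"});
    df_withcols = df_withcols.Define("d", [&f](int a, float b)
                                     { return f(a, b); },
                                     {"a", "b"});

    // Option 2: free function overloads, selected through the explicit
    // template argument of Define.
    df_withcols = df_withcols.Define<float (*)(int)>("e", fun, {"a"});
    df_withcols = df_withcols.Define<std::vector<float> (*)(int, float)>("f", fun, {"a", "b"});

    df_withcols.Display<int, float, float, std::vector<float>, float, std::vector<float>>({"a", "b", "c", "d", "e", "f"})->Print();
}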
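The same lambda-wrapper idea should also cover the default-argument case from the original question: a wrapper that takes only the mandatory argument has a one-parameter signature, while the functor itself keeps its two-parameter one. The sketch below checks this with standard C++ only. Calibrator, param_count and make_unary_wrapper are hypothetical names, and the arity trait is a simplified stand-in for whatever RDataFrame does internally.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Counts the parameters of a const call operator from its type.
template <typename T>
struct param_count;  // primary template intentionally left undefined

template <typename C, typename R, typename... Args>
struct param_count<R (C::*)(Args...) const> {
    static constexpr std::size_t value = sizeof...(Args);
};

// Hypothetical functor: one mandatory argument, one defaulted argument.
struct Calibrator {
    std::vector<float> operator()(const std::vector<float>& pts,
                                  float scale = 1.0f) const
    {
        std::vector<float> out;
        out.reserve(pts.size());
        for (float pt : pts) {
            out.push_back(pt * scale);
        }
        return out;
    }
};

// Wrapper exposing only the mandatory argument; the default value is
// applied inside the lambda body. The calibrator is captured by reference,
// so it has to outlive every use of the wrapper.
inline auto make_unary_wrapper(const Calibrator& calib)
{
    return [&calib](const std::vector<float>& pts) { return calib(pts); };
}

using Wrapper =
    decltype(make_unary_wrapper(std::declval<const Calibrator&>()));

static_assert(param_count<decltype(&Calibrator::operator())>::value == 2,
              "the functor itself still reports two parameters");
static_assert(param_count<decltype(&Wrapper::operator())>::value == 1,
              "the wrapper reports a single parameter");
```

With RDataFrame, such a wrapper would be passed to Define together with a single column name, in the same way as the lambdas for columns "c" and "d" above.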

Cheers,
Vincenzo

Thanks for the explanation Vincenzo!

I can accept this as a boundary condition for user code. It’s a bit of a pity that RDF is not more magic than this, but what can one do. :stuck_out_tongue:

Cheers,
Attila

P.S. I’ll have another question, for another thread right away as well… :smile:

The only way I see to make that magic happen is to JIT the call to Define with the column types found at runtime. This would introduce a feature disparity specifically for the Define operation (and probably Filter as well at that point), so on the surface it doesn’t seem worth the effort. Also, the first option I showed above, with the lambda wrappers capturing the MyFunctor object by reference, doesn’t even need the user to specify the template arguments, so it looks like a decent compromise (a few more parentheses to write in exchange for the compile-time magic).
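The difference between the two options comes down to ordinary template argument deduction, which can be shown in plain C++. In the sketch below, apply_unary is a made-up stand-in for a function template that takes a callable, and fun is a hypothetical overloaded free function. An overload set needs an explicit template argument, while a lambda has one closure type and is deduced directly.

```cpp
// Two overloads sharing one name, like the free functions in the example.
constexpr int fun(int foo) { return foo + 1; }
constexpr int fun(int foo, int bar) { return foo * bar; }

// Stand-in for a function template that accepts any unary callable.
template <typename F>
constexpr int apply_unary(F f, int x)
{
    return f(x);
}

// apply_unary(fun, 2);  // does not compile: F cannot be deduced from an
//                       // overload set

// Naming the function pointer type explicitly selects the unary overload.
constexpr int via_explicit = apply_unary<int (*)(int)>(fun, 2);

// A lambda has a single closure type, so F is deduced without help; overload
// resolution for fun happens inside the lambda body.
inline int via_lambda()
{
    return apply_unary([](int x) { return fun(x); }, 2);
}

static_assert(via_explicit == 3, "the unary overload was selected");
```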

Cheers,
Vincenzo
