Problem calling RDataFrame::Define from PyROOT with plain function

I am having trouble defining a new column in an RDataFrame.

I have created a simple test file that contains a single branch of type std::vector<int>:

TFile* f = new TFile("test_file.root", "recreate");
TTree* t = new TTree("tree", "tree with some values");

auto* vec = new std::vector<int>();
vec->reserve(5);
t->Branch("vec", &vec);

for (int i = 0; i < 10; ++i) {
  vec->clear();
  for (int j = 0; j < 5; ++j) {
    vec->push_back((int) gRandom->Uniform(0, 10));
  }
  t->Fill();
}
t->Write();
f->Write();
f->Close();

When I then try to Define a new column in an RDataFrame using

import ROOT
ROOT.gInterpreter.ProcessLine('''
std::vector<int> double_index(ROOT::VecOps::RVec<int> ind) {
  std::vector<int> indices;
  indices.reserve(ind.size());
  for (const auto i : ind) {
    indices.push_back(i * 2);
  }
  return indices;
}''')

rdf = ROOT.RDataFrame('tree', 'test_file.root')
rdf.Define('double_idx', ROOT.double_index, ['vec'])

I get the following error

Traceback (most recent call last):
  File "rdf_new.py", line 14, in <module>
    rdf.Define('double_idx', ROOT.double_index, ['vec'])
TypeError: can not resolve method template call for 'Define'

The problem in this case does not seem to be that ['vec'] is not correctly identified as a std::vector<std::string>, but rather the double_index function. I have tried the workaround of passing the last argument as an actual vector of strings, as outlined in "Problem calling RDataFrame::Define with ColumnNames_t with python", and that leads to the same error.

I am not sure whether I am running into a bug here, or whether I have to call Define in a slightly different way. Any help is very much appreciated.



ROOT Version: 6.20.04
Platform: CentOS7
Compiler: gcc8.3.0


Hi,
the Python spelling is

rdf.Define('double_idx', 'double_index(vec)')

The reason is that PyROOT does not implicitly convert ROOT.double_index to a C++ callable type (not 100% sure why).
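If the function name and column list are only known at runtime, the expression string can be assembled programmatically. The following is a minimal sketch, not part of ROOT: build_call_expression and define_from_function_name are hypothetical helper names, and rdf is assumed to be any object exposing RDataFrame's Define(name, expression) method.

```python
def build_call_expression(func_name, columns):
    """Return a call expression string such as 'double_index(vec)'."""
    return "{}({})".format(func_name, ", ".join(columns))


def define_from_function_name(rdf, new_column, func_name, columns):
    """Define new_column via the string overload of Define.

    rdf is assumed to behave like an RDataFrame node; the function named
    func_name must already be declared to the interpreter.
    """
    return rdf.Define(new_column, build_call_expression(func_name, columns))
```

With the setup from the question, define_from_function_name(rdf, 'double_idx', 'double_index', ['vec']) would issue the same call as the one-liner above. Note that this only works when every column name is valid inside a C++ expression.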

Alternatively, as in the thread you linked to, you can define a functor type (a class with a call operator) and pass an instance of that class.

Cheers,
Enrico

Thanks for the quick reply.

The problem is that the usual Python spelling doesn’t work in my “real life” case, because the branch name contains a # character, which leads to the following error when it is used in the expression:

input_line_48:1:59: error: use of undeclared identifier 'vec'
namespace __rdf_0{ auto rdf_f = []() {return double_index(vec#0)
                                                          ^
Traceback (most recent call last):
  File "rdf_test.py", line 14, in <module>
    column = rdf.Define("double_idx", "double_index(vec#0)")
TypeError: can not resolve method template call for 'Define'

So that’s why I was hoping that it would be possible to use the overload that takes a vector of strings to specify the arguments to the functions.

Is there a way to make PyROOT explicitly aware of the fact that ROOT.double_index is a callable C++ type here?

For future reference: wrapping the whole call into a functor works fine, i.e.

struct IndexDoubler {
  std::vector<int> operator()(ROOT::VecOps::RVec<int> ind) { return double_index(ind); }
};

and then using it in Python via

# rest of the setup
doubler = ROOT.IndexDoubler()
# still necessary in v6.20.04
vec_string = ROOT.std.vector('string')()
vec_string.push_back('vec#0')

column = rdf.Define("double_idx", doubler, vec_string)
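Since the explicit std::vector<std::string> has to be built by hand here, it may be convenient to wrap that step. This is only a sketch: to_column_names is a hypothetical helper, and the ROOT module is passed in as an argument (root_module) so that the snippet itself does not import ROOT.

```python
def to_column_names(root_module, names):
    """Build a std::vector<std::string> from a Python list of column names.

    root_module is assumed to be the imported ROOT module.
    """
    column_names = root_module.std.vector('string')()
    for name in names:
        column_names.push_back(name)
    return column_names


# Intended usage, assuming ROOT is imported and doubler is defined as above:
# column = rdf.Define("double_idx", doubler, to_column_names(ROOT, ["vec#0"]))
```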

Nevertheless, I would still be interested in getting this to work using the functions directly, without having to wrap them in a functor explicitly. Is the fact that PyROOT does not implicitly convert functions into callable C++ types something that can be fixed?

Not in general: you don’t necessarily have an efficient C++ equivalent for every Python callable.
We are cooking something to make RDF usage from Python much nicer, but that’s a few months away.

Another workaround for the vec#0 branch name is to use an alias, which should work:

df.Alias("vec0", "vec#0").Define("double_idx", "double_index(vec0)")
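When several branches have names like this, the aliasing can be automated. The sketch below is an illustration under assumptions, not an RDataFrame feature: make_alias and alias_unsafe_columns are hypothetical helpers, the check only accepts plain C++ identifiers (the # in vec#0 is not valid in one, which matches the compiler error above), and rdf is assumed to expose RDataFrame's Alias(alias, column) method.

```python
import re

# A plain C++ identifier: a letter or underscore, then letters, digits, underscores.
_CPP_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def make_alias(column):
    """Derive an identifier-safe alias, e.g. 'vec#0' becomes 'vec_0'."""
    alias = re.sub(r"[^A-Za-z0-9_]", "_", column)
    if alias[:1].isdigit():
        alias = "_" + alias  # identifiers must not start with a digit
    return alias


def alias_unsafe_columns(rdf, columns):
    """Alias every column whose name is not a plain C++ identifier.

    Returns the (possibly new) data frame node and the list of names that
    can be used inside a Define expression string.
    """
    usable = []
    for column in columns:
        if _CPP_IDENTIFIER.fullmatch(column) is None:
            alias = make_alias(column)
            rdf = rdf.Alias(alias, column)
            usable.append(alias)
        else:
            usable.append(column)
    return rdf, usable
```

The substitution scheme is a design choice and can collide: 'vec#0' and an existing column 'vec_0' would map to the same name, so real code should check for that before calling Alias.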

Cheers,
Enrico

Thanks @eguiraud, @tmadlener. I checked that the Alias works fine.
Looking forward to the improved RDF from Python!
cheers,
Clement

Hi Enrico,

The Alias trick is very neat.
Thanks for the quick help.

Cheers,
Thomas