Problem calling RDataFrame::Define from PyROOT with a plain function

I am currently having trouble defining a new column in an RDataFrame.

I have created a simple test file whose tree contains a single branch of type std::vector<int>:

TFile* f = new TFile("test_file.root", "recreate");
TTree* t = new TTree("tree", "tree with some values");

auto* vec = new std::vector<int>();
vec->reserve(5);
t->Branch("vec", &vec);

for (int i = 0; i < 10; ++i) {
  vec->clear();
  for (int j = 0; j < 5; ++j) {
    vec->push_back((int) gRandom->Uniform(0, 10));
  }
  t->Fill();
}
t->Write();
f->Write();
f->Close();

When I then try to Define a new column in an RDataFrame using

import ROOT
ROOT.gInterpreter.ProcessLine('''
std::vector<int> double_index(ROOT::VecOps::RVec<int> ind) {
  std::vector<int> indices;
  indices.reserve(ind.size());
  for (const auto i : ind) {
    indices.push_back(i * 2);
  }
  return indices;
}''')

rdf = ROOT.RDataFrame('tree', 'test_file.root')
rdf.Define('double_idx', ROOT.double_index, ['vec'])

I get the following error

Traceback (most recent call last):
  File "rdf_new.py", line 14, in <module>
    rdf.Define('double_idx', ROOT.double_index, ['vec'])
TypeError: can not resolve method template call for 'Define'

The problem in this case does not seem to be that ['vec'] is not correctly identified as a std::vector<std::string>, but rather the double_index function. I have tried the workaround of passing an actual vector of strings as the last argument, as outlined in "Problem calling RDataFrame::Define with ColumnNames_t with python", and that leads to the same error.

I am not sure whether I am running into a bug here, or whether I have to call Define in a slightly different way. Any help is very much appreciated.


ROOT Version: 6.20.04
Platform: CentOS7
Compiler: gcc8.3.0


Hi,
the Python spelling is

rdf.Define('double_idx', 'double_index(vec)')

the reason being that ROOT.double_index is not implicitly convertible to a C++ callable type by PyROOT (not 100% sure why).

Alternatively, you can do as in the thread you linked to: define a functor type (a class with a call operator) and pass an instance of that class.

Cheers,
Enrico

Thanks for the quick reply!

The problem is that the usual Python spelling doesn’t work in my “real life” case, because the branch name contains a # character, which leads to the following error:

input_line_48:1:59: error: use of undeclared identifier 'vec'
namespace __rdf_0{ auto rdf_f = []() {return double_index(vec#0)
                                                          ^
Traceback (most recent call last):
  File "rdf_test.py", line 14, in <module>
    column = rdf.Define("double_idx", "double_index(vec#0)")
TypeError: can not resolve method template call for 'Define'

So that’s why I was hoping that it would be possible to use the overload that takes a vector of strings to specify the arguments to the function.

Is there a way to make PyROOT explicitly aware of the fact that ROOT.double_index is a callable C++ type here?

For future reference: wrapping the whole call into a functor works fine, i.e.

struct IndexDoubler {
  std::vector<int> operator()(ROOT::VecOps::RVec<int> ind) { return double_index(ind); }
};

and then using it in Python via

# rest of the setup
doubler = ROOT.IndexDoubler()
# still necessary in v6.20.04
vec_string = ROOT.std.vector('string')()
vec_string.push_back('vec#0')

column = rdf.Define("double_idx", doubler, vec_string) 
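
If several free functions need this treatment, the wrapper source can be generated instead of written by hand. The following is only a sketch: functor_source is a hypothetical helper (not part of ROOT or PyROOT), and it assumes the wrapped function is already declared to the interpreter and that its return and argument types are known and passed in as strings.

```python
def functor_source(struct_name, function_name, return_type, arg_types):
    """Build C++ source for a struct whose call operator forwards to a free function.

    Hypothetical helper for illustration; all names and types are supplied by
    the caller as plain strings and are not checked here.
    """
    # One parameter per argument type, named a0, a1, ...
    params = ", ".join(f"{t} a{i}" for i, t in enumerate(arg_types))
    args = ", ".join(f"a{i}" for i in range(len(arg_types)))
    return (
        f"struct {struct_name} {{\n"
        f"  {return_type} operator()({params}) {{ return {function_name}({args}); }}\n"
        f"}};"
    )


# Reproduces the IndexDoubler wrapper from above (parameter named a0 instead of ind).
source = functor_source(
    "IndexDoubler", "double_index", "std::vector<int>", ["ROOT::VecOps::RVec<int>"]
)
print(source)
```

The resulting string could then be handed to ROOT.gInterpreter.Declare(source) once double_index itself has been declared, after which ROOT.IndexDoubler() can be used as shown above.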

Nevertheless, I would still be interested in getting this to work with the functions directly, without having to wrap them in a functor explicitly. Can the fact that PyROOT does not implicitly convert functions into callable C++ types be fixed?

Not in general: you don’t necessarily have an efficient C++ equivalent for any Python callable.
We are cooking something to make RDF usage from Python much nicer, but that’s a few months away.

Another workaround for the vec#0 branch name could be an alias, which should work:

df.Alias("vec0", "vec#0").Define("double_idx", "double_index(vec0)")
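
When many branches have names like vec#0, the alias names can be derived rather than typed one by one. This is a minimal sketch, assuming it is acceptable to simply drop every character that is not valid in a C++ identifier; cpp_safe_alias and alias_map are hypothetical helpers, not part of ROOT, and they do not check for C++ keywords.

```python
import re


def cpp_safe_alias(column_name):
    """Return a name usable as a C++ identifier in a JIT-compiled expression.

    Hypothetical helper: drops every character outside [A-Za-z0-9_] and
    prefixes an underscore if the result is empty or starts with a digit.
    """
    alias = re.sub(r"[^A-Za-z0-9_]", "", column_name)
    if not alias or alias[0].isdigit():
        alias = "_" + alias
    return alias


def alias_map(column_names):
    """Map each column that needs an alias to that alias, refusing collisions."""
    aliases = {}
    taken = set(column_names)
    for name in column_names:
        alias = cpp_safe_alias(name)
        if alias == name:
            continue  # already a valid identifier, no alias needed
        if alias in taken:
            raise ValueError(
                f"alias {alias!r} for {name!r} collides with another column"
            )
        taken.add(alias)
        aliases[name] = alias
    return aliases


print(alias_map(["vec#0", "pt"]))
```

Each (name, alias) pair could then be applied with df = df.Alias(alias, name) before the Define call, so that the expression string only ever refers to the alias.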

Cheers,
Enrico

Thanks @eguiraud, @tmadlener, I checked that the Alias works fine.
Looking forward to the improved RDF from Python!
cheers,
Clement

Hi Enrico,

The Alias trick is very neat.
Thanks for the quick help.

Cheers,
Thomas
