Make RDataFrames interoperable with other Python tools

ROOT Version: 6.24/00
Built for linuxx8664gcc on Oct 10 2021, 04:36:00


This topic is closely related to root-forum.cern.ch/t/add-new-column-to-rdataframe/34962/9 (sorry apparently I’m not allowed to put links in the post).

What I want to achieve is to add a column to an RDataFrame that is computed by some external Python tool, e.g.:

prediction = model.predict(data)

model.predict might take an array or matrix and produces an array.
In the thread linked above, the following solution is proposed (and works!), where df is an RDataFrame:

df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')

However, it does not work when called inside a function or method call.

Here is a working example:

import ROOT
prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
# works!!

import ROOT

def add_to_df(df, column):
    df = df.Define("x", 'auto to_eval = "column[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
    return df

prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = add_to_df(df, prediction)

The second example appears to work at first because of lazy evaluation, but as soon as df is accessed, e.g. via df.AsNumpy(), it fails with the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'column' is not defined
TypeError: must be real number, not NoneType
[the same four lines are printed once for each of the 10 entries]
{'x': ndarray([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.], dtype=float32)}
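
The delayed failure follows from lazy evaluation: the expression is only recorded when the column is defined and is evaluated entry by entry once results are requested. A toy sketch of that behaviour in plain Python follows; `LazyFrame`, `define` and `as_lists` are hypothetical names for illustration and are not part of ROOT:

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated data frame (not ROOT's implementation)."""

    def __init__(self, n_entries, columns=None):
        self.n_entries = n_entries
        self.columns = dict(columns or {})

    def define(self, name, expression):
        # Only record the expression string; nothing is evaluated yet.
        return LazyFrame(self.n_entries, {**self.columns, name: expression})

    def as_lists(self, namespace):
        # The "event loop": every expression is evaluated here, once per entry.
        result = {}
        for name, expression in self.columns.items():
            result[name] = [
                eval(expression, namespace, {"entry": i})
                for i in range(self.n_entries)
            ]
        return result


# Defining a column that refers to an unknown name raises no error yet ...
frame = LazyFrame(3).define("x", "column[entry]")

# ... the NameError only appears when the values are actually requested.
try:
    frame.as_lists({"prediction": [1.0, 2.0, 3.0]})
    error = None
except NameError as err:
    error = str(err)

# With a name that exists in the evaluation namespace, the same call succeeds.
ok = LazyFrame(3).define("x", "prediction[entry]").as_lists(
    {"prediction": [1.0, 2.0, 3.0]}
)
```

Here `error` holds the message about the undefined name `column`, while `ok` contains the expected values for `x`.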

So my questions are:

  • Is this the recommended way to add externally computed arrays (i.e. computed outside ROOT) to an RDataFrame? If yes, how do I make it work so that TPython::Eval also recognizes local variables?
  • If not, how should I add new columns to an RDataFrame instead?

Hi @davekch,

Welcome to the ROOT forum!

This is because column is not in scope by the time the expression is evaluated.

In principle, only global variables may be used in Python (in C++ you should be able to provide a lambda which captures some or all the variables in the scope where it is defined).
Also, see this related forum topic: RDataFrame: upload external variables to the Define.
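
The scoping issue can be reproduced without ROOT. The sketch below assumes that the string handed to TPython::Eval is looked up against global names only; `fake_main` and `eval_like_tpython` are hypothetical stand-ins for illustration, not ROOT functions:

```python
# Hypothetical stand-in for the interpreter's global namespace.
fake_main = {"prediction": list(range(10))}


def eval_like_tpython(expression):
    # Assumption: the expression string only sees global names,
    # never the local variables of whichever function built the string.
    return eval(expression, fake_main)


def add_column(column):
    # "column" exists only inside this function, so the lookup fails.
    try:
        return eval_like_tpython("column[3]")
    except NameError as err:
        return "NameError: " + str(err)


global_result = eval_like_tpython("prediction[3]")   # finds the global list
local_result = add_column(fake_main["prediction"])   # cannot see "column"
```

The first lookup succeeds because `prediction` is a global name, while the second one reports that `column` is not defined, which mirrors the error in the original post.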

FYI, @eguiraud.

Cheers,
J.

Hi,
thanks for getting back to me.

The related thread you linked contained some examples that finally made me understand how to use ROOT.Numba.Declare, and after fiddling around for a bit I came up with this code snippet, which lets me fill a column of an RDataFrame with an externally computed array.
This should work in whatever scope, I’m putting it inside a function here for demonstration’s sake:

import ROOT
import numpy as np

def add_to_df(df):
    prediction = np.arange(10, dtype=np.float64)  # np.float is deprecated and was removed in NumPy 1.24

    @ROOT.Numba.Declare(["int"], "float")
    def get_prediction(index):
        return prediction[index]

    df = df.Define("x", "Numba::get_prediction(rdfentry_)")
    return df

df = ROOT.RDataFrame(10)
df = add_to_df(df)
print(df.AsNumpy())

Note that prediction must not be a list but an array, otherwise numba can’t compile.

Hi,
this last snippet is the “correct” (as in, the nicest I have ever seen so far) solution.

On the roadmap for this year, we plan to automatically decorate Python functions passed to RDataFrame with @ROOT.Numba.Declare, so soon you will be able to write just:

prediction = np.arange(10, dtype=np.float64)
df = df.Define("x", lambda rdfentry_: prediction[rdfentry_])

Cheers,
Enrico


Hi @eguiraud ,
this sounds really great!

So, will Define then accept either a string containing C++ code or a Python callable that takes an index as its only argument?

Either a string containing C++ code or a Python callable that takes any number of RDF columns as arguments and that can be digested by Numba. Basically we’ll just transform:

df.Define("x", lambda rdfentry_: prediction[rdfentry_])

into

@ROOT.Numba.Declare(["ULong64_t"], "float")
def func1(var1):
   return prediction[var1]

df.Define("x", "Numba::func1(rdfentry_)")

under the hood.

That’s the idea at least. But I don’t see any blockers, we are just saving the user some typing by adding information that RDF already knows (the type of the columns).
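
As a rough illustration of that transformation, a wrapper could read the callable's parameter names as column names and generate the expression string from them. This is only a sketch of the idea, not ROOT's implementation; `build_define_expression` and the `funcN` naming scheme are assumptions made for this example:

```python
import inspect
import itertools

# Counter used to generate unique names: func1, func2, ...
_name_counter = itertools.count(1)


def build_define_expression(func):
    """Return (generated_name, expression) for a Python callable.

    The parameter names of the callable are taken to be column names.
    """
    columns = list(inspect.signature(func).parameters)
    name = f"func{next(_name_counter)}"
    expression = f"Numba::{name}({', '.join(columns)})"
    return name, expression


prediction = [0.5 * i for i in range(10)]
name, expression = build_define_expression(lambda rdfentry_: prediction[rdfentry_])
```

A real implementation would additionally have to look up each column's type and register the function with Numba; the sketch only covers the string generation step.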

Cheers,
Enrico

@davekch I am quite interested in this thread. I suppose you have a scikit MVA or a gbreweighter object here from which you can predict a weight. Would you mind sharing a simple code snippet showing how this works? In the past I gave up on bringing the Python-based reweighters into a C++ application and using them in a Define, but here it seems you can work around that inside Python.

Right, there is an MVA classifier (PyFastBDT.FastBDT) object which has a predict method that takes an array of grouped values. The result of the classifier is then added back to the RDataFrame as shown above.
Here’s a short snippet of the relevant code:

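arrays = df.AsNumpy(columns=variables)  # dict: column name -> numpy array
x_predict = np.column_stack([arrays[name] for name in variables])  # one row per entry, one column per variable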
pcs = classifier.predict(x_predict)

@ROOT.Numba.Declare(["int"], "float")
def get_pcs(index):
    return pcs[index]

df = df.Define(output_column, "Numba::get_pcs(rdfentry_)")
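
The input to predict has one row per entry and one column per variable. The regrouping step can be sketched in plain Python, with lists standing in for the arrays returned by AsNumpy; the column names and values are made up for illustration:

```python
# Stand-in for the result of df.AsNumpy(columns=variables):
# a dict mapping each column name to its per-entry values.
arrays = {
    "pt": [10.0, 20.0, 30.0],
    "eta": [0.1, 0.2, 0.3],
}
variables = ["pt", "eta"]

# Select the per-column sequences in a fixed order, then transpose them
# so that each row groups the values belonging to one entry.
rows = [list(row) for row in zip(*(arrays[name] for name in variables))]

# Pitfall: unpacking the dict itself iterates over its keys (the column
# names), so zip(*arrays) pairs up characters of the names, not the values.
wrong = list(zip(*arrays))
```

Indexing the dict by the names in `variables` also fixes the column order, which matters if the classifier expects its features in a particular order.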

@davekch thanks a lot. I guess this only works single-threaded, right? Since rdfentry_ doesn’t get shuffled around, you can trust that the array returned by predict stays aligned index-wise with the inputs returned by AsNumpy, correct?


Uh yeah I should have mentioned that: rdfentry_ won’t be stable in multi-thread runs.
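
A toy model of why index-based lookup breaks once rows are no longer handled in entry order is sketched below. It assumes, purely for illustration, that two tasks each process half of the entries and that the second half is collected first; it does not reproduce ROOT's actual scheduling:

```python
entries = list(range(10))
values = [10 * e for e in entries]        # input column value of each entry

# Assumption: the second chunk of entries is collected before the first.
collected_order = entries[5:] + entries[:5]
collected_values = [values[e] for e in collected_order]

# The external tool returns one prediction per collected row, in that order.
predictions = [v + 1 for v in collected_values]

# Indexing the predictions by entry number now picks the wrong rows.
misaligned = [e for e in entries if predictions[e] != values[e] + 1]

# Keeping a per-row key next to the data restores the alignment.
by_entry = dict(zip(collected_order, predictions))
realigned = [by_entry[e] for e in entries]
```

In this toy model every entry ends up with the wrong prediction when looked up by position, whereas the keyed lookup is correct. Whether a suitable stable key (for example an event number column) is available depends on the dataset.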
