Make RDataFrames interoperable with other Python tools

davekch · January 14, 2022, 10:39am

ROOT Version: 6.24/00
Built for linuxx8664gcc on Oct 10 2021, 04:36:00

This topic is closely related to root-forum.cern.ch/t/add-new-column-to-rdataframe/34962/9 (sorry, apparently I’m not allowed to put links in the post).

What I want to achieve is to add a column to an RDataFrame that is computed by some external Python tool, e.g.:

prediction = model.predict(data)

model.predict might take an array or matrix and produce an array.
In the thread linked above, the following solution is proposed (and works!), where df is an RDataFrame:

df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')

However, this does not work when the Define call sits inside a function or method.

Here is a working example:

import ROOT
prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
# works!!

And here is the same thing inside a function, which fails:

import ROOT

def add_to_df(df, column):
    df = df.Define("x", 'auto to_eval = "column[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
    return df

prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = add_to_df(df, prediction)

The second example appears to work at first because of lazy evaluation, but as soon as df is accessed, e.g. via df.AsNumpy(), it fails with the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name 'column' is not defined
TypeError: must be real number, not NoneType
[... the same four lines are printed 10 times in total, once per entry ...]
{'x': ndarray([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.], dtype=float32)}

So my questions are:

1. Is this the recommended way to add externally computed arrays (not computed with ROOT) to an RDataFrame? If yes, how do I make it work so that TPython::Eval also recognizes local variables?
2. If not, how should I add new columns to an RDataFrame instead?

jalopezg · January 14, 2022, 11:30am

Hi @davekch,

Welcome to the ROOT forum!

This is because column is not in scope by the time the expression is evaluated.

In principle, only global variables may be used in Python (in C++ you should be able to provide a lambda which captures some or all the variables in the scope where it is defined).
Also, see this related forum topic: RDataFrame: upload external variables to the Define.
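
The scoping rule can be reproduced without ROOT. The sketch below is only an illustration, assuming that TPython::Eval evaluates its string in the namespace of Python's `__main__` module. `eval_in_main` is a stand-in for TPython::Eval, not ROOT code, and `lookup_local`, `lookup_published` and `_published_column` are hypothetical names used only here.

```python
import __main__


def eval_in_main(expression):
    """Stand-in for TPython::Eval (assumption): evaluate in __main__'s namespace."""
    return eval(expression, vars(__main__))


def lookup_local(column):
    # 'column' exists only in this function's local scope, so the
    # evaluated string cannot see it and a NameError is raised.
    return eval_in_main("column[0]")


def lookup_published(column):
    # Publishing the object as an attribute of __main__ turns it into a
    # global that the evaluated string can resolve.
    __main__._published_column = column
    return eval_in_main("_published_column[0]")
```

Under that assumption, a function that wants to use the TPython::Eval trick would have to store its array under a unique global name first and refer to that name in the Define string.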

FYI, @eguiraud.

Cheers,
J.

davekch · January 14, 2022, 2:38pm

Hi,
thanks for getting back to me.

The related thread you linked had some examples that finally made me understand how to use ROOT.Numba.Declare, and after fiddling around for a bit I came up with this code snippet, which fills a column of an RDataFrame with an externally computed array.
This should work in whatever scope; I’m putting it inside a function here for demonstration’s sake:

import ROOT
import numpy as np

def add_to_df(df):
    prediction = np.arange(10, dtype=float)  # np.float (an alias of the builtin float) is gone in newer NumPy

    @ROOT.Numba.Declare(["int"], "float")
    def get_prediction(index):
        return prediction[index]

    df = df.Define("x", "Numba::get_prediction(rdfentry_)")
    return df

df = ROOT.RDataFrame(10)
df = add_to_df(df)
print(df.AsNumpy())

Note that prediction must be a NumPy array, not a list; otherwise Numba can’t compile the function.

eguiraud · January 14, 2022, 2:47pm

Hi,
this last snippet is the “correct” (as in, the nicest I have ever seen so far) solution.

On the roadmap for this year we have automatically decorating Python functions passed to RDataFrame with @ROOT.Numba.Declare, so soon you will be able to write just:

prediction = np.arange(10, dtype=float)
df = df.Define("x", lambda rdfentry_: prediction[rdfentry_])

Cheers,
Enrico

davekch · January 14, 2022, 2:55pm

Hi @eguiraud ,
this sounds really great!

So, will Define then accept either a string containing C++ code or a Python callable that takes an index as its only argument?

eguiraud · January 14, 2022, 2:58pm

Either a string containing C++ code or a Python callable that takes any number of RDF columns as arguments and that can be digested by Numba. Basically we’ll just transform:

df.Define("x", lambda rdfentry_: prediction[rdfentry_])

into

@ROOT.Numba.Declare(["ULong64_t"], "float")
def func1(var1):
   return prediction[var1]

df.Define("x", "Numba::func1(rdfentry_)")

under the hood.

That’s the idea at least. But I don’t see any blockers, we are just saving the user some typing by adding information that RDF already knows (the type of the columns).
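
The name-matching part of that transformation can be sketched in plain Python. This is only an illustration of the idea, not ROOT's implementation: `build_define_expression` is a hypothetical helper, and it assumes that the callable's parameter names are taken to be RDataFrame column names.

```python
import inspect


def build_define_expression(func, declared_name):
    """Return the column names a callable expects and the matching C++ call string."""
    # Parameter names are interpreted as RDataFrame column names (assumption).
    columns = list(inspect.signature(func).parameters)
    expression = "Numba::{}({})".format(declared_name, ", ".join(columns))
    return columns, expression


columns, expression = build_define_expression(lambda rdfentry_: 0.0, "func1")
print(columns)     # ['rdfentry_']
print(expression)  # Numba::func1(rdfentry_)
```

The remaining piece, looking up each column's type to fill in the Declare signature, needs information that only RDF has and is left out here.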

Cheers,
Enrico

RENATO_QUAGLIANI · January 14, 2022, 5:45pm

@davekch I am quite interested in this thread. I suppose you have a scikit MVA or a gbreweighter object here from which you can predict a weight. Would you mind sharing a simple piece of code showing how this works? In the past I gave up on bringing the Python-based reweighters into a C++ application and using them in a Define, but here it seems you can work around that inside Python.

davekch · January 17, 2022, 3:23pm

Right, there is an MVA classifier (PyFastBDT.FastBDT) object which has a predict method that takes an array of grouped values. The result of the classifier is then added back to the RDataFrame as shown above.
Here’s a short snippet of the relevant code:

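# AsNumpy returns a dict of column arrays; stack them by name so each row holds one entry
arrays = df.AsNumpy(columns=variables)
x_predict = np.column_stack([arrays[v] for v in variables])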
pcs = classifier.predict(x_predict)

@ROOT.Numba.Declare(["int"], "float")
def get_pcs(index):
    return pcs[index]

df = df.Define(output_column, "Numba::get_pcs(rdfentry_)")
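
The classifier wants one row per entry, while AsNumpy hands back one array per column. A minimal sketch of that reshaping step, with plain lists standing in for the arrays; `columns_to_rows` and the column names `pt` and `eta` are hypothetical and only serve the illustration:

```python
def columns_to_rows(arrays, variables):
    """Turn {column name: values} into a list of per-entry rows, ordered like `variables`."""
    # Iterating a dict directly yields its keys, so pick the value
    # sequences explicitly by name before zipping them together.
    return [list(row) for row in zip(*(arrays[name] for name in variables))]


# Plain lists stand in for the arrays returned by df.AsNumpy(columns=variables).
arrays = {"pt": [1.0, 2.0, 3.0], "eta": [0.1, 0.2, 0.3]}
rows = columns_to_rows(arrays, ["pt", "eta"])
print(rows)  # [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3]]
```

Wrapping `rows` in np.array gives a matrix of shape (number of entries, number of variables), which is the layout a predict method typically expects.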

RENATO_QUAGLIANI · January 17, 2022, 6:53pm

@davekch thanks a lot. I guess this only works single-threaded, right? That is, rdfentry_ doesn’t get shuffled around, so you can trust that the predicted array stays index-aligned with the inputs returned by AsNumpy, correct?

eguiraud · January 17, 2022, 6:56pm

Uh yeah I should have mentioned that: rdfentry_ won’t be stable in multi-thread runs.
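
One way to guard against that is to check the preconditions before attaching the array. The helper below is a hypothetical sketch, not part of ROOT; it takes the relevant facts as plain arguments so it can be used and tested on its own.

```python
def check_entry_alignment(n_entries, values, implicit_mt_enabled):
    """Raise if indexing `values` with rdfentry_ cannot be trusted."""
    if implicit_mt_enabled:
        # rdfentry_ is not stable in multi-thread runs, so refuse early.
        raise RuntimeError("rdfentry_ is not stable in multi-thread runs; run single-threaded")
    if len(values) != n_entries:
        # A length mismatch would mean out-of-range or misaligned lookups.
        raise ValueError("expected {} values, got {}".format(n_entries, len(values)))
```

With ROOT it could be called as check_entry_alignment(df.Count().GetValue(), pcs, ROOT.IsImplicitMTEnabled()) right before the Define; note that Count() triggers an event loop of its own.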

system · January 31, 2022, 6:57pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.