ROOT Version: 6.24/00
Built for linuxx8664gcc on Oct 10 2021, 04:36:00
This topic is closely related to root-forum.cern.ch/t/add-new-column-to-rdataframe/34962/9 (sorry apparently I’m not allowed to put links in the post).
What I want to achieve is to add a column to an RDataFrame that is computed by some external Python tool, e.g.:
prediction = model.predict(data)
where model.predict might take an array or matrix and produce an array.
In the thread linked above, the following solution is proposed (and works!), where df is an RDataFrame:
df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
However, it does not work when the Define call is made inside a function or method.
Here is a working example, with prediction defined at module level:
import ROOT
prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = df.Define("x", 'auto to_eval = "prediction[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
# works!!
The same call wrapped in a function does not work:
import ROOT
def add_to_df(df, column):
    df = df.Define("x", 'auto to_eval = "column[" + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
    return df
prediction = list(range(10))
df = ROOT.RDataFrame(10)
df = add_to_df(df, prediction)
The second example appears to work at first because of lazy evaluation, but as soon as df is accessed, e.g. via df.AsNumpy(), it fails with the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
NameError: name 'column' is not defined
TypeError: must be real number, not NoneType
[... the same four lines are printed 9 more times, once for each remaining entry ...]
{'x': ndarray([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.], dtype=float32)}
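The NameError suggests that the string handed to TPython::Eval is evaluated against the interpreter's top-level (__main__) namespace, so a name that exists only as a function argument is not visible to it. Below is a ROOT-free sketch of that behaviour in plain Python. The names module_namespace, eval_in_module_namespace, lookup_via_local_name, lookup_via_registered_name and _rdf_column are all hypothetical stand-ins for illustration, not part of ROOT, and the analogy to TPython::Eval is an assumption based on the error message above.

```python
# Pure-Python analogy, no ROOT required. The dict below plays the role of
# the top-level namespace that the evaluated string can see (an assumption
# about TPython::Eval, inferred from the NameError).
module_namespace = {"prediction": list(range(10))}


def eval_in_module_namespace(expression):
    # Only names stored in module_namespace (plus builtins) are visible here;
    # locals of whichever function built the expression are not.
    return eval(expression, module_namespace)


def lookup_via_local_name(column, entry):
    # 'column' exists only as a local of this function, so the lookup fails
    # in the same way as the wrapped Define call.
    try:
        return eval_in_module_namespace(f"column[{entry}]")
    except NameError as err:
        return f"NameError: {err}"


def lookup_via_registered_name(column, entry, name="_rdf_column"):
    # Hypothetical workaround: publish the object under a top-level name
    # first, then refer to that name in the evaluated string.
    module_namespace[name] = column
    return eval_in_module_namespace(f"{name}[{entry}]")
```

If the assumption holds, the equivalent in the real setup would be to make the array reachable under a module-level name before the event loop runs and to use that name inside the Define string. I have not verified this against ROOT itself.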
So my questions are:
- Is this the recommended way to add externally computed arrays (i.e. not computed with ROOT) to an RDataFrame? If yes, how do I make it work so that TPython::Eval can also recognize local variables?
- If not, how should I add new columns to an RDataFrame instead?