I have the following problem with RDataFrame (RDF): I need to add a column to the RDF, and the content of the column cannot easily be computed from the columns it already contains.
What is the neatest way to do this (granting that the dirty ways are not very satisfactory)?
To give some context: the new column consists of the predictions of a machine learning algorithm (which is best kept detached and standalone), so a simple expression-based definition such as
auto df_with_define = df.Define("newColumn", "x*x + y*y");
does not apply to my case.
Unfortunately it is non-trivial to call Python code from inside C++. The simplest solution is to compute the ML values beforehand and attach them to the actual dataset as a friend tree.
Resorting to friend trees is how we currently circumvent this limitation.
This is what we deem “not very satisfactory”.
From my very naive point of view as a newcomer to data frames, I think that the ability to easily stitch an additional column onto an existing data frame, in Python, is a critical feature.
Hi Riccardo,
it might be possible, but (currently) not very performant. We are working on better Python integration.
See this post and the following ones: Rdataframe define column of same constant value. You can use TPython::Eval inside a Define/Filter lambda to get a Python value into a data frame expression.
If the result of the model evaluation fits in memory, the most performant way is still to do a single evaluation over the whole dataset and then index the result. Something like (I haven't tested it):
import ROOT
from ROOT import TFile, TTree
from array import array
import numpy as np
f = TFile( 'test.root', 'recreate' )
t = TTree( 'tree', 'tree without histos' )
nevents = 1000
n = array( 'i', [ 0 ] )
d = array( 'f', [ 0.] )
t.Branch( 'mynum', n, 'mynum/I' )
t.Branch( 'myval', d, 'myval/F' )
for i in range(nevents):
    n[0] = i
    d[0] = np.random.normal()
    t.Fill()
f.Write()
f.Close()
predictions = np.random.normal(100, 1, nevents)
df = ROOT.RDataFrame('tree', 'test.root')
# https://root-forum.cern.ch/t/add-new-column-to-rdataframe/34962/6
df = df.Define("prediction", 'float(TPython::Eval("predictions[rdfentry_]"))')
hh = df.Histo1D('prediction')
hh.Draw()
As far as I understand, it fails because rdfentry_ is not accessible from within Eval.
Then I thought I could just pass the whole numpy array to TPython::Eval (that is, I removed [rdfentry_]), but it seems that Define can only handle single floats, not arrays/vectors of floats.
In [4]: hh.Draw()
...:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<string> in <module>()
NameError: name 'rdfentry_' is not defined
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: a float is required
---------------------------------------------------------------------------
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
---------------------------------------------------------------------------
(works, but might be quite slow depending on the use case).
Regarding the second error: it's because you are casting the TPython::Eval result (which is a collection) to a float, and of course that can't work. I don't think TPython::Eval supports converting Python collections to C++ collections; see the docs for the returned object, TPyReturn.
Yes, it turns out that calling Python from C++ from Python is ugly. Friend trees are nicer, and probably faster. @swunsch has a nicer solution for the new, experimental PyROOT (available if you build ROOT yourself) that might be closer to the elegance/performance sweet spot.
From the technical side: the problem is that this approach will never be thread-safe, and in the multi-threaded case it also interferes with Python's global interpreter lock.
From the ML/algorithmic side: as soon as you use a neural network, batch inference will always be much, much faster than event-by-event evaluation, unless you filter your dataset massively beforehand. So the friend-tree solution would be the most suitable.
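The batch-vs-per-event point can be illustrated without ROOT at all. A toy sketch, with a fixed linear layer standing in for the neural network (names and sizes are made up for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 4))  # hypothetical per-event inputs

# Toy "model": a fixed linear layer standing in for a neural network.
weights = rng.normal(size=4)
def model(batch):
    return np.atleast_2d(batch) @ weights

# Event-by-event inference: one model call per entry.
per_event = np.concatenate([model(row) for row in features])

# Batch inference: a single call over the whole dataset.
batched = model(features)

# Identical numbers, but the batched call pays the per-call
# overhead only once instead of once per entry.
print(np.allclose(per_event, batched))  # -> True
```

With a real network the gap is even larger, since a framework can vectorize the whole forward pass over the batch.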
In case you're just executing something "simple" in Python, you can use the solution above at the speed of C++, since we use numba where possible to JIT the function into compiled code.
Best
Stefan
Edit: Even though the code above is not suited for multi-threading, we have protected the calls with a lock!