Add new column to RDataFrame

Hi!

I have the following problem with RDF: I need to add a column to the RDF, and the content of the column cannot be easily computed from other columns already contained in the RDF.
What is the neatest way to do this (granting that the dirty ways are not very satisfactory)?

To give some context: this new column consists of the prediction of some machine learning algorithm (which is best kept detached and standalone), so the simple example

auto df_with_define = df.Define("newColumn", "x*x + y*y");

does not apply.

Thanks,
Riccardo

Hi,
from your snippet it looks like you are using C++. The easiest way, then, is to use lambda captures:

double some_other_variable = 42;
auto df2 = df.Define("newColumn", [=] { return some_other_variable; });

or

auto ml_model = SomeMachineLearningTool(...);
auto df2 = df.Define("prediction",
                     [&ml_model] (double column_value) { return ml_model.predict(column_value); },
                     {"column_name"});

Cheers,
Enrico

Thanks Enrico.
Actually I am using Python; I just copy-pasted the example from the RDF docs.
Riccardo

Hi!

Unfortunately it is non-trivial to call Python code from inside C++. The simplest solution is to compute the ML values beforehand and attach them to the actual dataset as a friend tree, e.g. as sketched below.
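
A rough sketch of what I mean (untested; the file, tree and branch names are just placeholders):

import numpy as np
import ROOT

nevents = 1000

# batch-evaluate the model beforehand (stand-in for your real predictions)
predictions = np.random.normal(100, 1, nevents).astype(np.float32)

# write the predictions into a dedicated tree
fout = ROOT.TFile('predictions.root', 'recreate')
tfriend = ROOT.TTree('predtree', 'ML predictions')
pred = np.zeros(1, dtype=np.float32)
tfriend.Branch('prediction', pred, 'prediction/F')
for p in predictions:
    pred[0] = p
    tfriend.Fill()
fout.Write()
fout.Close()

# attach it as a friend of the original tree and build the dataframe on top
fin = ROOT.TFile('test.root')
tree = fin.Get('tree')
tree.AddFriend('predtree', 'predictions.root')
df = ROOT.RDataFrame(tree)  # 'prediction' is now a regular column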

Does this work for you?

Best
Stefan

Hi Stefan,

resorting to friend trees is how we currently circumvent this limitation.
This is what we deem “not very satisfactory” :wink:

From my very naive point of view as a newcomer to data frames, I think that the ability to easily stitch an additional column onto an existing data frame - in Python - is a critical feature.

Cheers,
Riccardo

Hi Riccardo,
it is possible, but (currently) not very performant. We are working on better Python integration.

See this post and the following: Rdataframe define column of same constant value. You can use TPython::Eval inside a Define/Filter expression to get a Python value into the dataframe.

If the result of the model evaluation fits in memory, the most performant way is still to do a single evaluation of the whole dataset, and then index the result. Something like (haven’t tested it):

predictions = ml_model.predict(data)
df = df.Define("prediction", 'float(TPython::Eval("predictions[rdfentry_]"))')

Hi Enrico,

here’s a minimal version of my attempt

import ROOT
from ROOT import TFile, TTree
from array import array
import numpy as np

f = TFile( 'test.root', 'recreate' )
t = TTree( 'tree', 'tree without histos' )

nevents = 1000

n = array( 'i', [ 0 ] )
d = array( 'f', [ 0.] )
t.Branch( 'mynum', n, 'mynum/I' )
t.Branch( 'myval', d, 'myval/F' )
 
for i in range(nevents):
    n[0] = i
    d[0] = np.random.normal()
    t.Fill()
 
f.Write()
f.Close()

predictions = np.random.normal(100, 1, nevents)

df = ROOT.RDataFrame('tree', 'test.root')

# https://root-forum.cern.ch/t/add-new-column-to-rdataframe/34962/6
df = df.Define("prediction", 'float(TPython::Eval("predictions[rdfentry_]"))')

hh = df.Histo1D('prediction')
hh.Draw()

as far as I understand, it fails because rdfentry_ is not accessible from within Eval.
Then I thought I could just pass the whole numpy array to TPython::Eval (that is, I removed [rdfentry_]), but it seems that Define can only handle single floats, not arrays / vectors of floats.

Thanks

Hi,
what’s the error you get with rdfentry_?

Also, Define should be able to handle any non-pointer, non-reference C++ type.
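
For instance, something like this should be fine (untested sketch):

# a jitted Define can return a collection, not just a scalar
df = df.Define("many_floats", "std::vector<float>{1.f, 2.f, 3.f}")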

Cheers,
Enrico

here’s what I get with the example above

In [4]: hh.Draw()
   ...:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<string> in <module>()

NameError: name 'rdfentry_' is not defined
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: a float is required
---------------------------------------------------------------------------

whereas, if I change the relevant line to

df = df.Define("prediction", 'float(TPython::Eval("predictions"))')

I get

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
---------------------------------------------------------------------------

Alright, thanks!

Regarding the first problem: like most untested code, it was wrong :sweat_smile: This works on ROOT v6.16:

>>> import ROOT
>>> a = range(10)
>>> df = ROOT.RDataFrame(10)
>>> df = df.Define("x", 'auto to_eval = std::string("a[") + std::to_string(rdfentry_) + "]"; return float(TPython::Eval(to_eval.c_str()));')
>>> display = df.Display()
>>> display.Print()
x        | 
0.00000f | 
1.00000f | 
2.00000f | 
3.00000f | 
4.00000f | 

(works, but might be quite slow depending on the usecase).

Regarding the second error: it’s because you are casting that TPython::Eval result (which is a collection) to a float – of course that can’t work. I don’t think TPython::Eval supports converting Python collections to C++ collections; see the docs for the returned object, TPyReturn.

Cheers,
Enrico

I’ve just tried it out, and it works perfectly!
Now that the toy example is sorted out, I hope this is still fast enough for the real-life case!

Thanks a lot,
Riccardo

Good luck!

Yes, it turns out that calling Python from C++ from Python is ugly. Friend trees are nicer, and probably faster.
@swunsch has a nicer solution for the new, experimental PyROOT (available if you build ROOT yourself) that might be closer to the elegance/performance sweet spot.

Hi!

It probably does not help you right now, but let me show you what we already have in experimental PyROOT in 6.18 (planned to become “standard” in 6.20):

import ROOT

class AwesomeModel:
    def predict(self, x):
        return x[0] * x[1]

model = AwesomeModel()

@ROOT.DeclareCppCallable(["float"] * 2, "float")
def predictModel(var1, var2):
    return model.predict([var1, var2])

# the input columns var1 and var2 have to exist; define dummies here so the
# example runs standalone
df = ROOT.ROOT.RDataFrame(10).Define("var1", "1.f").Define("var2", "2.f") \
                             .Define("x", "CppCallable::predictModel(var1, var2)")
print(df.AsNumpy())

From the technical side: the problem is that this will never be thread safe, and it also interferes with Python’s global interpreter lock in the multi-threaded case.

From the ML/algorithmic side: as soon as you use a neural network, batch inference will always be much, much faster than event-by-event evaluation, unless you filter your dataset massively beforehand. So the friend tree solution would be the most suitable; see the sketch below.
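
Just to illustrate the batch idea (a sketch, assuming a model with a scikit-learn-style predict and a dataframe with input columns var1 and var2):

import numpy as np

# pull the input columns into numpy arrays in one go...
cols = df.AsNumpy(["var1", "var2"])
features = np.stack([cols["var1"], cols["var2"]], axis=1)

# ...and evaluate the model once on the whole dataset
predictions = model.predict(features)

The resulting array can then be attached to the original dataset as a friend tree, as sketched earlier in the thread.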

In case you’re just executing something “simple” in Python, you can use the solution above at the speed of C++, since we use numba, where possible, to JIT the function into compiled code :slight_smile:

Best
Stefan

Edit: Even though the code above is not suited for multi-threading, we have protected the calls with a lock!

In case someone stumbles upon this post: DeclareCppCallable is now in production as ROOT.Numba.Declare; a tutorial is available here.
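
A minimal sketch of the production interface (the function and column names are made up):

import ROOT

# the decorator JIT-compiles the Python function with numba and exposes it
# to the ROOT interpreter in the Numba namespace
@ROOT.Numba.Declare(["float", "float"], "float")
def pypredict(var1, var2):
    return var1 * var2

df = (ROOT.RDataFrame(10)
          .Define("var1", "1.f")
          .Define("var2", "2.f")
          .Define("x", "Numba::pypredict(var1, var2)"))
print(df.AsNumpy(["x"]))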
