Adding a new branch from a Python ML model

jcob · January 8, 2021, 1:13pm

Hi,

I would like to define a new branch/column in an RDataFrame, similar to what is described in this thread: Add new column to RDataFrame

The decorator @ROOT.DeclareCppCallable doesn’t seem to exist, although I’m not using experimental PyROOT, so I could be wrong. However, support for Numba callables has been added in the most recent version of ROOT. Is there a way to do this?

Some skeleton code to show what I would like to do:



import ROOT

model = Some_sklearn_model()

@ROOT.DeclareCppCallable(["float"] * 2, "float")
def predictModel(var1, var2):
    # predict expects a 2D input (n_samples, n_features) and returns an array
    return model.predict([[var1, var2]])[0]

df = ROOT.RDataFrame(10).Define("x", "CppCallable::predictModel(var1, var2)")

Thanks!


ROOT Version: 6.22.06

eguiraud · January 8, 2021, 1:49pm

Hi,
the feature has changed a bit since that comment was posted; see this NumbaDeclare tutorial.

Hope this helps!
Enrico

jcob · January 8, 2021, 2:49pm

Hi @eguiraud,

Thanks for pointing me to this. I tried something like this but it doesn’t seem to work. In my case, the classifier I am using requires a pandas dataframe as input rather than just an array. My implementation is as follows:


my_model = classifiers['KnnFlatness']
@ROOT.Numba.Declare(['float'] * 4, 'float')
def decision_value(d0_pt,pis_pt,chi2,doca):
    frame = pd.DataFrame({'Dst_ReFit_D0_PT':d0_pt,'Pi_slow_ReFit_PT':pis_pt,'Dst_ReFit_chi2_best':chi2,'D0_Loki_AMAXDOCA':doca})
    return my_model.decision_function(frame)

but this fails with the following error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'my_model': Cannot determine Numba type of <class 'hep_ml.gradientboosting.UGradientBoostingClassifier'>

File "<ipython-input-74-78e12f83008f>", line 4:
def decision_value(d0_pt,pis_pt,chi2,doca):
    <source elided>
    frame = pd.DataFrame({'Dst_ReFit_D0_PT':d0_pt,'Pi_slow_ReFit_PT':pis_pt,'Dst_ReFit_chi2_best':chi2,'D0_Loki_AMAXDOCA':doca})
    return my_model.decision_function(frame)

It looks like only supported Numba types are allowed. Is there a way to call functions on non-numba python objects? The example in the thread I linked to wraps things inside a class. Is this necessary?

Thanks

eguiraud · January 8, 2021, 2:57pm

You can try with just the @numba.jit decorator: if it can JIT-compile your code in nopython mode, it should work with ROOT.Numba.Declare too. However, I don’t think Numba knows (or can know) how to generate low-level code for that my_model.decision_function call.
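
As a rough way to check this up front, the sketch below tries to compile a function with numba.njit and reports whether that worked. The helper compiles_in_nopython, the OpaqueModel class and both example functions are hypothetical stand-ins, not part of ROOT or Numba, and the import path numba.core.errors assumes Numba 0.49 or newer.

```python
def compiles_in_nopython(func, *example_args):
    """Return True if Numba can compile and run func in nopython mode."""
    # Imported lazily so the rest of this sketch can be loaded without Numba.
    # The numba.core.errors path assumes Numba 0.49 or newer.
    import numba
    from numba.core.errors import NumbaError

    try:
        # njit compiles on the first call, using the types of the example arguments.
        numba.njit(func)(*example_args)
    except NumbaError:
        return False
    return True


def pt_fraction(d0_pt, pis_pt):
    # Plain arithmetic on floats: Numba can infer all types here.
    return d0_pt / (d0_pt + pis_pt)


class OpaqueModel:
    """Hypothetical stand-in for a classifier object that Numba cannot type."""

    def decision_function(self, value):
        return 0.0


opaque_model = OpaqueModel()


def calls_model(d0_pt, pis_pt):
    # Refers to a global Python object, like decision_value in the post above.
    return opaque_model.decision_function(d0_pt)
```

With Numba installed, compiles_in_nopython(pt_fraction, 3.0, 1.0) should succeed, while compiles_in_nopython(calls_model, 3.0, 1.0) should fail with the same kind of typing error as shown earlier in the thread.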

eguiraud · January 8, 2021, 3:03pm

One workflow that’s available is applying the model in Python and then saving the numpy array with the classification results in a TTree using ROOT.RDF.MakeNumpyDataFrame and Snapshot.

That gives you a separate TTree that you can use together with the original TTree, as its “friend”, as if they were a single TTree. @swunsch might have further comments.
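
A minimal sketch of that workflow, assuming the classifier accepts a pandas DataFrame built from the four branches named earlier in the thread. The helper names, the branch name "score" and the friend tree name "scores" are hypothetical choices for illustration; the ROOT calls used are AsNumpy, ROOT.RDF.MakeNumpyDataFrame, Snapshot and TTree::AddFriend.

```python
import numpy as np

# Branch names taken from the decision_value example earlier in the thread.
FEATURES = [
    "Dst_ReFit_D0_PT",
    "Pi_slow_ReFit_PT",
    "Dst_ReFit_chi2_best",
    "D0_Loki_AMAXDOCA",
]


def as_float_column(scores):
    """Flatten model output into a contiguous 1D float64 array, one value per entry."""
    return np.ascontiguousarray(np.asarray(scores, dtype=np.float64).ravel())


def write_score_tree(model, in_file, tree_name, out_file, friend_name="scores"):
    """Score every entry in Python and write the result as a one-branch tree."""
    # Imported lazily: only this step needs pandas and ROOT.
    import pandas as pd
    import ROOT

    # Read the input branches for ALL entries, unfiltered, so that the new tree
    # has the same length and entry order as the original one.
    columns = ROOT.RDataFrame(tree_name, in_file).AsNumpy(columns=FEATURES)
    score = as_float_column(model.decision_function(pd.DataFrame(columns)))

    # 'score' stays referenced here until Snapshot has finished writing.
    ROOT.RDF.MakeNumpyDataFrame({"score": score}).Snapshot(friend_name, out_file)


def open_with_friend(in_file, tree_name, friend_file, friend_name="scores"):
    """Attach the score tree as a friend and build an RDataFrame on top of it."""
    import ROOT

    root_file = ROOT.TFile.Open(in_file)
    tree = root_file.Get(tree_name)
    tree.AddFriend(friend_name, friend_file)
    # Return the file and tree as well so that Python keeps them alive.
    return ROOT.RDataFrame(tree), tree, root_file
```

After write_score_tree has run once, the dataframe returned by open_with_friend should expose "score" alongside the original branches, so something like Filter("score > 0.5") can be applied in the normal C++ event loop. The friend only lines up correctly if the scores were computed for every entry in the original order, which is why no filter is applied before AsNumpy.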

jcob · January 9, 2021, 12:36pm

A workaround suggested in the thread I linked to uses Define with TPython::Eval(values[rdfentry_]) (with some string formatting). This works in principle, but because I filter some events in another frame, sometimes rdfentry_ is outside of the array length. Is there a way to “reindex” an RDataFrame, i.e. for a new dataframe make a new column going from 0 to the length of the index? I can’t think of a C++ way to do this from existing branches.

jcob · January 9, 2021, 1:14pm

Actually, as another workaround, I used a Python dictionary keyed by rdfentry_. This works but isn’t very performant, so any other suggestions would be helpful.

Thanks
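
One possible alternative to the dictionary, sketched under the assumption that the entry number of each surviving event was saved next to its prediction (for example in a column defined from rdfentry_ before filtering), is to scatter the predictions back into an array with one slot per original entry. align_to_entries is a hypothetical helper, not a ROOT function.

```python
import numpy as np


def align_to_entries(entries, values, n_total, fill=np.nan):
    """Place values[i] at index entries[i] of an array with n_total slots."""
    aligned = np.full(n_total, fill, dtype=np.float64)
    aligned[np.asarray(entries, dtype=np.int64)] = np.asarray(values, dtype=np.float64)
    return aligned


# Entries 1 and 3 survived the filter; entries 0, 2 and 4 get the fill value.
scores = align_to_entries(entries=[1, 3], values=[0.5, 0.7], n_total=5)
```

Indexing this array with the original entry number is then always in bounds, and filtered-out events can be recognised by the NaN fill value. As far as I know, rdfentry_ only matches the tree entry number in single-threaded runs, so this assumes implicit multi-threading is off.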

eguiraud · January 11, 2021, 8:17am

In general, calling Python code (via TPython or otherwise) from the C++ event loop is not going to have good performance.

If I understand the question correctly, Filter+Cache or Filter+Snapshot is what you might be looking for.
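
A minimal sketch of the Filter+Snapshot variant, with a hypothetical helper name and with the file name, tree name and cut string left as parameters. The idea is that the filtered events are written to a new tree, so a dataframe opened on that tree numbers its entries from 0 again.

```python
def snapshot_filtered(in_file, tree_name, cut, out_file):
    """Write the entries passing cut to a new file and reopen them."""
    # Imported lazily so that defining the helper does not require ROOT.
    import ROOT

    # Snapshot runs the event loop and writes only the entries that pass the cut.
    ROOT.RDataFrame(tree_name, in_file).Filter(cut).Snapshot(tree_name, out_file)

    # In the new tree the surviving entries are stored contiguously.
    return ROOT.RDataFrame(tree_name, out_file)
```

In a single-threaded run, rdfentry_ in the returned dataframe should then go from 0 to the number of surviving events minus one, which matches an array of predictions computed on the same filtered events. Cache can play the same role when the filtered data fits in memory and does not need to be written to disk.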

eguiraud · January 11, 2021, 12:09pm

What about my friend tree suggestion above?
