TF1.Eval() as a function in RDataFrame

toicca · July 7, 2022, 10:04am

Hi!

I’m currently using PyROOT to fit a function to a histogram. After fitting, I would like to use the resulting TF1 to evaluate the values in an RDataFrame column, for example:

dataframe = ROOT.RDataFrame("myTree", "myFile.root")
histogram = dataframe.Histo1D(("myHist", "myHist", 100, 0, 100), "x").GetValue()
histogram.Fit("myFunction")
function = histogram.GetFunction("myFunction")
dataframe1 = dataframe.Define("z", "function.Eval(y)") # This is what I'd like to do

How would you recommend passing the function to the dataframe from python? The best thing I can think of is to make the histogram and the fit in C++ and to use the ROOT.gInterpreter.Declare() to make the fit usable by the dataframe. It’s doable, but then it would probably be better to just switch to C++ to save me from the trouble of passing the dataframe loaded in python to C++.

Thanks

ROOT Version: 6.27.01
Platform: SWAN
Compiler: gcc11

bellenot · July 7, 2022, 10:12am

Welcome to the ROOT Forum!
I think @eguiraud or @etejedor can help you with this

eguiraud · July 7, 2022, 10:17am

Hi @toicca ,

we need to let the C++ side know about the Python function variable. Since the Python function variable is, in fact, just a wrapper for a C++ object, this is actually fairly easy to do.

I think we should have something like this in PyROOT itself, but it’s not there at the moment. For now you can use this implementation by copy-pasting it:

import ROOT

def DeclareToCpp(**kwargs):
    for k, v in kwargs.items():
        ROOT.gInterpreter.Declare(f"namespace PyVars {{ auto &{k} = *reinterpret_cast<{type(v).__cpp_name__}*>({ROOT.addressof(v)}); }}")

h = ROOT.TH1D("h", "h", 100, -1, 1)
h.FillRandom("gaus", 42)
DeclareToCpp(h=h)
ROOT.RDataFrame(3).Define("z", "PyVars::h.GetEntries()").Display("z").Print()

Cheers,
Enrico

P.S.
Of course, you have to make sure that function does not go out of scope for as long as it might be needed for RDF computations.
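Applied to the fit from the original question, the same trick could look roughly like the sketch below. The helper names cpp_reference_declaration and declare_fit_and_define are made up for illustration, and the fit name "myFunction" is taken from the question. The string building is split out so it can be inspected without ROOT; the second function needs a working PyROOT installation.

```python
def cpp_reference_declaration(name, cpp_type, address, namespace="PyVars"):
    """Build the C++ line that binds `name` to the object living at `address`."""
    return (
        f"namespace {namespace} {{ "
        f"auto &{name} = *reinterpret_cast<{cpp_type}*>({address}); }}"
    )


def declare_fit_and_define(dataframe, histogram, fit_name="myFunction"):
    """Fit `histogram`, expose the fitted TF1 as PyVars::function, define column z.

    Hypothetical helper, not part of PyROOT. The caller must keep `histogram`
    alive (it holds the fitted function) for as long as the returned
    dataframe may run.
    """
    import ROOT  # imported here so the helper above works without ROOT

    histogram.Fit(fit_name)
    function = histogram.GetFunction(fit_name)
    ROOT.gInterpreter.Declare(
        cpp_reference_declaration(
            "function", type(function).__cpp_name__, ROOT.addressof(function)
        )
    )
    return dataframe.Define("z", "PyVars::function.Eval(y)")
```

Since the declaration can only be made once per name in the interpreter, calling this twice with the same variable name would presumably fail; a different name per fit avoids that.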

toicca · July 8, 2022, 1:33pm

Thanks for the help @eguiraud,

I’d like to ask a follow-up question (or two). How should I implement this with a dictionary? That is, I have code such as

# fits is a dictionary with floats as keys and no values filled in yet
# for example {1.0: None, 2.0: None, ... 300.0: None}
for idx, func in enumerate(fits):
    # MC and DT are TH2D
    h1 = MC.ProjectionY(f"MC{idx}", idx, idx)
    h2 = DT.ProjectionY(f"DT{idx}", idx, idx)
    h1.Divide(h2)
    h1.Fit("chebyshev4", "S")
    fits[func] = h1.GetFunction("chebyshev4")

and now I would like to declare the variables of the dictionary to C++. How would this be done? The end goal is to use a different function based on the value in a column of the RDF.

And then to go a bit deeper, I would also like to use Spark with the RDFs, so the variables need to be spread among the workers. I’ve previously used ROOT.RDF.Experimental.Distributed.initialize to use custom C++ functions, for example (you’ve probably seen/made this)

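initialize = ROOT.RDF.Experimental.Distributed.initialize

# Named differently from `initialize` so it does not shadow it
def my_initialization():
    # Triple quotes are needed for a multi-line string
    ROOT.gInterpreter.Declare("""
        #ifndef MYFUN
        #define MYFUN
        int myfun(){ return 42; }
        #endif""")

initialize(my_initialization)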

But how would I spread the variables instead of the functions?

All in all I would like to do something like (in a simple case with two functions):

# dictionary with two functions
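# Fit() returns a fit result, so fetch the fitted TF1 and clone it
# before the next fit replaces it in the histogram
myHist.Fit("pol2")
fits[1.0] = myHist.GetFunction("pol2").Clone("fit1")
myHist.Fit("pol4")
fits[2.0] = myHist.GetFunction("pol4").Clone("fit2")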

# Do something so that the Spark workers "know" about the fits
# ...

# Add a column to a ROOT.RDF.Experimental.Distributed.Spark.RDataFrame using the fitted functions
# If the value of column x is less than 1, use the first function, else use the second function
df1 = df.Define("y", "x < 1.0 ? fits[1.0].Eval(y) : fits[2.0].Eval(y)")

The last line naturally wouldn’t work, as I’m using Python syntax and a dictionary inside a C++ expression, which brings me back to my first question. Is this possible to do somehow?

Now, I’ve used floats as keys in all the examples, but it doesn’t have to be like that if that makes it particularly difficult.

Thanks for your time

eguiraud · July 11, 2022, 1:11pm

Hi @toicca ,

as you need to inject some non-trivial logic in your Define that makes use of several C++ types, I think your best bet is to write a C++ function that does what you need. You can put it in its own C++ header and then load it from disk in the initialize function.
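A rough sketch of what such a C++ function could look like, generated from Python. This assumes the fits are plain polynomials whose coefficients have already been read out (for example with GetParameter(i)); the names horner, make_fit_header and eval_fit are invented for this illustration, and a Chebyshev fit would need a different expression.

```python
def horner(coeffs, var="y"):
    """Return a C++ expression evaluating p0 + p1*var + p2*var^2 + ..."""
    expr = repr(float(coeffs[-1]))
    for c in reversed(coeffs[:-1]):
        expr = f"{float(c)!r} + {var} * ({expr})"
    return expr


def make_fit_header(fits, guard="MYFITS_H"):
    """Return C++ source defining `double eval_fit(double x, double y)`.

    `fits` maps an upper edge in x to polynomial coefficients (p0, p1, ...).
    The entry with the largest edge is also used for x above every edge.
    """
    if not fits:
        raise ValueError("need at least one fit")
    edges = sorted(fits)
    body = [
        f"    if (x < {float(edge)!r}) return {horner(fits[edge])};"
        for edge in edges[:-1]
    ]
    body.append(f"    return {horner(fits[edges[-1]])};")
    return "\n".join(
        [
            f"#ifndef {guard}",
            f"#define {guard}",
            "inline double eval_fit(double x, double y) {",
            *body,
            "}",
            "#endif",
            "",
        ]
    )


# Possible usage (untested sketch, needs ROOT and a distributed RDF `df`):
#   with open("myfits.h", "w") as f:
#       f.write(make_fit_header({1.0: [1, 2], 2.0: [0.5, 0, 3]}))
#   ...load myfits.h in the initialize function, then:
#   df1 = df.Define("z", "eval_fit(x, y)")
```

Baking the fitted numbers into the source means no Python object has to be shared with the workers, only text; the header still has to be readable from wherever the initialize function runs.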

By the way, note that TF1::Eval is probably not thread-safe (@moneta can correct me if I’m wrong), so using it in a multi-threaded RDF event loop might produce surprising results. But if each Spark worker runs a single-threaded RDF event loop, things are fine.

Cheers,
Enrico

moneta · July 11, 2022, 1:27pm

Hi,

If you are just calling TF1::Eval, passing the observable values (x) and not the parameters, it should be thread-safe. If you are calling TF1::EvalPar with different parameter values, then it is not thread-safe, because the parameter values can be cached within the TF1 class.
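The difference can be pictured with a toy model in plain Python. ToyFunction below is purely illustrative and is not how TF1 is implemented; it only mimics the idea that an Eval-style call reads stored parameters while an EvalPar-style call first writes them into the shared object. The interleaving of the two callers is written out by hand so the outcome is deterministic.

```python
class ToyFunction:
    """Toy stand-in for a fitted function; NOT TF1's actual implementation."""

    def __init__(self, params):
        self.params = list(params)  # state shared by every caller

    def eval(self, x):
        # Only reads the stored parameters.
        return sum(p * x**i for i, p in enumerate(self.params))

    def cache_params(self, params):
        # First half of an EvalPar-style call: overwrite the shared state.
        self.params = list(params)


f = ToyFunction([1.0, 2.0])  # f(x) = 1 + 2x

# Callers that only read get the same answer in any order.
reads = [f.eval(3.0), f.eval(3.0)]

# Two EvalPar-style callers A and B, interleaved by hand:
f.cache_params([0.0, 1.0])  # A wants f(x) = x
f.cache_params([5.0, 0.0])  # B wants f(x) = 5 and overwrites A's parameters
result_a = f.eval(3.0)      # A expected 3.0 but evaluates B's function
result_b = f.eval(3.0)
```

Here caller A ends up with B's value, which is the kind of surprise a shared parameter cache can produce when calls overlap.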

Cheers

Lorenzo