Difference in RDataFrame filtering methods in Python

Hi all,

I’m curious about what happens in the back end when the RDataFrame event loop starts, depending on how the lazy actions are built in Python. I have two dummy examples:

Method 1:

import ROOT
rdf = ROOT.RDataFrame()  # dummy: constructor arguments omitted
first = rdf.Filter("cut1")
second = first.Filter("cut2")
aFork = second.Filter("cutA")
bFork = second.Filter("cutB")

Method 2:

rdf = ROOT.RDataFrame()
aFork = rdf.Filter("cut1").Filter("cut2").Filter("cutA")
bFork = rdf.Filter("cut1").Filter("cut2").Filter("cutB")

Once the loop is initiated, will one of these be faster or use less memory? Or does the class consider these to be equivalent?

Thanks very much!
Lucas

Hi Lucas,
The first is better: in the second case, “cut1” and “cut2” are evaluated twice, once per chain. RDF does not check whether the same Filter has been booked twice (it could in principle do so for filter strings, but not in general for C++ filter functions).

Cheers,
Enrico
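
To make the difference concrete, here is a minimal pure-Python sketch of a lazy filter graph. It is a toy model, not ROOT's implementation: the names Node, passes, count_evaluations, method1 and method2 are hypothetical, the cuts are arbitrary arithmetic tests on the entry number, and the per-entry caching of a node's result is an assumption made so that a shared node is evaluated only once per entry.

```python
from collections import Counter


class Node:
    """Toy stand-in for a node in a lazy filter graph (not ROOT's real class)."""

    def __init__(self, predicate=None, parent=None):
        self.predicate = predicate  # None for the data-source (root) node
        self.parent = parent
        self._last_entry = None     # entry index of the cached result
        self._last_result = None

    def Filter(self, predicate):
        # Each call books a brand-new node; equal predicates are never merged.
        return Node(predicate, parent=self)

    def passes(self, entry):
        # Reuse the cached result if this node already handled this entry.
        if self._last_entry == entry:
            return self._last_result
        ok = True if self.parent is None else self.parent.passes(entry)
        if ok and self.predicate is not None:
            ok = self.predicate(entry)
        self._last_entry, self._last_result = entry, ok
        return ok


def count_evaluations(build, n_entries=100):
    """Run a toy event loop and count how often each cut is evaluated."""
    counts = Counter()

    def make_cut(name, fn):
        def cut(entry):
            counts[name] += 1
            return fn(entry)
        return cut

    cuts = {
        "cut1": make_cut("cut1", lambda e: e % 2 == 0),
        "cut2": make_cut("cut2", lambda e: e % 3 == 0),
        "cutA": make_cut("cutA", lambda e: e % 5 == 0),
        "cutB": make_cut("cutB", lambda e: e % 7 == 0),
    }
    leaves = build(Node(), cuts)
    for entry in range(n_entries):
        for leaf in leaves:
            leaf.passes(entry)
    return counts


def method1(rdf, c):
    # Shared prefix: cut1 and cut2 are booked once and then forked.
    first = rdf.Filter(c["cut1"])
    second = first.Filter(c["cut2"])
    return [second.Filter(c["cutA"]), second.Filter(c["cutB"])]


def method2(rdf, c):
    # Two independent chains: cut1 and cut2 are booked twice.
    return [
        rdf.Filter(c["cut1"]).Filter(c["cut2"]).Filter(c["cutA"]),
        rdf.Filter(c["cut1"]).Filter(c["cut2"]).Filter(c["cutB"]),
    ]


shared = count_evaluations(method1)
duplicated = count_evaluations(method2)
print("method 1:", dict(shared))
print("method 2:", dict(duplicated))
```

In this toy model the second method evaluates cut1 and cut2 twice as often as the first, while cutA and cutB are evaluated equally often in both.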

Hi Enrico,

Thanks very much for the quick response! I have a follow-up question. If I rewrote the first method as

rdf = ROOT.RDataFrame()
first = rdf.Filter("cut1").Filter("cut2")
aFork = first.Filter("cutA")
bFork = first.Filter("cutB")

would this consume any less memory or CPU time, given that I’ve only booked first and dropped second?

Thanks again,
Lucas

Not really: the underlying computation graph would be exactly the same.
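
One way to picture this is with a small pure-Python sketch. It is a toy model, not ROOT code: Node, shape, named and chained are hypothetical names, and the only assumption is that each Filter call adds exactly one child node to the node it is called on. Whether an intermediate node is bound to a Python variable is then irrelevant to the graph that gets built.

```python
class Node:
    """Toy graph node: records only a label, a parent and its children."""

    def __init__(self, label, parent=None):
        self.label = label
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def Filter(self, label):
        # Booking a filter adds one child node, named or not.
        return Node(label, parent=self)


def shape(node):
    """Return the graph below `node` as nested tuples, for comparison."""
    return (node.label, tuple(shape(child) for child in node.children))


def named():
    # Every intermediate node is bound to a variable.
    root = Node("source")
    first = root.Filter("cut1")
    second = first.Filter("cut2")
    second.Filter("cutA")
    second.Filter("cutB")
    return root


def chained():
    # The cut1 node is never bound to a name, but it still exists.
    root = Node("source")
    first = root.Filter("cut1").Filter("cut2")
    first.Filter("cutA")
    first.Filter("cutB")
    return root


same_graph = shape(named()) == shape(chained())
print(same_graph)
```

Both functions build one cut1 node, one cut2 node and two forks, so the comparison holds in this model.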

Great! Thanks very much!
