Difference in RDataFrame filtering methods in Python

lcorcodilos · November 22, 2019, 4:53pm

Hi all,

I’m curious about what happens in the back end when the RDataFrame event loop starts, depending on how the lazy actions are built in Python. I have two dummy examples:

Method 1:

rdf = ROOT.RDataFrame()
first = rdf.Filter("cut1")
second = first.Filter("cut2")
aFork = second.Filter("cutA")
bFork = second.Filter("cutB")

Method 2:

rdf = ROOT.RDataFrame()
aFork = rdf.Filter("cut1").Filter("cut2").Filter("cutA")
bFork = rdf.Filter("cut1").Filter("cut2").Filter("cutB")

Once the event loop is initiated, will one of these be faster or use less memory? Or does RDataFrame treat them as equivalent?

Thanks very much!
Lucas

eguiraud · November 22, 2019, 5:26pm

Hi Lucas,
The first is better: in the second case, “cut1” and “cut2” are checked twice. RDF does not check whether the same Filter has been booked twice (it could in principle do so for filter strings, but not in general for filter functions in C++).

Cheers,
Enrico
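
To illustrate the point above, here is a minimal pure-Python toy model of a lazily built filter graph. It is not the ROOT API or RDataFrame's implementation: the `Node` class, the `run` helper, and the cut functions are all hypothetical names made up for this sketch. It only assumes, as stated in the answer, that every `Filter` call books a new node and that identical filters are not deduplicated.

```python
class Node:
    """Toy stand-in for a lazily booked filter node (hypothetical, not ROOT)."""

    def __init__(self, parent=None, predicate=None, name="source"):
        self.predicate = predicate
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def Filter(self, predicate, name):
        # Each call books a brand-new node; nothing is deduplicated.
        return Node(self, predicate, name)


def run(source, events):
    """Loop once over the events and count how often each named cut is evaluated."""
    calls = {}

    def visit(node, event):
        for child in node.children:
            calls[child.name] = calls.get(child.name, 0) + 1
            if child.predicate(event):
                visit(child, event)

    for event in events:
        visit(source, event)
    return calls


# Hypothetical cuts on integer "events".
def cut1(x): return x % 2 == 0
def cut2(x): return x % 3 == 0
def cutA(x): return x < 50
def cutB(x): return x >= 50

events = range(100)

# Method 1: both forks hang off the same cut1 -> cut2 chain.
src_shared = Node()
second = src_shared.Filter(cut1, "cut1").Filter(cut2, "cut2")
second.Filter(cutA, "cutA")
second.Filter(cutB, "cutB")
calls_shared = run(src_shared, events)

# Method 2: each fork books its own copy of cut1 and cut2.
src_dup = Node()
src_dup.Filter(cut1, "cut1").Filter(cut2, "cut2").Filter(cutA, "cutA")
src_dup.Filter(cut1, "cut1").Filter(cut2, "cut2").Filter(cutB, "cutB")
calls_dup = run(src_dup, events)

print("shared chain:    ", calls_shared)
print("duplicated chain:", calls_dup)
```

In this toy model the duplicated chain evaluates cut1 and cut2 twice as often as the shared chain, while cutA and cutB are evaluated the same number of times in both.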

lcorcodilos · November 22, 2019, 5:45pm

Hi Enrico,

Thanks very much for the quick response! I have a follow-up question. If I rewrote the first as

rdf = ROOT.RDataFrame()
first = rdf.Filter("cut1").Filter("cut2")
aFork = first.Filter("cutA")
bFork = first.Filter("cutB")

would this consume any less memory or CPU time, since I’ve only kept a variable for first and dropped second?

Thanks again,
Lucas

eguiraud · November 22, 2019, 5:56pm

Not really: the underlying computation graph would be the exact same.
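
As a rough illustration of why the two spellings are equivalent, the toy sketch below (again plain Python with a hypothetical `Node` class and `shape` helper, not the ROOT API) builds the graph once with an intermediate variable per filter and once with chained calls, then compares the resulting structures. The assumption is simply that each `Filter` call creates one node regardless of whether the result is stored in a named variable.

```python
class Node:
    """Toy graph node (hypothetical); only records its label and children."""

    def __init__(self, parent=None, label="source"):
        self.label = label
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def Filter(self, label):
        # One new node per call, whether or not the caller names the result.
        return Node(self, label)


def shape(node):
    """Describe the graph below `node` as nested tuples of labels."""
    return (node.label, tuple(shape(child) for child in node.children))


# One variable per filter.
src_named = Node()
first = src_named.Filter("cut1")
second = first.Filter("cut2")
second.Filter("cutA")
second.Filter("cutB")

# cut1 and cut2 chained, only the end of the chain is named.
src_chained = Node()
first = src_chained.Filter("cut1").Filter("cut2")
first.Filter("cutA")
first.Filter("cutB")

same = shape(src_named) == shape(src_chained)
print(same)
```

Both versions produce a source node followed by cut1, then cut2, which forks into cutA and cutB; the Python variable names do not appear in the graph at all.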

lcorcodilos · December 3, 2019, 5:29pm

Great! Thanks very much!
