Multithreaded Range in RDataFrame

FoxWise · October 8, 2021, 2:24pm

import ROOT
# ROOT.EnableImplicitMT()
df1 = df1.Range(42)
df2 = df2.Filter(" rdfentry_ < 42")

The Range version doesn’t work with multi-threading.
The Filter version works with multi-threading and does the same thing…

Why would one need Range at all, then?
Or
why wouldn’t Range be implemented internally as a Filter to provide multi-threading?

eguiraud · October 8, 2021, 2:42pm

Hi @FoxWise ,

Range makes RDataFrame quit the event loop early when all ranges are exhausted, even if you have different ranges in different branches of the computation graph and even if you have several Ranges in series. Being more generic, Filter can never know whether it’s possible to quit the event loop early.
Also, you can put a Range after a Filter to only process the first N events that pass the filter, which you can’t do by checking rdfentry_.
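
To make the two points above concrete, here is a toy model in plain Python (no ROOT involved, and not how RDataFrame is implemented): a generator plays the role of the event loop, `itertools.islice` plays the role of Range, and the `passes` selection is a made-up example cut.

```python
from itertools import islice

def entries(n_total, read_log):
    """Yield entry numbers one by one, logging every entry that is actually read."""
    for entry in range(n_total):
        read_log.append(entry)
        yield entry

def passes(entry):
    # Hypothetical selection: keep every third entry.
    return entry % 3 == 0

# Analogue of Filter(...).Range(4): take the first 4 entries that pass, then stop.
read_by_range = []
range_after_filter = list(islice(filter(passes, entries(1000, read_by_range)), 4))

# Analogue of Filter(...).Filter("rdfentry_ < 4"): cut on the entry number itself.
read_by_cut = []
cut_on_entry_number = [e for e in entries(1000, read_by_cut) if passes(e) and e < 4]

print(range_after_filter)                     # [0, 3, 6, 9]
print(cut_on_entry_number)                    # [0, 3]
print(len(read_by_range), len(read_by_cut))   # 10 1000
```

The Range-like version returns four passing entries and stops after reading 10 entries, while the entry-number cut returns only two entries and still has to look at all 1000.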

Unfortunately, these two features of Range also mean that it is not possible to implement a multi-thread Range that does not require some form of synchronization between the threads, and we want to avoid synchronization during the event loop at all costs in order to scale well to a large number of CPUs in most scenarios.
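
As a rough illustration of why synchronization is needed, here is a toy model with plain Python threads (again, not RDataFrame's actual scheduling; the function `run` and the chunking are made up for this sketch). Each thread processes its own chunk of entries and a "range" of 42 is applied either with a per-thread counter or with one counter shared behind a lock.

```python
import threading

def run(chunks, n_max, shared_counter):
    """Process each chunk in its own thread, accepting entries until n_max is reached."""
    lock = threading.Lock()
    taken = [0]                      # global count, only used if shared_counter is True
    results = [[] for _ in chunks]   # one output list per thread, never shared

    def work(i, chunk):
        local_taken = 0
        for entry in chunk:
            if shared_counter:
                with lock:           # every accepted entry has to synchronize here
                    if taken[0] >= n_max:
                        return       # range exhausted, this thread can quit early
                    taken[0] += 1
            else:
                if local_taken >= n_max:
                    return           # only this thread's private range is exhausted
                local_taken += 1
            results[i].append(entry)

    threads = [threading.Thread(target=work, args=(i, c)) for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(len(r) for r in results)

chunks = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]
n_local = run(chunks, 42, shared_counter=False)
n_shared = run(chunks, 42, shared_counter=True)
print(n_local, n_shared)   # 168 42
```

Without a shared counter each thread takes its own 42 entries (168 in total), so the only way to get exactly 42 is to make the threads talk to each other on every entry. Even then, which 42 entries are accepted in this toy depends on thread timing.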

Currently, the way to get a multi-thread RDF run that early-quits after a certain number of entries is by attaching a TEntryList to the TTree/TChain before passing the TTree/TChain object to RDF’s constructor.
We want to allow users to get the same results with fewer characters typed in the future, see https://github.com/root-project/root/issues/7702 .

Cheers,
Enrico

RENATO_QUAGLIANI · October 8, 2021, 3:17pm

Don’t get me wrong, but isn’t Range meant as a debugging feature rather than something to use for any meaningful work?
I usually use Range to test RDF routines and histogramming pipelines I have, etc., so I can run in no time what I used to do with Draw("","","",1000) for fast checks.
I.e., you gain nothing from going MT if you run over only 1000 events.

Maybe this pragmatically answers the question “Why would one need Range at all then?”

eguiraud · October 8, 2021, 3:29pm

Yes, this was the original idea behind the feature.

FoxWise · October 8, 2021, 4:14pm

I also use Range primarily for testing, but I encountered another use case:

I have many TTrees with almost 10 000 000 events each; because some jobs failed, the statistics differ slightly between them. I wanted to cut the extra events in some TTrees down to the smallest number of events among them, so there are no leftovers…
