OnPartialResult and progress bar with PyROOT

Hi everyone,

Every once in a while, I get annoyed with having no idea how things are progressing with my RDF code (built primarily in PyROOT, with more complicated C++ functions declared in a header/src file).

Over the last few months I’ve made several failed attempts at replicating progress bars, or updating histograms using OnPartialResult from the PyROOT side. A few attempts made with callables, C++ lambdas, print statements inside JIT’d code looking at rdfentry_, etc. I’m not familiar enough with crossing the C++/python boundaries to have succeeded with them, yet. The only thing I’ve thought of trying but haven’t so far is creating a C++ function to do everything (book a histogram, register OnPartialResult, etc.), passing to it the initialized dataframe from python and getting an RNode back, to continue with the rest of the code. But this all feels very third-class for the PyROOT interface to RDF.

Is it possible to do this in a way that isn’t very messy? It would be nice if it were simply a standard method, as straightforward as booking new columns, but given it isn’t yet, how can one go about this?

Thanks for any pointers,
Nick

Hi,
I think some options for this were discussed in RResultPtr::OnPartialResult function in pyROOT (I added it to my reading list but didn’t get around trying it).
Cheers,
Pieter

1 Like

Hi,
I am afraid OnPartialResult does not play well with python (yet). Feel free to open a jira ticket with a request for improvement.

In the meanwhile, you need a C++ helper. For example, a better version of:

import ROOT

ROOT.gInterpreter.Declare("""
    ROOT::RDF::RResultPtr<ULong64_t> AddProgressBar(ROOT::RDF::RNode df) {
        auto c = df.Count();
        c.OnPartialResult(/*every=*/10, [] (ULong64_t e) { std::cout << e << std::endl; });
        return c;
    }
""")

df = ROOT.RDataFrame(1000)
count = ROOT.AddProgressBar(ROOT.RDF.AsRNode(df))
count.GetValue() # run event loop (and PrintProgress every 10 entries)

Remember that in multi-thread applications, OnPartialResult only executes the callback on one of the running threads. OnPartialResultSlot is executed for each thread/processing slot, but then the callback requires some synchronization mechanism so the threads do not write on each other’s streams.

Cheers,
Enrico

Thanks a lot, Enrico! I’ll see about inserting a mutex and such for MT, but just a configurable printout is nice. If I make something noticeably nicer, I’ll share it here.

In case it’s of interest to others, an implementation of a progress bar based on the dataframe tutorials + Enrico’s help above:

ROOT.gInterpreter.Declare("""
    const UInt_t barWidth = 60;
    ULong64_t processed = 0, totalEvents = 0;
    std::string progressBar;
    std::mutex barMutex;
    auto registerEvents = [](ULong64_t nIncrement) {totalEvents += nIncrement;};

    ROOT::RDF::RResultPtr<ULong64_t> AddProgressBar(ROOT::RDF::RNode df, int everyN=10000, int totalN=100000) {
        registerEvents(totalN);
        auto c = df.Count();
        c.OnPartialResultSlot(everyN, [everyN] (unsigned int slot, ULong64_t &cnt){
            std::lock_guard<std::mutex> l(barMutex);
            processed += everyN; //everyN captured by value for this lambda
            progressBar = "[";
            for(UInt_t i = 0; i < static_cast<UInt_t>(static_cast<Float_t>(processed)/totalEvents*barWidth); ++i){
                progressBar.push_back('|');
            }
            // escape the '\' when defined in python string
            std::cout << "\\r" << std::left << std::setw(barWidth) << progressBar << "] " << processed << "/" << totalEvents << std::flush;
        });
        return c;
    }
""")

Used as follows, if rdf_i, rdf_i_events are the RDataFrame and total events for each respective sample being processed. The 5000 is used to make the counter printout increment more frequently than the bar would (being less than terminal width)

counts_1 = ROOT.AddProgressBar(ROOT.RDF.AsRNode(rdf_1, max(100, int(rdf_1_events/5000)), int(rdf_1_events))
counts_2 = ROOT.AddProgressBar(ROOT.RDF.AsRNode(rdf_2, max(100, int(rdf_2_events/5000)), int(rdf_2_events))

The output will have a progress bar as well as a counter of processed/total(inclusive of all samples), the latter being nice if there’s a lot of events being processed.