RDataFrame speed vs TTree loop for histogram creation

Dear experts,

I would like to ask for your feedback on how to write efficient code with RDataFrame: so far, on a simple example, the performance I get is O(10)x slower than code based on a plain TTree loop that fills the same histograms.

My use case is, I believe, rather standard: plotting the distributions of several branches of a TTree under a few selections (for example, control plots in different analysis regions).
Ideally this should be flexible, with variables, weights and selections easily configurable (e.g. from a cfg file), so RDataFrame looks very suitable for this, and I can use the Python interface.

To create a “toy example”, I build an artificial dataset containing a TTree of 100k entries and 100 branches (x_0, …, x_99). The code to reproduce it is below [1].
I use these data to fill a histogram of each variable, with a weight given by x_0 * x_1, under 3 selections:

  • all events
  • x_99 < 0
  • x_99 < 0 && x_98 < 0 && x_97 < 0

A minimal Python implementation of this with an RDataFrame is below [2]; on my machine it takes approximately 33 s to run. I compared it against an equivalent compiled C++ version, and the performance is roughly the same (as expected, since the same backend code is called).

If I implement the same functionality with a loop over the TTree as in example [3], however, the code takes only 3 to 4 seconds to run, so it’s about 10 times faster.

Is this difference in the performance expected, or is there something that is slowing down my RDataFrame-based code that can be fixed?
Of course multithreading can help, although it looks like I may need to recompile ROOT with MT support enabled, as I am using the precompiled version and enabling MT has no effect.
In any case, I would need to run this code in parallel over different datasets, so a “trivial” parallelisation of one execution per dataset in a single thread may be the best approach anyway.
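The kind of “trivial” parallelisation I have in mind is sketched below, stdlib only. `process_dataset` is a hypothetical stand-in for the real per-dataset job (open the file, fill the histograms, write them out); here it just does dummy work. A thread pool is used to keep the sketch self-contained, but for real CPU-bound ROOT jobs you would use `ProcessPoolExecutor` or separate batch jobs instead:

```python
# Sketch: run the same single-threaded histogramming job once per dataset,
# in parallel, and collect the results.
from concurrent.futures import ThreadPoolExecutor  # ProcessPoolExecutor for real CPU-bound jobs

def process_dataset(name):
    # Hypothetical stand-in for the real job; dummy work:
    # pretend we processed len(name) * 1000 entries.
    return name, len(name) * 1000

datasets = ["dataset_a.root", "dataset_b.root", "dataset_c.root"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(process_dataset, datasets))
print(results)
```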

Thanks in advance for your feedback!
Cheers,
Luca

[1]

// root -l generate_dataset.C

void generate_dataset()
{
    TFile* fOut = new TFile("dataset.root", "recreate");
    // TFile* fOut = new TFile("dataset_100k.root", "recreate");
    TTree* tOut = new TTree("tree", "tree");

    TRandom3* rndm = new TRandom3(0);

    const uint nEntries = 10000000; // number of entries to generate
    // const uint nEntries = 100000; // number of entries to generate

    const int nbranches = 100;
    std::vector<double> vals(nbranches); // for setbranchaddress

    cout << "... will generate " << nbranches << " branches" << endl;
    cout << "... will generate " << nEntries  << " entries" << endl;

    for (uint ib = 0; ib < nbranches; ++ib){
        tOut->Branch(Form("x_%i", ib), &vals.at(ib));
    }

    for (uint ie = 0; ie < nEntries; ++ie){
        if (ie % 100000 == 0) cout << ie << " / " << nEntries << endl;
        for (uint ib = 0; ib < nbranches; ++ib){
            vals.at(ib) = rndm->Gaus(0, 1);
        }
        tOut->Fill();
    }

    tOut->Write();
    fOut->Close();
}

[2]

# python -u rootDataFrame_plots.py

import ROOT
ROOT.gROOT.SetBatch(True)

# ROOT.EnableImplicitMT(4)

# dtset_name = 'dataset.root'
dtset_name = 'dataset_100k.root'

print('... opening file', dtset_name)

d = ROOT.RDataFrame("tree", dtset_name)

print('... defining weights')

d = d.Define('col_w', 'x_0 * x_1')

print('... applying selections')

cut1 = "(x_99 < 0)"
cut2 = "(x_99 < 0 && x_98 < 0 && x_97 < 0)"

d_incl = d
d_cut1 = d.Filter(cut1)
d_cut2 = d.Filter(cut2)

print('... creating branchlist')

branchlist = ['x_{}'.format(i) for i in range(100)]

print('... making plots')

c1 = ROOT.TCanvas('c1', 'c1')

for ib in range(len(branchlist)):
    print('... doing', branchlist[ib])
    h_incl = d_incl.Histo1D( ("h_{}_incl".format(ib), "h_{}_incl".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w')
    h_cut1 = d_cut1.Histo1D( ("h_{}_cut1".format(ib), "h_{}_cut1".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w')
    h_cut2 = d_cut2.Histo1D( ("h_{}_cut2".format(ib), "h_{}_cut2".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w')
    h_incl.Draw('hist')
    c1.Print('rdfplots/h_%i_incl.pdf' % ib)
    h_cut1.Draw('hist')
    c1.Print('rdfplots/h_%i_cut1.pdf' % ib)
    h_cut2.Draw('hist')
    c1.Print('rdfplots/h_%i_cut2.pdf' % ib)

[3]

// c++ -lm -o rootTreeLoop_plots rootTreeLoop_plots.cpp `root-config --glibs --cflags`

#include "TString.h"
#include "TH1D.h"
#include "TTree.h"
#include "TFile.h"
#include "TTreeFormula.h"
#include "TCanvas.h"
#include <iostream>
#include <vector>

using namespace std;

int main() {
    TString file_in = "dataset_100k.root";
    // TString file_in = "dataset.root";

    cout << "... opening file: " << file_in << endl;
    
    TFile* fIn = TFile::Open(file_in);
    TTree* tIn = (TTree*) fIn->Get("tree");

    cout << "Making branch list" << endl;
    std::vector<TString> branchList;
    for (uint i = 0; i < 100; ++i)
        branchList.push_back(Form("x_%i", i));

    cout << "Setting branch address" << endl;
    std::vector<double> vals(branchList.size());
    for (uint ib = 0; ib < branchList.size(); ++ib){
        tIn->SetBranchAddress(Form("x_%i", ib), &vals.at(ib));
    }

    cout << "Making histograms" << endl;
    std::vector<TH1D*> histos_incl;
    std::vector<TH1D*> histos_cut1;
    std::vector<TH1D*> histos_cut2;

    for (uint ib = 0; ib < branchList.size(); ++ib){

        TH1D* h_incl = new TH1D(Form("h_%s_incl", branchList.at(ib).Data()), Form("h_%s_incl", branchList.at(ib).Data()), 100, -3, 3);
        TH1D* h_cut1 = new TH1D(Form("h_%s_cut1", branchList.at(ib).Data()), Form("h_%s_cut1", branchList.at(ib).Data()), 100, -3, 3);
        TH1D* h_cut2 = new TH1D(Form("h_%s_cut2", branchList.at(ib).Data()), Form("h_%s_cut2", branchList.at(ib).Data()), 100, -3, 3);

        histos_incl.push_back(h_incl);
        histos_cut1.push_back(h_cut1);
        histos_cut2.push_back(h_cut2);
    }

    cout << "Making formulas" << endl;

    TTreeFormula* TTF_cut1 = new TTreeFormula ("TTF_cut1", "(x_99 < 0)", tIn);
    TTreeFormula* TTF_cut2 = new TTreeFormula ("TTF_cut2", "(x_99 < 0 && x_98 < 0 && x_97 < 0)", tIn);

    cout << "Making plots" << endl;

    const uint nEvts = tIn->GetEntries();
    for (uint iEv = 0; iEv < nEvts; ++iEv) {
        if (iEv % 10000 == 0)
            cout << iEv << " / " << nEvts << endl;

        tIn->GetEntry(iEv);

        bool pass_cut1 = TTF_cut1->EvalInstance();
        bool pass_cut2 = TTF_cut2->EvalInstance();

        for (uint ib = 0; ib < branchList.size(); ++ib){
            histos_incl.at(ib)->Fill(vals.at(ib), vals.at(0)*vals.at(1));
            if(pass_cut1) histos_cut1.at(ib)->Fill(vals.at(ib), vals.at(0)*vals.at(1));
            if(pass_cut2) histos_cut2.at(ib)->Fill(vals.at(ib), vals.at(0)*vals.at(1));
        }
    }

    cout << "Saving pdf of plots" << endl;
    TCanvas* c1 = new TCanvas("c1", "c1");
    for (uint ib = 0; ib < branchList.size(); ++ib){
        histos_incl.at(ib)->Draw("hist");
        c1->Print(Form("treeloopplots/h_%s_incl.pdf", branchList.at(ib).Data()));
        histos_cut1.at(ib)->Draw("hist");
        c1->Print(Form("treeloopplots/h_%s_cut1.pdf", branchList.at(ib).Data()));
        histos_cut2.at(ib)->Draw("hist");
        c1->Print(Form("treeloopplots/h_%s_cut2.pdf", branchList.at(ib).Data()));
    }
}


ROOT Version: 6.22/06
Platform: macOS, 10.15.7
Compiler: Apple clang version 12.0.0


You’ll have a difference because your TTree loop is compiled C++ and your RDataFrame code is interpreted Python.
Your RDataFrame code also runs over your data once for each histogram, while your TTree code only does this once. To run over the data only once with RDataFrame, you would first book all the actions to create the RResultPtr proxy objects, then loop over those to save the histograms (the event loop is triggered when you access the first result).
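The booking pattern can be illustrated with a stdlib-only toy (no ROOT; `ToyFrame` and `ToyResult` are made-up names): all results are registered first, and a single pass over the data happens only when the first result is accessed, just like `RResultPtr`:

```python
# Toy illustration of RDataFrame-style lazy booking (not real ROOT code).
class ToyResult:
    def __init__(self, frame, action):
        self._frame, self._action, self.value = frame, action, None

    def get(self):
        self._frame.run_if_needed()   # first access triggers the one event loop
        return self.value

class ToyFrame:
    def __init__(self, data):
        self._data, self._booked, self._ran = data, [], False

    def book(self, action):
        res = ToyResult(self, action)
        self._booked.append(res)
        return res                    # a lazy proxy, like RResultPtr

    def run_if_needed(self):
        if self._ran:
            return
        for res in self._booked:      # all accumulators start empty
            res.value = 0
        for x in self._data:          # ONE loop fills every booked result
            for res in self._booked:
                res.value += res._action(x)
        self._ran = True

frame = ToyFrame(range(10))
s = frame.book(lambda x: x)           # "sum" action
n = frame.book(lambda x: 1)           # "count" action
print(s.get(), n.get())               # one pass over the data computed both
```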

Hi,
to clarify: the event loop runs in C++ in both examples. The difference is that in one case it’s C++ compiled with optimizations, and in the other it’s C++ just-in-time compiled without optimizations.

@beojan correctly points out that when many histograms are produced, even the Python RDF version would easily be faster than the C++ version because it can produce all histograms reading the data once rather than once per histogram.

When a single histogram is produced, single-thread C++ RDF might have a small performance penalty w.r.t. C++ “raw TTree” code, but that penalty is offset by multi-threading and ease of use/generality.

I’ll post a fast C++ RDF version soon.

Cheers,
Enrico

EDIT: we are currently working on producing optimized just-in-time compiled code for RDF, which should greatly reduce runtimes also in Python.

Ah there is another issue with the Python code:

this runs an event loop at each iteration of the for loop:

for ib in range(len(branchlist)):                                                                                       
    print('... doing', branchlist[ib])                                                                                  
    h_incl = d_incl.Histo1D( ("h_{}_incl".format(ib), "h_{}_incl".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    h_cut1 = d_cut1.Histo1D( ("h_{}_cut1".format(ib), "h_{}_cut1".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    h_cut2 = d_cut2.Histo1D( ("h_{}_cut2".format(ib), "h_{}_cut2".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    h_incl.Draw('hist')                                                                                                 
    c1.Print('rdfplots/h_%i_incl.pdf' % ib)                                                                             
    h_cut1.Draw('hist')                                                                                                 
    c1.Print('rdfplots/h_%i_cut1.pdf' % ib)                                                                             
    h_cut2.Draw('hist')                                                                                                 
    c1.Print('rdfplots/h_%i_cut2.pdf' % ib)

The best way to write this in RDF is:

# book _all_ operations first
histo_tuples = []
for ib in range(len(branchlist)):                                                                                       
    print('... doing', branchlist[ib])                                                                                  
    h_incl = d_incl.Histo1D( ("h_{}_incl".format(ib), "h_{}_incl".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    h_cut1 = d_cut1.Histo1D( ("h_{}_cut1".format(ib), "h_{}_cut1".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    h_cut2 = d_cut2.Histo1D( ("h_{}_cut2".format(ib), "h_{}_cut2".format(ib), 100, -3, 3) , 'x_{}'.format(ib), 'col_w') 
    histo_tuples.append((h_incl, h_cut1, h_cut2))

# access results second: the first access to one of the results triggers the event loop
for ib, t in enumerate(histo_tuples):
    h_incl, h_cut1, h_cut2 = t
    h_incl.Draw('hist')
    c1.Print('rdfplots/h_%i_incl.pdf' % ib)
    h_cut1.Draw('hist')
    c1.Print('rdfplots/h_%i_cut1.pdf' % ib)
    h_cut2.Draw('hist')
    c1.Print('rdfplots/h_%i_cut2.pdf' % ib)

Hi, thanks all for the replies!

Indeed, changing the order of the operations in Python speeds things up greatly: this now runs in about 6 seconds, compared to 3-4 s for the “by hand” C++ loop version.

So, just to be sure I understand: the just-in-time compilation relates to the actions (Define, Filter) that are passed to the RDataFrame as strings, versus the same operations being pre-compiled in an optimised way in the C++ loop version, right?
Indeed, I noticed that a C++ version of the same optimised RDataFrame code executes in about the same time.

In this case, would the code with the changes suggested by @eguiraud be the most optimised version for speed?
I saw that the return values of the Histo1D calls are RResultPtr objects, so that should follow the suggestion by @beojan .

Cheers,
Luca

In Python, yes. In C++, you can pass actual C++ functions to Define and Filter rather than strings, and then compile the code with g++ -O2 (-O2 enables the compiler optimizations): that’s the fastest RDF can run. It’s nice that the runtime of the new Python version is close to the C++ version though, and it will only get better in the future.

…fast C++ version incoming…

Times on my laptop (2 cores, while chromium and other stuff was running, gcc 9.3, ROOT 6.22/06), running on 100k datapoints created as above, having the file warm in the filesystem cache:

Python original     70.20
Python fixed        10.60
RDF C++ (O0)         5.38
RDF C++ (O2)         3.92
RDF C++ (O2) + MT(2) 3.00
TTreeFormula (O0)    4.04
TTreeFormula (O2)    3.21

Besides my laptop not being that great (and not idle), one reason for the less-than-ideal scaling when moving to 2 cores is that there is too little data: with 100k entries, there are only 2 clusters in the TTree (a cluster is a bunch of entries zipped together), so RDF only spawns 2 tasks. Ideally you want something like 10 tasks per core for nice scaling. Anyway, a ~30% speed-up for free (3.92 s → 3.00 s) is not terrible.

It’s expected that the TTreeFormula-based version is less sensitive to compiler optimization levels: in the TTreeFormula case, 99% of the important logic has already been compiled (with optimizations) into your ROOT installation, while RDataFrame includes a lot of C++ template code that is compiled together with the user code.

Fully compiled RDF

Note the use of lambda functions instead of strings in Define and Filter, and the use of Histo1D<double, double>(...) instead of just Histo1D(...): specifying the column types as template parameters avoids just-in-time compilation.

#include <ROOT/RDataFrame.hxx>
#include <TCanvas.h>
#include <TH1D.h>
#include <TROOT.h>
#include <string>
#include <vector>

int main() {
   gROOT->SetBatch(true);

   const auto dtset_name = "dataset.root";

   ROOT::EnableImplicitMT(2);
   auto d = ROOT::RDataFrame("tree", dtset_name)
               .Define("col_w", [](double x_0, double x_1) { return x_0 * x_1; }, {"x_0", "x_1"});

   auto cut1 = [](double x_99) { return x_99 < 0; };
   auto cut2 = [](double x_99, double x_98, double x_97) { return x_99 < 0 && x_98 < 0 && x_97 < 0; };

   auto d_incl = d;
   auto d_cut1 = d.Filter(cut1, {"x_99"});
   auto d_cut2 = d.Filter(cut2, {"x_99", "x_98", "x_97"});

   std::vector<std::string> branchlist(100);
   for (int i = 0; i < 100; ++i)
      branchlist[i] = "x_" + std::to_string(i);

   // book histograms
   struct HistoTuple {
      ROOT::RDF::RResultPtr<TH1D> incl;
      ROOT::RDF::RResultPtr<TH1D> cut1;
      ROOT::RDF::RResultPtr<TH1D> cut2;
   };
   std::vector<HistoTuple> histos;
   for (int i = 0; i < 100; ++i) {
      const auto ib = std::to_string(i);
      const auto incl_title = "h_" + ib + "_incl";
      const auto cut1_title = "h_" + ib + "_cut1";
      const auto cut2_title = "h_" + ib + "_cut2";

      auto h_incl = d_incl.Histo1D<double, double>({incl_title.c_str(), incl_title.c_str(), 100, -3, 3}, "x_" + ib, "col_w");
      auto h_cut1 = d_cut1.Histo1D<double, double>({cut1_title.c_str(), cut1_title.c_str(), 100, -3, 3}, "x_" + ib, "col_w");
      auto h_cut2 = d_cut2.Histo1D<double, double>({cut2_title.c_str(), cut2_title.c_str(), 100, -3, 3}, "x_" + ib, "col_w");
      histos.emplace_back(HistoTuple{h_incl, h_cut1, h_cut2});
   }

   TCanvas c1("c1", "c1");
   int ib = 0;
   for (auto &h : histos) {
      h.incl->Draw("hist");
      c1.Print(("rdfplots/h_" + std::to_string(ib) + "_incl.pdf").c_str());
      h.cut1->Draw("hist");
      c1.Print(("rdfplots/h_" + std::to_string(ib) + "_cut1.pdf").c_str());
      h.cut2->Draw("hist");
      c1.Print(("rdfplots/h_" + std::to_string(ib) + "_cut2.pdf").c_str());
      ++ib;
   }

   return 0;
}

This is what I was trying to point out when I said the RDataFrame code posted runs an event loop per histogram.

It’s nice to hear that optimized JITting is coming soon, though I still tend to be really wary of using the “expressions in strings” feature because the validity of those expressions can’t be checked until runtime (and you don’t get syntax highlighting either).


Hi @eguiraud and @beojan , thank you again for the replies and sorry to come back only now to the thread.
And thanks in particular for the C++ example with lambdas and compiler optimisations, that is very useful.

Along those lines, given the difference in timing between Python and C++, is it correct to assume that the difference comes essentially from the efficiency of evaluating the Define and Filter instructions? Or is there some large overhead from Python itself (excluding, of course, startup time and small operations outside the event loop)?

I have seen in the manual that one can compile functions into a shared library (so, I assume, with -O3) and load it with ROOT.gSystem.Load in a Python script. Could this be a viable solution to speed up the code while keeping the flexibility of Python?
That flexibility is particularly useful for code that is meant to be used with a generic set of inputs, variables or selections, possibly configured from an external file, but clearly a factor of 2-3 in speed is a big plus for something that is expected to run often.

Or do you think this will just become an irrelevant optimisation once the new JITting is released?

The overhead does not come from Python itself (the event loop runs in C++ either way). It’s a combination of effects due to the lack of compiler optimizations of the C++ code that is just-in-time compiled and some added virtual calls that are necessary to call code that was generated at runtime.

It’s certainly worth a shot, hard to predict what kind of impact it will have.

I certainly hope so :)

If you’re going to the trouble of building a shared library, why not just use compiled C++ for the whole thing?

RDataFrame driven from Python isn’t really more flexible than RDataFrame driven from C++, so long as you use C++17 or 20. If you want Python style string formatting, look up fmtlib.

Where you really benefit from Python is if you use uproot and boost-histogram, but you aren’t doing that here.

Hi, thanks for your replies! I will then build the example above as a shared library and post the performance results here.
But it’s definitely very good to know that the updated JITting will speed this up anyway; thanks for these developments :)

Regarding Python vs C++: I agree that RDF has the same interface in both, but it’s all the surrounding machinery that is easier to handle in Python, at least for my use case.
I need code that fills multiple plots with configurable selections, target tree branches and weights, where all of this information is ideally stored in a configuration file that can be copied to keep track of how the histograms were filled.
Clearly everything that runs in Python can also be implemented in C++, but I think there is no match for Python’s flexibility in parsing the inputs and modifying them on the fly if needed.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.