RDataFrame is very slow for many histograms

mdunser · February 6, 2020, 4:46pm

Hi,

I have a problem with RDataFrame, where it doesn’t seem to perform well when filling many histograms. I’m trying to fill N histograms from an RDataFrame by a simple .Filter().Histo1D() operation, and then I want to save these N histograms in a file. This seems to be working as intended, i.e. all histograms are actually only filled once I try doing something with one of them, so the whole event-loop being done once seems to work.

However, the whole procedure depends vastly on what N is. It seems ok for N ~10, but once I increase N to ~100 or more, the time it takes for these operations increases drastically. And that kind of defeats the purpose of using an RDataFrame in the first place…

For instance, filling and saving 20 histograms takes 69.23 seconds, while doing the exact same thing for 200 histograms takes 1725.14 seconds.

Enabling multithreading also doesn’t help.

I attach a script that runs on lxplus7, with a file that can be downloaded by following this link:
https://cernbox.cern.ch/index.php/s/GGDVFWMkwoGEzft

The root version i source is
. /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.16.00/x86_64-centos7-gcc48-opt/bin/thisroot.sh

I also tried ROOT 6.18.00, but I get the same (slow) behavior when increasing N.

Any help would be appreciated!

Best,
-marc

standalone.py (1.6 KB)

ROOT Version: 6.16.00
Platform: lxplus
Compiler: gcc48

Axel · February 7, 2020, 7:36am

Thanks for your report, Marc - that’s indeed not expected. We will profile what’s happening here. Please ping me Wednesday next week should you not hear back from us.

swunsch · February 7, 2020, 12:33pm

Hi Marc,
I’m looking into this. Thanks for the great reproducer!

Here is my measurement of the timings for N histograms (measured on my machine, SSD, single thread):

N  Runtime [s]
2  7
4  11
8  20
16 38
32 85

So the runtime scales with the number of histograms linearly. This means that you don’t gain by deserializing the data only once in one event loop. Moreover, this hints to the act that you spent your time in making the histograms

Note that you put a filter in front of each histogram. I suspect that we spent the actual time in making the selection in the event loop and then filling the histogram. As we expect in this case.

I’ll investigate further and make a profile of your example!

Best
Stefan

Edit: Does this make sense for you? Don’t assume that I am right

swunsch · February 7, 2020, 12:48pm

flamegraph

Edit: Download the svg (right click, save as) and open it again in the browser. Then you can click on the bars!

Here you can have a look at the flamegraph. Just open it in the browser
Forget about chromium and stuff (that’s me browsing the forum besides running the profile) and look at the python process. We actually spend all the time in RunAndCheckFilters, which does what it says. So I think we actually burn the CPU for a good reason.

I would love to have a second opinion on this from @eguiraud

It’s a little bit tricky since the jitting could confuse the profile. Just thinking whether we get here the right picture or not.

mdunser · February 7, 2020, 12:57pm

Hi Stefan,

thanks a lot for having a look into this! I’m not 100% sure how to interpret your answers though. Is this then just a feature of the problem and nothing can be done? Or is there something that can be improved?

Somehow I’m a bit puzzled that even if this code spends most of its time in making/filling the histogram, it seems like a lot of time is spent doing this, no? 85 seconds for 32 histograms…

To me this example just seemed like a perfect place to replace our old TTree->Draw logic to using something that does the loop only once (especially now that we need to do this ~1000 times). Am I wrong and should I just go back to coding my own event loop after having made all my histograms by hand?

Thanks a lot again!
-marc

swunsch · February 7, 2020, 1:14pm

Hi!

So let’s figure this out Since the deserialization of the file takes only a small part of the runtime, using RDataFrame compared to TTree::Draw is not going to make you much faster. But! ->

You can parallelize trivially (means implicitly with EnableImplicitMT) the thing and you’ll be faster because any other workflow would now deserialize the data twice, for example a TTree::Draw workflow using multiple processes. So for sake of scalability, RDataFrame is the way to go.

Another thing that comes to my mind: You are creating an own filter for each histogram, which could be the culprit killing your performance. I guess it would be much more efficient to fill a 3D histogram? Then you get rid of many many calls to Filter going directly to a probably way more efficient Fill call to TH3D.

What do you think?

Best
Stefan

Edit: You need a 3D histogram with pt, eta and mass on the axis.

swunsch · February 7, 2020, 1:35pm

Hi!

Let me give you an idea of what I mean:

# Note that you can write this nicer with experimental PyROOT or even numpy arrays,
# but I stick to this for simplicity now.
binsPt = ROOT.std.vector("double")()
# fill pt bins
binsEta = ROOT.std.vector("double")()
# fill eta bins
binsMass = ROOT.std.vector("double")()
# fill mass bins

df = ROOT.RDataFrame("tree", "file.root")
hist_pass = df.Filter("pass_cut").Histo3D(
    ROOT.RDF.TH3DModel("pass", "pass", binsPt.size(), binsPt.data(), binsEta.size(), binsEta.data(), binsMass.size(), binsMass.data()),
    "pt_branch", "eta_branch", "mass_branch")
hist_fail = df.Filter("fail_cut").Histo3D(
    ROOT.RDF.TH3DModel("fail", "fail", binsPt.size(), binsPt.data(), binsEta.size(), binsEta.data(), binsMass.size(), binsMass.data()),
    "pt_branch", "eta_branch", "mass_branch")

# Save the histograms

I just hacked something together in your reproducer and no guarantee that I didn’t screw up, but this runs in about 4s compared to 40s.

Best
Stefan

eguiraud · February 7, 2020, 1:56pm

Hi!
@mdunser thank you for the simple reproducer.

I think it would be interesting to see a flamegraph for a rewriting of the original code in C++ (so we can see stacktraces more clearly and exclude overhead of jitting and python). I think @swunsch is kindly taking care of that (thanks a lot!).

EDIT:
about @swunsch’s idea of using TH3Ds: absolutely, reducing branching of the computation graph is always good, and filling a TH3D is much faster than checking N filters. I’m still curious about where time is spent exactly in the original code, though.

Cheers,
Enrico

beojan · February 7, 2020, 2:17pm

I’m seeing something similar. I have a loop in which I do filtered_df.Define("wgt").Histo1D(..., "wgt") repeatedly (the definition of the weight changes each time).

Watching memory use in htop, it looks like it slowly creeps up, which suggests the dataframe, or more likely the histograms I made, aren’t deleted even when both the RResultPtr and the RDataFrame have gone out of scope. Everything is done in compiled C++, and I have no JITted strings.

In addition, the time taken also seems to creep up, to the point that running over 25 /fb of 2016 data takes much longer (2 - 3 X) than 44 /fb of 2017 data.

swunsch · February 7, 2020, 2:19pm

Hi!

Could you please open another thread, that would be perfect! Otherwise, we are getting confused here.

In your case, the issue could be something else since you are not cutting extensively but creating new columns in the dataframe (and then creating a histogram).

And if you have as well a reproducer, I would be happy to look into it.

Best
Stefan

mdunser · February 7, 2020, 2:29pm

Hi all,

thanks for all the suggestions. I mean, I’m clearly aware of the fact that I could just fill a 3D histogram and then slice it up the way I want etc. My understanding of the RDataFrame (maybe wrongfully so) was that the internal functions (or rather the “laziness” in not filling at creation time) should be optimized that it doesn’t spend thousands of seconds doing something that isn’t all that complicated.

If filling a TH3D is the best way to go, then I’ll do that. I was just really confused as to where the benefit of the RDataFrame really lies if it seems to not improve anything in what I thought was a rather simple use-case. And clearly I was just lazy as well not wanting to slice up TH3s and just inserting TH1s into the code that I inherited ; )

Best,
-marc

eguiraud · February 7, 2020, 2:49pm

A lazy event loop is not a silver bullet, especially if reading the data is not where you spend most time, as it seems to be the case here.

It might very well be that there is room for performance optimization in RDF for the case of hundreds of filters present – that’s what the bulk of my last comment was about. Let’s see what we dig up.
As a baseline, how long would it take to run the same analysis with the original TTree->Draw logic?

mdunser · February 7, 2020, 2:57pm

I didn’t time it recently, but clearly the old logic was way suboptimal, because for each bin it would do two TTree->Draw commands outputting a TH1 each. Then moving to the next bin and doing it all over again. So at that point the reading of the data does become a huge issue, because it fully does it twice for each bin in pT/eta.

If I remember correctly, it took well north of an hour though, but I can’t give you a better estimate right now.

Let me know if you dig something up, otherwise I’ll just go TTree->Draw() into a TH3. At which point I suppose I can skip using an RDataFrame altogether. It would be a shame, because it’s limited to two dimensions (and I kinda wanted to add another one), but I can also just double the running again, which will still be faster.

Best,
-marc

swunsch · February 7, 2020, 3:26pm

Hi Marc,

As you said, in the case of multiple TTree::Draw calls you burn massively CPU for redoing stuff aka deserialization of the file. I’ve just tested your use-case in C++ and indeed the runtime is not optimal if you use jitting. We will investigate the issue and improve the performance.

However, you can use a solution as follows, replacing the jitting by compiled code and I get almost the same runtime than using a TH3D.

Please use the following code only as suggestions

#include "ROOT/RDataFrame.hxx"
#include <vector>
#include <string>
#include <sstream>
#include <iostream>

int main() {
    std::vector<float> binsPt;
    std::vector<float> binsEta;
    for (int i = 0; i < 4; i++) binsPt.push_back(25 + 5 * i);
    for (int i = 0; i < 4; i++) binsEta.push_back(-2.4 + 0.1 * i);

    ROOT::RDataFrame rdf("IDIsoToHLT/fitter_tree", "triggerTnP_DY_mu.root");

    std::vector<ROOT::RDF::RResultPtr<TH1D>> h_tmp_pass;
    std::vector<ROOT::RDF::RResultPtr<TH1D>> h_tmp_fail;
    int n = 0;
    for(int i = 0; i < binsPt.size() - 1; i++) {
        for(int j = 0; j < binsEta.size() - 1; j++) {
            auto pass = [binsPt, binsEta, i, j](float pt, float eta, float probe) {
                return (pt > 25) && (pt >= binsPt[i]) && (pt < binsPt[i + 1]) && (eta >= binsEta[j]) && (eta < binsEta[j + 1]) && (probe > -1);

            };
            auto fail = [binsPt, binsEta, i, j](float pt, float eta, float probe) {
                return (pt > 25) && (pt >= binsPt[i]) && (pt < binsPt[i + 1]) && (eta >= binsEta[j]) && (eta < binsEta[j + 1]) && !(probe > -1);
            };
            std::stringstream ss2;
            ss2 << "bin" << i << j;
            const auto tmp_name = ss2.str();

            const auto tmp_name_pass = tmp_name + "_Pass";
            const auto tmp_name_fail = tmp_name + "_Fail";
            h_tmp_pass.push_back(
                    rdf.Filter(pass, {"probe_lep_pt", "probe_lep_eta", "probe_muTrgPt"}).Histo1D<float>({tmp_name_pass.c_str(), tmp_name_pass.c_str(), 60, 60., 120.}, "pair_mass"));
            h_tmp_fail.push_back(
                    rdf.Filter(fail, {"probe_lep_pt", "probe_lep_eta", "probe_muTrgPt"}).Histo1D<float>({tmp_name_fail.c_str(), tmp_name_fail.c_str(), 60, 60., 120.}, "pair_mass"));
            n++;
        }
    }
    std::cout << "Created " << 2 * n << " histograms." << std::endl;

    auto f = TFile::Open("standalone2.root", "RECREATE");
    for (auto h : h_tmp_pass) h->Write();
    for (auto h : h_tmp_fail) h->Write();
    f->Close();
}

This runs in around 5s on my machine, compared to 40s of the Python or TTree::Draw version. And you get multi-threading for free, which gives you (in the best case) again a runtime of N.

Another side point: You can also split your filter in two filters (one for the bin, and one for pass/fail) and you’ll do even less work in the event loop.

I think we cannot do more from our side. Feel free to report any other unexpected features you may experience!

Best
Stefan

eguiraud · February 7, 2020, 3:32pm

Thanks @swunsch!
So pure C++ runs in 5s, C++ with just-in-time compilation in 40s.

There is definitely margin for improvement there, just-in-time compilation (which is unavoidable using RDF from python, by the way) should not take that long.

@mdunser on top of the pure C++, using TH3’s will give another speed-up, and multi-threading too as Stefan said. I think we all agree that the performance from the python side is…less than ok. I’ll reply here if we have any update on that (it does not take 40s to compile that code, so there should be margin for improvement).

Cheers,
Enrico

mdunser · February 7, 2020, 3:34pm

Hi Stefan,

thanks a lot for the code. I’m traveling now, but will give it a shot on Monday. I will also see how the different solutions scale with the number of histograms, I see in your example this only does a few bins, instead of the ~1000 I’ll need to run in the end.

And I fully agree that doing multiple TTree::Draw commands is suboptimal, that’s why I wanted to go RDataFrame in the first place ; )

Thanks a bunch for the help in any case!

Have a good weekend,
-m

eguiraud · February 9, 2020, 10:48pm

Hi,
starting with @swunsch’s C++ rewriting, I investigated a little bit. I wanted to figure out where the extra time was spent in the python version vs the C++ version. The obvious difference is that in the C++ version code is compiled ahead of time (as usual, by the compiler) and optimized, while in the python version the code is compiled just-in-time by the ROOT interpreter, and the ROOT interpreter does not do compiler optimization, at least by default.

First of all, with some timers I could verify that time is not spent in the just-in-time compilation itself, but actually running the event loop.

Here are measurements for increasing number of bins (i.e. increasing numbers of Filters and Histo1Ds), compiled code vs just-in-time compiled, with and without compiler optimizations. All runs were single-thread.

All compiled, O3

filters  runtime (s)
18      9.76
162     119.86
392     303.188

All compiled, O0

filters  runtime (s)
18      35.09
162     331.538
392     802.401

Filters just-in-time compiled, program compiled with O3

filters runtime (s)
18     35.5791
162    629.376
392    1511.88

Filters just-in-time compiled, program compiled with O0

filters runtime (s)
18     37.7406
162    647.737
392    1560.87

So program optimization definitely plays a role, but there is a factor 2 between just-in-time compiled filters and filters compiled with O0 for larger number of histograms that I’m not sure where it comes from, and it would be interesting to investigate (although I’m not quite sure how yet).

Cheers,
Enrico

all_compiled.cpp (2.0 KB) jitting_filters.cpp (2.0 KB)

system · February 23, 2020, 10:51pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.