RDataFrame and suitability for rapid iterative passes through datasets


ROOT Version: 6.36.02
Platform: linuxx8664gcc
Compiler: g++ (GCC) 15.1.0


Hi, I’m a new developer for MaCh3, which is one of the MCMC tools that is used by a variety of neutrino oscillation experiments. A crucial part of this tool is the derivation of systematically varied variables (e.g. an energy shift) and event weights, the application of some event selection, and then binning those events into a histogram. Those histograms later get used in a likelihood calculation (not using RooFit).

I am interested in how we can use RNTuple and RDataFrame to perform the steps up to and including binning the histogram. These steps get repeated for different parameter values (e.g. a different systematic nuisance value) at every step in the Markov chain, and so must be very fast.

I implemented a slimmed-down version of our workflow with RDataFrame reading from a file with an RNTuple, but have struggled to reach similar performance to MaCh3, which uses arrays of structs and custom C++ to create the histograms. The slimmed-down version was about 7 times slower using RDataFrame.

In an effort to understand the difference in performance, I looked at how long it takes to simply sum over a column. I provide the code below. In this example, the column name is “ELep”, and the dataset is about 49,000 events.

I tried two approaches:

  1. RDataFrame
  2. “Raw C++”, which just means taking a vector of floats from the RDataFrame and then summing over the values in a for loop (see the sumVec function)

#include <ROOT/RDataFrame.hxx>
#include <chrono>
#include <iostream>
#include <vector>

// Pass by const reference so the benchmark measures only the summation,
// not a copy of the vector on every call.
float sumVec(const std::vector<float> &vec) {
  float total = 0.0f;
  for (float v : vec) {
    total += v;
  }
  return total;
}

int main(int argc, char const *argv[])
{  
  ROOT::RDataFrame df("Events", argv[1]);
  auto df_cached = df.Cache<float>({"ELep"});
  auto ELep = df_cached.Take<float>("ELep").GetValue();

  auto sum = df_cached.Sum<float>("ELep");

  int n_trials = 1000;
  std::vector<float> integrals;
  integrals.reserve(n_trials);

  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < n_trials; ++i) {
    sum = df_cached.Sum<float>("ELep"); // comment out to try raw c++
    integrals.push_back(sum.GetValue()); // comment out to try raw c++
    //integrals.push_back(sumVec(ELep)); // uncomment to try raw c++
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "Total time: " << duration.count() << " microseconds" << std::endl;
  std::cout << "Average time per trial: " << duration.count() / static_cast<double>(n_trials) << " microseconds" << std::endl;

  return 0;
}

On my machine (Intel Xeon(R) Gold 6240 CPU), I found that approach (1) took about 1300 microseconds per sum/trial, whereas approach (2) took about 60 microseconds.

In case it’s important, I am running ROOT 6.36.02 using LCG release 108 (x86_64-el9-gcc15-opt).

I wondered whether that might be some overhead that is constant with the number of events, but when trying this over a dataset of 5000 events, I see a similar performance gap, with approach (1) at 140 microseconds and approach (2) at 5 microseconds, which disproves my overhead theory.

I then profiled approach (1) and got these results:

Showing nodes accounting for 1.58s, 100% of 1.58s total
      flat  flat%   sum%        cum   cum%
     0.40s 25.32% 25.32%      0.40s 25.32%  std::_Sp_counted_base::_M_release
     0.25s 15.82% 41.14%      0.25s 15.82%  std::__shared_count::__shared_count
     0.18s 11.39% 52.53%      0.18s 11.39%  ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters
     0.16s 10.13% 62.66%      0.16s 10.13%  ROOT::Internal::RDF::RAction::Run
     0.12s  7.59% 70.25%      0.12s  7.59%  ROOT::RDF::RLazyDS::SetEntry
     0.08s  5.06% 75.32%      0.08s  5.06%  ROOT::Internal::RDF::RAction::GetValueChecked

I find it curious that so much of the time is spent on operations related to shared pointers, but without more knowledge about the RDataFrame implementation, I can’t explain it.

My questions are:

  1. Is there anything obviously wrong in my RDataFrame implementation above?
  2. Why is there such a performance discrepancy?
  3. In what way does RDataFrame scale well? In other words, in what use cases do you expect RDataFrame to be quick compared to alternatives? Where does it shine?
  4. Is our workflow of constantly recalculating histograms a good place to think about RDataFrame? I worry that RDataFrame is designed for a single pass over the data, and for very easy/good streaming of that data from disk. Maybe it doesn’t suit our use case, where we have all our data in memory and are iterating over it at a fast rate.

Thanks in advance!

First, welcome to the ROOT Forum!
Then, I’m sure @vpadulan can help you with these questions

Vincenzo will give a clearer and more complete view, but a few quick points, from limited RDataFrame knowledge (I may be wrong in some of this! Hopefully Vincenzo or another expert will correct me if so):

  • There is an overhead, and you’ll probably need many more events than 50k to notice differences.
  • Also, RDataFrame can run in parallel (with ImplicitMT enabled), but again, you need a lot of data to make it worth it (see, e.g. RDataFrame seems too conservative about spawning new threads)
  • Your example has a Take.GetValue, and later a sum.GetValue, which I think is triggering 2 event loops, not the best for RDataFrame performance; lazy and instant actions should be planned carefully.

You can also check out the performance tips in the documentation: ROOT: ROOT::RDataFrame Class Reference

Hi dastudillo,

Thanks for your reply. In response to your comments:

  • I went up to 100M events, and RDataFrame still lags behind raw C++ by a factor of 10
  • I have experimented with multithreading, and with complex enough operations, it did help out. The forum post you linked is quite helpful for understanding how I can change the multithreading behaviour, which I was wondering about. However, I want to compare like-for-like in this case, and therefore use single-threaded RDataFrame when comparing to raw C++.
  • I think you’re right that Take.GetValue and sum.GetValue both trigger passes through the data, but I am only using Take.GetValue once, to retrieve a vector of floats that I can benchmark the raw C++ with. The bits I am benchmarking and comparing are in the for loop, where I run sum.GetValue or the equivalent raw C++ sumVec(ELep) many times.

I have already checked out the performance tips in the documentation, which have helped to speed up the RDataFrame code to the point it is now. It’s not clear to me how it could be any quicker.

Out of curiosity … what happens if you comment out the line:
sum = df_cached.Sum<float>("ELep"); // comment out to try raw c++

You will then (re)use the “sum” initialized (just once) outside of the “for” loop.

When doing that, the measured time comes down by a factor of n_trials. I’m fairly certain that when doing this, the computation graph is run once, on the first call to sum.GetValue(), and every subsequent call in the loop reuses the previously computed value.

However, this is not what I’d like to do. Returning to the original motivation involving reweighting/shifting events and binning histograms, this needs to be redone at every step in the MCMC, so we can not rely on pre-computed values.

Well, maybe instead of creating a new “sum” in the “for” loop, we need a way to “invalidate” the computed “value” (i.e., make it “Is-Not-Ready”, without destroying the already-loaded, cached columns).

Dear @Charlotte-Knight ,

Thanks for reaching out to the forum, and for presenting us with a new interesting use case!

As a general rule of thumb, RDataFrame aims to provide one coherent data analysis API for scientific computing in physics experiments (and a bit beyond that). One key, if somewhat implicit, characteristic of the use cases RDataFrame targets is that the analysis must be complex, for some definition of that word in terms of input dataset size, number of operations, and computational cost of each operation. From all I can read in this post, performance is being compared in terms of microseconds, up to a second in the longest case. This scenario cannot possibly fit within an API such as RDataFrame, and in general it would be hard to make it fit any API that is not just a raw sum over a vector, as you already show. If your application does not already last more than a few minutes without RDataFrame, there is little to no chance that RDataFrame can make it faster.

Now, after the preamble above which is necessary for the rest that I’m about to write, in the case that I missed something from your description such that your application does indeed fit into some definition of complex, then I think I can give more insights. RDataFrame is a lazy API: each operation must be booked upfront and the results must all be queried together at the end, for maximum performance (see the docs for details ROOT: ROOT::RDataFrame Class Reference ). I can see from your example above that you are already triggering the computation graph too many times (look for the calls to e.g. GetValue). As a second piece of information to keep in mind, RDataFrame does indeed have to do some bookkeeping of the computation graph, which introduces both an overhead at startup and some extra work to be done (what you see with the shared_ptr management for example). Keep in mind that in many occasions RDataFrame has provided factors (even orders of magnitude) better performance for production LHC analysis use cases.

If you want to dig deeper into this topic, I can also propose to have a call to discuss more details. Let me know.

Cheers,

Vincenzo


Hi Vincenzo,

Thanks for the reply.

Our use case is more complex than the summing example above: it corresponds to a ~0.1 s operation over an entire dataset cached in memory, but that still sounds too quick to suit RDataFrame. Over millions of steps in the Markov chain this does add up to jobs of O(days), and therefore might be suitable… but because each step in the chain determines the parameters with which you process the dataset for the next step, I don’t think we can book all the operations upfront, so things remain unsuitable.

Nevertheless, I still think it’d be useful to have a call to discuss further. I’d like to confirm that we understand each other on this topic, and there may be other applicable use cases for RDataFrame in our framework. Please get in touch via my email: charlotte.knight@imperial.ac.uk, if you’re still happy to call.

Cheers,

Charlotte

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.