RDataFrame and suitability for rapid iterative passes through datasets


ROOT Version: 6.36.02
Platform: linuxx8664gcc
Compiler: g++ (GCC) 15.1.0


Hi, I’m a new developer for MaCh3, an MCMC tool used by a variety of neutrino oscillation experiments. A crucial part of this tool is deriving systematically varied variables (e.g. an energy shift) and event weights, applying an event selection, and then binning the selected events into a histogram. Those histograms later get used in a likelihood calculation (not using RooFit).

I am interested in how we can use RNTuple and RDataFrame to perform the steps up to and including binning the histogram. These steps get repeated for different parameter values (e.g. a different systematic nuisance value) at every step in the Markov chain, and so must be very fast.
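Roughly, this is the shape of the per-step pipeline I have in mind (a sketch only: the shift, weight, and selection below are made-up placeholders, not our real systematic model; only the “ELep” column and the “Events” RNTuple name are real):

#include <ROOT/RDataFrame.hxx>
#include <iostream>

// Sketch of one MCMC step: derive a shifted variable and a weight, apply a
// selection, and bin the result. The shift, weight, and cut are placeholders.
void fillHistogram(const char *fileName, float shift)
{
  ROOT::RDataFrame df("Events", fileName);
  auto hist =
      df.Define("ELepShifted", [shift](float e) { return e * (1.f + shift); }, {"ELep"})
        .Define("weight", [](float e) { return 1.f; }, {"ELepShifted"})   // per-event weight (placeholder)
        .Filter([](float e) { return e > 0.f; }, {"ELepShifted"})         // event selection (placeholder)
        .Histo1D({"h", "shifted lepton energy", 50, 0., 5.}, "ELepShifted", "weight");
  std::cout << "integral: " << hist->Integral() << std::endl;             // this histogram feeds the likelihood
}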

I implemented a slimmed-down version of our workflow with RDataFrame reading from a file with an RNTuple, but I have struggled to match the performance of MaCh3, which uses arrays of structs and custom C++ to fill the histograms. The slimmed-down RDataFrame version was about 7 times slower.

In an effort to understand the difference in performance, I looked at how long it takes to simply sum over a column. I provide the code below. In this example, the column name is “ELep”, and the dataset is about 49,000 events.

I tried two approaches:

  1. RDataFrame
  2. “Raw C++”, which just means taking a vector of floats from the RDataFrame and summing the values in a for loop (see the sumVec function below)

#include <ROOT/RDataFrame.hxx>
#include <chrono>
#include <iostream>
#include <vector>

// "Raw C++" baseline: sum the values of a column in a plain loop
float sumVec(std::vector<float> vec) {
  float total = 0.0f;
  for (float v : vec) {
    total += v;
  }
  return total;
}

int main(int argc, char const *argv[])
{  
  ROOT::RDataFrame df("Events", argv[1]);
  auto df_cached = df.Cache<float>({"ELep"});            // cache the column in memory
  auto ELep = df_cached.Take<float>("ELep").GetValue();  // materialise the column once for the raw C++ baseline

  auto sum = df_cached.Sum<float>("ELep");

  int n_trials = 1000;
  std::vector<float> integrals;
  integrals.reserve(n_trials);

  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < n_trials; ++i) {
    sum = df_cached.Sum<float>("ELep"); // comment out to try raw c++
    integrals.push_back(sum.GetValue()); // comment out to try raw c++
    //integrals.push_back(sumVec(ELep)); // uncomment to try raw c++
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "Total time: " << duration.count() << " microseconds" << std::endl;
  std::cout << "Average time per trial: " << duration.count() / n_trials << " microseconds" << std::endl;

  return 0;
}

On my machine (Intel Xeon(R) Gold 6240 CPU), I found that approach (1) took about 1300 microseconds per sum/trial, whereas approach (2) took about 60 microseconds.

In case it’s important, I am running ROOT 6.36.02 using LCG release 108 (x86_64-el9-gcc15-opt).

I wondered whether the difference might be a fixed overhead that is independent of the number of events, but with a dataset of 5000 events I see a similar gap, with approach (1) at about 140 microseconds and approach (2) at about 5 microseconds, which disproves my fixed-overhead theory.

I then profiled approach (1) and got these results:

Showing nodes accounting for 1.58s, 100% of 1.58s total
      flat  flat%   sum%        cum   cum%
     0.40s 25.32% 25.32%      0.40s 25.32%  std::_Sp_counted_base::_M_release
     0.25s 15.82% 41.14%      0.25s 15.82%  std::__shared_count::__shared_count
     0.18s 11.39% 52.53%      0.18s 11.39%  ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters
     0.16s 10.13% 62.66%      0.16s 10.13%  ROOT::Internal::RDF::RAction::Run
     0.12s  7.59% 70.25%      0.12s  7.59%  ROOT::RDF::RLazyDS::SetEntry
     0.08s  5.06% 75.32%      0.08s  5.06%  ROOT::Internal::RDF::RAction::GetValueChecked

I find it curious that so much of the time is spent on operations related to shared pointers, but without more knowledge about the RDataFrame implementation, I can’t explain it.

My questions are:

  1. Is there anything obviously wrong in my RDataFrame implementation above?
  2. Why is there such a performance discrepancy?
  3. In what way does RDataFrame scale well? In other words, in what use cases do you expect RDataFrame to be quick compared to alternatives? Where does it shine?
  4. Is our workflow of constantly recalculating histograms a good fit for RDataFrame? I worry that RDataFrame is designed for a single pass over the data, with efficient streaming of that data from disk. Maybe it doesn’t suit our use case, where we have all our data in memory and iterate over it at a high rate.

Thanks in advance!

First, welcome to the ROOT Forum!
Then, I’m sure @vpadulan can help you with these questions.

Vincenzo will give a clearer and more complete view, but here are a few quick points from my limited RDataFrame knowledge (I may be wrong in some of this! Hopefully Vincenzo or another expert will correct me if so):

  • There is an overhead, and you’ll probably need many more events than 50k to notice differences.
  • Also, RDataFrame can run in parallel (with ImplicitMT enabled), but again, you need a lot of data to make it worth it (see, e.g., the post “RDataFrame seems too conservative about spawning new threads”)
  • Your example has a Take(...).GetValue() and later a sum.GetValue(), which I think triggers two event loops; that is not ideal for RDataFrame performance, and lazy vs. instant actions should be planned carefully (see the sketch after this list).
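For example (a rough, untested sketch), booking both results before calling GetValue lets them share a single event loop, and enabling ImplicitMT can then parallelise that loop:

#include <ROOT/RDataFrame.hxx>
#include <iostream>

int main(int argc, char const *argv[])
{
  // ROOT::EnableImplicitMT(); // optional: parallel event loop, worthwhile for large datasets
  ROOT::RDataFrame df("Events", argv[1]);

  // Book both actions lazily first...
  auto elep = df.Take<float>("ELep");
  auto sum  = df.Sum<float>("ELep");

  // ...then the first GetValue() runs one event loop that fills both results.
  std::cout << "sum = " << sum.GetValue() << ", entries = " << elep->size() << std::endl;
  return 0;
}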

You can also check out the performance tips in the documentation: ROOT: ROOT::RDataFrame Class Reference

Hi dastudillo,

Thanks for your reply. In response to your comments:

  • I went up to 100M events, and RDataFrame still lags behind raw C++ by a factor of 10.
  • I have experimented with multithreading, and with complex enough operations, it did help out. The forum post you linked is quite helpful for understanding how I can change the multithreading behaviour, which I was wondering about. However, I want to compare like-for-like in this case, and therefore use single-threaded RDataFrame when comparing to raw C++.
  • I think you’re right that Take.GetValue and sum.GetValue both trigger passes through the data, but I only use Take.GetValue once, to retrieve a vector of floats for the raw C++ benchmark. The bits I am benchmarking and comparing are in the for loop, where I run sum.GetValue or the equivalent raw C++ sumVec(ELep) many times.

I have already checked out the performance tips in the documentation, and they helped speed up the RDataFrame code to where it is now. It’s not clear to me how it could be made any quicker.