ROOT Version: 6.36.02
Platform: linuxx8664gcc
Compiler: g++ (GCC) 15.1.0
Hi, I’m a new developer on MaCh3, one of the MCMC tools used by a variety of neutrino oscillation experiments. A crucial part of this tool is the derivation of systematically varied variables (e.g. an energy shift) and event weights, the application of an event selection, and the binning of the selected events into a histogram. Those histograms are later used in a likelihood calculation (not using RooFit).
I am interested in how we can use RNTuple and RDataFrame to perform the steps up to and including filling the histogram. These steps are repeated with different parameter values (e.g. a different systematic nuisance value) at every step of the Markov chain, and so must be very fast.
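Concretely, one chain step boils down to something like the sketch below. This is not our actual code: the helper name fillStepRDF, the “weight” column, and the cut value are all illustrative.

#include <ROOT/RDataFrame.hxx>
#include <TH1D.h>

// One chain step expressed as an RDataFrame graph (hypothetical helper).
ROOT::RDF::RResultPtr<TH1D> fillStepRDF(ROOT::RDataFrame &df, float shift) {
    return df
        .Define("ELep_shifted",
                [shift](float e) { return e + shift; }, {"ELep"})   // systematic variation
        .Filter([](float e) { return e > 0.2f; }, {"ELep_shifted"}) // event selection
        .Histo1D<float, float>({"h", "shifted lepton energy", 50, 0., 5.},
                               "ELep_shifted", "weight");           // weighted binning
}

Each call builds a fresh computation graph and triggers one event loop when the histogram is first accessed, e.g. auto h = fillStepRDF(df, 0.05f); h->Integral();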
I implemented a slimmed-down version of our workflow with RDataFrame reading from a file with an RNTuple, but have struggled to reach similar performance to MaCh3, which uses arrays of structs and custom C++ to create the histograms. The RDataFrame version was about 7 times slower.
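For comparison, the existing MaCh3-style approach is roughly the following (again a simplified sketch; the real structs carry many more fields, and the cut value is illustrative):

#include <TH1D.h>
#include <vector>

// Simplified event record.
struct Event {
    float ELep;
    float weight;
};

// One chain step as a tight loop over an array of structs.
void fillStepLoop(const std::vector<Event> &events, float shift, TH1D &h) {
    h.Reset();
    for (const Event &ev : events) {
        const float e = ev.ELep + shift; // systematic variation
        if (e <= 0.2f) continue;         // event selection
        h.Fill(e, ev.weight);            // weighted binning
    }
}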
In an effort to understand the difference in performance, I looked at how long it takes to simply sum over a column. I provide the code below. In this example, the column name is “ELep”, and the dataset is about 49,000 events.
I tried two approaches:
1. RDataFrame
2. “Raw C++”, which just means taking a vector of floats from the RDataFrame and then summing over the values in a for loop (see the sumVec function below)
#include <ROOT/RDataFrame.hxx>
#include <chrono>
#include <iostream>
#include <vector>

// Sum the values in a vector with a plain loop. Note that the vector is
// taken by value, so each call also copies it.
float sumVec(std::vector<float> vec) {
    float total = 0.0f;
    for (float v : vec) {
        total += v;
    }
    return total;
}

int main(int argc, char const *argv[])
{
    // Read the "Events" dataset from the file given on the command line.
    ROOT::RDataFrame df("Events", argv[1]);

    // Cache the column in memory so that disk I/O is excluded from the timing.
    auto df_cached = df.Cache<float>({"ELep"});

    // Materialize the column as a std::vector for the raw-C++ comparison.
    auto ELep = df_cached.Take<float>("ELep").GetValue();

    auto sum = df_cached.Sum<float>("ELep");

    int n_trials = 1000;
    std::vector<float> integrals;
    integrals.reserve(n_trials);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n_trials; ++i) {
        sum = df_cached.Sum<float>("ELep");  // comment out to try raw C++
        integrals.push_back(sum.GetValue()); // comment out to try raw C++
        //integrals.push_back(sumVec(ELep)); // uncomment to try raw C++
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "Total time: " << duration.count() << " microseconds" << std::endl;
    std::cout << "Average time per trial: " << duration.count() / n_trials << " microseconds" << std::endl;
    return 0;
}
On my machine (Intel Xeon(R) Gold 6240 CPU), I found that approach (1) took about 1300 microseconds per sum/trial, whereas approach (2) took about 60 microseconds.
In case it’s important, I am running ROOT 6.36.02 using LCG release 108 (x86_64-el9-gcc15-opt).
I wondered whether the difference might be some overhead that is constant in the number of events, but when I tried this on a dataset of 5,000 events I saw a similar performance gap, with approach (1) at 140 microseconds and approach (2) at 5 microseconds. The gap shrank roughly in proportion to the dataset size (about 1240 microseconds at 49,000 events versus about 135 microseconds at 5,000), so it is a per-event cost rather than a fixed per-trial one, which disproves my overhead theory.
I then profiled approach (1) and got these results:
Showing nodes accounting for 1.58s, 100% of 1.58s total
flat flat% sum% cum cum%
0.40s 25.32% 25.32% 0.40s 25.32% std::_Sp_counted_base::_M_release
0.25s 15.82% 41.14% 0.25s 15.82% std::__shared_count::__shared_count
0.18s 11.39% 52.53% 0.18s 11.39% ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters
0.16s 10.13% 62.66% 0.16s 10.13% ROOT::Internal::RDF::RAction::Run
0.12s 7.59% 70.25% 0.12s 7.59% ROOT::RDF::RLazyDS::SetEntry
0.08s 5.06% 75.32% 0.08s 5.06% ROOT::Internal::RDF::RAction::GetValueChecked
I find it curious that so much of the time is spent on operations related to shared pointers, but without more knowledge about the RDataFrame implementation, I can’t explain it.
My questions are:
- Is there anything obviously wrong in my RDataFrame implementation above?
- Why is there such a performance discrepancy?
- In what way does RDataFrame scale well? In other words, in what use cases do you expect RDataFrame to be quick compared to alternatives? Where does it shine?
Is our workflow of constantly recalculating histograms a good fit for RDataFrame? I worry that RDataFrame is designed for a single pass over the data, with efficient streaming of that data from disk. Maybe it doesn’t suit our use case, where we have all our data in memory and are iterating over it at a high rate.
Thanks in advance!