RDataFrames, caching and performance compared to pandas

OlafurSiemsen · October 28, 2021, 12:54pm

I am trying to find the fastest way to test yields for different cut values for optimization, and so far I’ve been using pandas dataframes because some rough benchmarking implied it was faster. But the more I poke and prod, the more suspicious I am of the performance numbers.
Here’s a minimal working example (adapted from the /doc/master/df019__Cache_8py.html ROOT docs tutorial) that shows the relative performance of an RDataFrame, a (supposedly) cached RDataFrame and a pandas dataframe.

import ROOT
import os, sys, timeit
from glob import glob
import numpy as np
import pandas as pd

temp_dir = './temp.root'
tree_name = "ntuple"
if glob(temp_dir):
    print('Local copy found, loading...')
    df = ROOT.RDataFrame(tree_name, temp_dir)
    print('Done')
else:
    print('No local copy found, loading...')
    hsimplePath = os.path.join(str(ROOT.gROOT.GetTutorialDir().Data()), "hsimple.root")
    df = ROOT.RDataFrame(tree_name, hsimplePath)
    print('Saving...')
    df.Snapshot(tree_name, temp_dir)
    print(tree_name, "Saved")

# We define a new column
df = df.Define("px_plus_py", "px + py")
 
# We cache the content of the dataset. Nothing has happened yet: the work to accomplish
# has been described.
df_cached = df.Cache()

t_ndarr = df.AsNumpy()
pdf = pd.DataFrame(t_ndarr)
del t_ndarr

print('Timing uncached RootDataFrame version: ')
timer_df = timeit.timeit("t_res = df.Filter('px>0.1').Sum('random').GetValue(); print('-',end='')", globals=globals(), number=50)
print('   ', timer_df)

print('Timing cached RootDataFrame version: ')
timer_df_cached = timeit.timeit("t_res = df_cached.Filter('px>0.1').Sum('random').GetValue(); print('-',end='')", globals=globals(), number=50)
print('   ', timer_df_cached)

print('Timing pandas dataframe version: ')
timer_pdf = timeit.timeit("t_res = pdf.query('px>0.1')['random'].sum(); print('-',end='')", globals=globals(), number=50)
print('   ', timer_pdf)

This yields the following output:

Welcome to JupyROOT 6.24/06
local copy found, loading...
done
Timing uncached RootDataFrame version: 
--------------------------------------------------    25.99562014453113
Timing cached RootDataFrame version: 
--------------------------------------------------    29.31531055085361
Timing pandas dataframe version: 
--------------------------------------------------    0.21160733606666327

I am not sure what I’m doing here. Am I not caching the RDataFrame? Am I accidentally caching both df and df_cached? Why is pandas doing so damn well?
Invoking sys.getsizeof() on df, df_cached and pdf returns 64, 64 and 600144, but that might just be because RDataFrames are opaque to this method.
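As a rough illustration of why the two RDataFrame numbers are identical and tiny: sys.getsizeof() only counts the memory directly attributed to the object it is given, not anything that object refers to. The sketch below uses a hypothetical `Handle` class as a stand-in for a thin wrapper whose data lives elsewhere; it is not how RDataFrame is implemented, only an analogy in plain Python.

```python
import sys


class Handle:
    """Hypothetical thin wrapper: it only holds a reference to its data."""

    def __init__(self, payload):
        self.payload = payload


payload = list(range(100_000))
handle = Handle(payload)
empty_handle = Handle([])

# Shallow size of the wrapper object; the payload is not followed.
shallow = sys.getsizeof(handle)
shallow_empty = sys.getsizeof(empty_handle)

# Size of the list object itself (its pointer array), still not the ints in it.
referenced = sys.getsizeof(payload)

print("wrapper around 100k items:", shallow)
print("wrapper around empty list:", shallow_empty)
print("the 100k-item list itself:", referenced)
```

The wrapper reports the same size whatever it points to, so 64 for both df and df_cached says nothing about whether either one holds cached data.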

This was done in a SWAN notebook on the latest software stack, 101
ROOT Version: 6.24/06
Platform: CentOS 7
Compiler: gcc8

bellenot · October 28, 2021, 1:16pm

I’m sure @eguiraud can help you

eguiraud · October 28, 2021, 4:18pm

Hi @OlafurSiemsen ,
and welcome to the ROOT forum!
On my laptop:

~ python foo.py
Local copy found, loading...
Done

Timing uncached RootDataFrame version:
--------------------------------------------------    4.0802746700010175
Timing cached RootDataFrame version:
--------------------------------------------------    3.817405771000267
Timing pandas dataframe version:
--------------------------------------------------    0.05196640599933744

Pandas is reading uncompressed data from RAM, which is fast. It does not scale to larger-than-memory datasets, though, and with enough histograms to fill and enough data RDF will be faster, especially if you turn on EnableImplicitMT.

The non-cached RDF is decompressing and reading data from disk, which is slow. It also needs to just-in-time compile some C++ code starting from string expressions such as px + py, which takes some more time. Usage of RDF from Python currently also incurs a large performance penalty due to missed optimization opportunities in the just-in-time compilation, which we plan to address in the next ROOT release (with expected speed-ups for user applications of 1.5x to 2x, see the “No opt vs opt” benchmarks here).

The cached RDF case is more interesting. I would expect it to be slower than pandas but faster than the non-cached version, I will take a look as soon as possible.

Cheers,
Enrico

P.S.
With that said, if you need to perform simple calculations on tabular data that fits in RAM, it’s definitely a good option to load it as numpy arrays with RDF (that lets you select the events you want and/or define extra columns if need be) and then use pandas or similar.
If you need to process larger-than-memory datasets and perform non-trivial processing + filling of many histograms, RDF + multi-threading is probably a good choice (besides the HEP-oriented features already available, it will also soon be faster and we are adding a built-in syntax to express systematic variations).
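For the in-memory case, a cut scan of the kind described in the original question (sum of one column for rows passing px > cut, for many cut values) does not need to re-filter the table for every cut. The following is only a sketch in plain Python, under the assumption that the columns are already loaded as sequences; `build_cut_scan` and `yield_above` are hypothetical helper names, and the random data stands in for the real columns.

```python
import bisect
import itertools
import random


def build_cut_scan(px, weights):
    """Sort rows by px once and precompute prefix sums of the weights."""
    order = sorted(range(len(px)), key=px.__getitem__)
    sorted_px = [px[i] for i in order]
    # prefix[k] is the summed weight of the k rows with the smallest px.
    prefix = [0.0] + list(itertools.accumulate(weights[i] for i in order))
    return sorted_px, prefix


def yield_above(sorted_px, prefix, cut):
    """Summed weight of all rows with px > cut, via one binary search."""
    k = bisect.bisect_right(sorted_px, cut)  # number of rows with px <= cut
    return prefix[-1] - prefix[k]


# Stand-in data; in practice these would be the columns loaded from the file.
rng = random.Random(1)
px = [rng.gauss(0.0, 1.0) for _ in range(1000)]
weights = [rng.random() for _ in range(1000)]

sorted_px, prefix = build_cut_scan(px, weights)
for cut in (0.0, 0.1, 0.2):
    print(f"px > {cut}: {yield_above(sorted_px, prefix, cut):.3f}")
```

After the one-off sort, each additional cut value costs a binary search rather than a pass over the data, which may matter when scanning many thresholds.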

eguiraud · November 1, 2021, 4:44pm

Hi again,
I activated RDF verbose logging by adding the following line at the beginning of the program:

verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)

then I reduced the number of iterations to 1 to check the logs for each event loop.

Since the dataset contains only 25k entries, the actual event loop time is as little as 0.006s. The runtime of the RDF benchmarks is dominated by the time spent jitting (0.1 seconds).
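To put those figures next to the benchmark totals, the timings can be divided by the number of timeit iterations. This is just arithmetic on numbers already quoted in this thread (the laptop totals and the roughly 0.006 s event loop), so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
N_CALLS = 50  # `number=50` in the timeit calls of the original script

# Totals in seconds, copied from the laptop run quoted earlier in the thread.
totals = {
    "RDF uncached": 4.0802746700010175,
    "RDF cached": 3.817405771000267,
    "pandas": 0.05196640599933744,
}
EVENT_LOOP_S = 0.006  # approximate event-loop time quoted from the verbose logs

per_call = {name: total / N_CALLS for name, total in totals.items()}
for name, seconds in per_call.items():
    print(f"{name:>12}: {seconds * 1e3:7.2f} ms per call")

# Fraction of one uncached RDF call that the event loop itself can explain.
loop_share = EVENT_LOOP_S / per_call["RDF uncached"]
print(f"event loop share of an uncached RDF call: {loop_share:.1%}")
```

With these inputs the event loop accounts for well under a tenth of each RDF call, which is why caching the data barely moves the total on a dataset this small.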

Running on 100 million events with the following C++ program, I see the cached RDF takes 0.22s while the non-cached RDF takes 1.17s:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RLogger.hxx>
#include <iostream>

int main() {
  auto verbosity = ROOT::Experimental::RLogScopedVerbosity(
      ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

  auto df = ROOT::RDataFrame("ntuple", "temp.root");
  auto df2 = df.Define("px_plus_py", [](double px, double py) { return px + py; },
                       {"px", "py"});

  auto df_cached = df2.Cache<double, double, double>({"px", "py", "random"});

  std::cout << "Caching...\n";
  df_cached.Sum<double>("px").GetValue();
  std::cout << "Done\n\n";

  std::cout << "Cached run...\n";
  df_cached.Filter([](double px) { return px > 0.1; }, {"px"})
      .Sum<double>("random")
      .GetValue();
  std::cout << "Done\n\n";

  std::cout << "Non-cached run...\n";
  df.Filter([](double px) { return px > 0.1; }, {"px"}).Sum<double>("random").GetValue();
  std::cout << "Done\n\n";
}

used as:

g++ -g -O2 -Wall -Wextra -Wpedantic -o "repro" "repro.cpp" $(root-config --cflags --libs) && ./repro

So for beefier workloads caching does make a difference. Notably, using -O0 (no optimizations) instead of -O2 reduces the gap by a lot, and currently -O0 is what you get when running things from Python, which also helps explain what you see (we will add optimizations to Python usage of RDataFrame very soon).

I hope this clarifies things a bit!
Cheers,
Enrico
