I am trying to find the fastest way to compute yields for different cut values during optimization, and so far I’ve been using pandas DataFrames because some rough benchmarking suggested they were faster. But the more I poke and prod, the more suspicious I am of the performance numbers.
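For concreteness, the kind of scan I ultimately want is something like the following pandas sketch (the data here is made up; the real columns come from the ntuple in the example below):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real ntuple; the column names follow the hsimple
# tutorial, but the values are made up for illustration.
rng = np.random.default_rng(1)
pdf = pd.DataFrame({"px": rng.normal(size=10_000),
                    "random": rng.uniform(size=10_000)})

cuts = np.linspace(-1.0, 1.0, 21)
# One in-memory pass per cut value: sum of 'random' over rows passing the cut.
yields = np.array([pdf.loc[pdf["px"] > c, "random"].sum() for c in cuts])
```

Since 'random' is non-negative here, the yields shrink monotonically as the cut tightens.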
Here’s a minimal working example (adapted from the /doc/master/df019__Cache_8py.html
ROOT docs tutorial) that compares the relative performance of an RDataFrame, a (supposedly) cached RDataFrame, and a pandas DataFrame.
import ROOT
import os, sys, timeit
from glob import glob
import numpy as np
import pandas as pd
temp_dir = './temp.root'
tree_name = "ntuple"
if glob(temp_dir):
    print('Local copy found, loading...')
    df = ROOT.RDataFrame(tree_name, temp_dir)
    print('Done')
else:
    print('No local copy found, loading...')
    hsimplePath = os.path.join(str(ROOT.gROOT.GetTutorialDir().Data()), "hsimple.root")
    df = ROOT.RDataFrame(tree_name, hsimplePath)
    print('Saving...')
    df.Snapshot(tree_name, temp_dir)
    print(tree_name, "Saved")
# We define a new column
df = df.Define("px_plus_py", "px + py")
# We cache the content of the dataset. Nothing has happened yet: the work to accomplish
# has been described.
df_cached = df.Cache()
t_ndarr = df.AsNumpy()
pdf = pd.DataFrame(t_ndarr)
del t_ndarr
print('Timing uncached RootDataFrame version: ')
timer_df = timeit.timeit("t_res = df.Filter('px>0.1').Sum('random').GetValue(); print('-',end='')", globals=globals(), number=50)
print(' ', timer_df)
print('Timing cached RootDataFrame version: ')
timer_df_cached = timeit.timeit("t_res = df_cached.Filter('px>0.1').Sum('random').GetValue(); print('-',end='')", globals=globals(), number=50)
print(' ', timer_df_cached)
print('Timing pandas dataframe version: ')
timer_pdf = timeit.timeit("t_res = pdf.query('px>0.1')['random'].sum(); print('-',end='')", globals=globals(), number=50)
print(' ', timer_pdf)
This yields the following output:
Welcome to JupyROOT 6.24/06
Local copy found, loading...
Done
Timing uncached RootDataFrame version:
-------------------------------------------------- 25.99562014453113
Timing cached RootDataFrame version:
-------------------------------------------------- 29.31531055085361
Timing pandas dataframe version:
-------------------------------------------------- 0.21160733606666327
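For what it’s worth, my understanding is that the timed pandas expression is just a vectorized mask-and-sum over arrays that are already in memory, with no file I/O at all; something equivalent to this (again with made-up data in place of the real columns):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the columns returned by AsNumpy().
rng = np.random.default_rng(0)
pdf = pd.DataFrame({"px": rng.normal(size=25_000),
                    "random": rng.uniform(size=25_000)})

# What pdf.query('px>0.1')['random'].sum() boils down to: a boolean mask
# and a reduction over in-memory numpy buffers.
mask = pdf["px"].to_numpy() > 0.1
total = pdf["random"].to_numpy()[mask].sum()
```

So if the RDataFrame calls are paying some per-call setup or I/O cost on every iteration, that alone might explain the gap.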
I am not sure what I’m doing wrong here: am I not actually caching the RDataFrame? Am I accidentally caching both df and df_cached? And why is pandas doing so damn well?
Invoking sys.getsizeof() on df, df_cached and pdf returns 64, 64 and 600144 respectively, but that might just be because RDataFrames are opaque to this method.
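If the sizes matter here, pdf.memory_usage(deep=True).sum() should give a proper per-column byte count for the pandas side, which would be roughly consistent with the ~600 kB above if the tutorial file holds 25,000 entries of three float64 columns. A sketch with a hypothetical frame of that shape:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same three float64 columns as the real pdf.
n = 25_000
pdf = pd.DataFrame({"px": np.zeros(n),
                    "py": np.zeros(n),
                    "random": np.zeros(n)})

# memory_usage reports per-column buffer sizes (plus the index);
# deep=True additionally inspects object-dtype columns, if any.
total_bytes = int(pdf.memory_usage(deep=True).sum())
```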
This was done in a SWAN notebook on the latest software stack (101).
ROOT Version: 6.24/06
Platform: CentOS 7
Compiler: gcc8