Wow, ok I’ll have to study that paper
Unless I'm missing something, two important issues I see are:
- the conclusions in terms of scalability on the original dataset (54M events) are misleading: at least in the case of ROOT (and possibly other frameworks as well), scaling will be much better when it matters more, i.e. when runtimes are longer than a few seconds (because the dataset is larger than 17 GB and/or because you read more of it)
- if you ran the ROOT macros via the ROOT interpreter (as it seems from opendata-benchmarks/run_benchmark.sh at master · masonproffitt/opendata-benchmarks · GitHub), you were running C++ code at O0 optimization: that's "wrong", in the sense that no analysis group that cares about performance would do that. This is separate from the matter of

```cpp
df.Filter("x > 0")
```

vs.

```cpp
df.Filter([] (float x) { return x > 0; })
```

where the latter will give you much better performance, but one could argue that the former is much nicer to write (see the sketch below). In contrast, there is little motivation to run an analysis on 64 cores at O0 optimization
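
To make the two points concrete, here is a minimal sketch of both `Filter` styles in a macro that can be run either interpreted or compiled via ACLiC. The macro name `q1.C`, the tree name `Events`, the file name `data.root` and the branch `x` are all made up for illustration:

```cpp
// q1.C -- hypothetical example; tree/file/branch names are made up.
// Interpreted run:        root -l -b -q q1.C    (cling, effectively O0)
// Compiled run (ACLiC):   root -l -b -q q1.C+   ('+' compiles the macro into
//                         a shared library first, optimized in a standard
//                         ROOT build)
#include <ROOT/RDataFrame.hxx>
#include <iostream>

void q1()
{
   ROOT::RDataFrame df("Events", "data.root");

   // Jitted string expression: concise, compiled by cling at runtime
   // independently of how the macro itself is run.
   auto nJitted = df.Filter("x > 0").Count();

   // Typed lambda: compiled together with the macro, so it gets full
   // compiler optimization when the macro is built via ACLiC (or as a
   // standalone executable compiled with -O2).
   auto nTyped = df.Filter([](float x) { return x > 0; }, {"x"}).Count();

   std::cout << *nJitted << " " << *nTyped << "\n";
}
```

The string expression is jitted by cling at runtime either way, while the lambda's performance depends on how the surrounding macro is compiled, which is why running the benchmarks through the interpreter penalizes it in particular.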