TDataFrame feedback: performance comparisons, df.Count() instability


ROOT Version: v6.12.06 / v6.14.04
Platform: Ubuntu 18.04
Compiler: gcc
Cores: 8


Hi all,

I am exploring various aspects of the very nice new TDataFrame/RDataFrame objects and I have made some performance studies that could possibly be interesting feedback (I apologize in advance if it's not!). I provide a summary of my observations here; all the code/results are available in this github repository (tag v0.1), except the input root file which is ATLAS internal (feedback welcome on how to work around this if someone wants to reproduce these results).

Cheers,
Romain



1. Performance

The following numbers correspond to running over 0.8 million events, computing ~10 new variables (very simple ones - booleans), and saving the selected events (~50%) in a new ROOT file. One detail to mention is that the explicit event loop doesn't use multi-threading, while the DataFrame-based code does.

           TDataFrame (v6.12.06)   RDataFrame (v6.14.04)   Explicit event loop*
C++        261.37 s                57.75 s                 19.5 s
Python     114.82 s                28.50 s                 -

(*) the 10 new boolean variables are not computed in the explicit event loop, but I don't think they are time consuming
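
For context, the DataFrame-based version of this workflow is roughly of the following form (a sketch only; the branch names and cut values are illustrative, not the actual ATLAS ones):

// Sketch of the timed workflow (illustrative branch names and cuts)
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void RunTDF()
{
   ROOT::EnableImplicitMT(); // the DataFrame-based runs use multi-threading

   ROOT::RDataFrame df("nominal_Loose", "input.root");

   // ~10 very simple boolean Defines (only two shown here)
   auto df2 = df.Define("pass_leppt", "lep_pt > 25000.")
                .Define("pass_njet", "njet >= 4");
   // ... 8 more similar Defines ...

   // keep ~50% of the events and write them to a new file
   df2.Filter("pass_leppt && pass_njet")
      .Snapshot("nominal_Loose", "output.root");
}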

Several questions can be raised:

  • why the difference between python and C++ (in v6.14.04 for instance)?
  • why the difference between explicit event loop without MT and DataFrame with MT?
  • is it expected to have such a big difference between TDataFrame and RDataFrame (for instance in python)?

2. Instability of df.Count() results in v6.14.04

While doing these studies, I realized that the number of events in the final dataframe changed from one execution of the exact same code to another, at least in v6.14.04 (it doesn't seem to be the case for v6.12.06). The next table shows the number of selected events extracted in two ways, df.Count() and tree.GetEntries(), in both Python and C++, for four runs of the same code (*).

df.Count() / t.GetEntries()   Python            C++
Run 1                         421186 / 421186   422593 / 421638
Run 2                         421150 / 421150   422595 / 421638
Run 3                         421638 / 421638   423074 / 421638
Run 4                         422593 / 422593   420172 / 421638

Observations

  • in Python the two numbers are always equal but vary from one execution to another
  • in C++ the number of saved events in the actual output tree is constant from one execution to another, while that is not the case for the df.Count() output.

(*) what a “run” means in:

  • Python
python CompareTDF_v14.py
  • C++
root
.L CompareTDF_v14.C+
RunTDF()
.q
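
For clarity, the two numbers compared in the table above are obtained roughly like this (a sketch; df_output is the dataframe after the filters, and the output file/tree names are illustrative):

// Sketch of how the two counts are extracted (illustrative names)
#include <iostream>
#include <TFile.h>
#include <TTree.h>

template <typename RDF>
void CheckCounts(RDF &df_output)
{
   // 1) count directly from the dataframe, after all Filters
   auto n = df_output.Count();
   std::cout << "df.Count():        " << *n << std::endl;

   // 2) independently re-open the file written by Snapshot and count its entries
   TFile f("output.root");
   auto *t = static_cast<TTree *>(f.Get("nominal_Loose"));
   std::cout << "tree.GetEntries(): " << t->GetEntries() << std::endl;
}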

Hi @rmadar ,
this is super useful feedback.

TDataFrame in v6.12 was the first prototype; it's now obsolete and, as you can see, it was also much slower, so I would focus on the numbers you produced for v6.14, C++ and Python.

A few questions before we dig deeper in the performance measurements:

  • RDataFrame parallelizes over TTree clusters. How many clusters does your dataset have? You can check with tree->Print("clusters") or with the method described here
  • from your code it seems that the explicit event loop is doing less work, is it an apples-to-apples comparison?
  • could you produce timings for code compiled with optimizations? (g++ -O3)
  • what are the timings for RDataFrame without MT?
  • how hard would it be to try the same with ROOT master branch?

Regarding the instability of Count(): it looks like a bug, would it be possible to have a standalone reproducer that we can debug?

Many thanks for the super interesting feedback again!
Cheers,
Enrico

Hi @eguiraud,

Thanks for the quick feedback! Please see some replies below.

Cheers,
Romain




Regarding the instability of Count(): it looks like a bug, would it be possible to have a standalone reproducer that we can debug?

Edit: removing the multi-threading solved the df.Count() issue!

That's probably possible, yes, but it will take me a bit of time. Let me keep that in mind for a little later if that's OK. In the meantime, I have noticed something which could be relevant for this (and the other tests): when I compile my macro in ROOT, I get the following warning:

root [0] .L CompareTDF_v14.C+
Info in <TUnixSystem::ACLiC>: creating shared library /home/rmadar/cernbox/PythonDev/HEP/root-dataframe-studies/./CompareTDF_v14_C.so
Warning in cling::IncrementalParser::CheckABICompatibility():
  Possible C++ standard library mismatch, compiled with __GLIBCXX__ '20180415'
  Extraction of runtime standard library version was: '20180720'

  • RDataFrame parallelizes over TTree clusters. How many clusters does your dataset have? You can check with tree->Print("clusters") or with the method described here

Here is the output of tree->Print("clusters"):

[rmadar@clratlport01]:root-dataframe-studies>root input.root 
root [1] nominal_Loose->Print("clusters")
******************************************************************************
*Tree    :nominal_Loose: tree                                                   *
*Entries :   883595 : Total =      6048238919 bytes  File  Size = 2022801237 *
*        :          : Tree compression factor =   2.99                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                883594            1000
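
For completeness, the clusters can also be counted programmatically with TTree::GetClusterIterator (a small sketch using the file/tree names above):

// Sketch: count the clusters of the input tree programmatically
#include <iostream>
#include <TFile.h>
#include <TTree.h>

void CountClusters()
{
   TFile f("input.root");
   auto *tree = static_cast<TTree *>(f.Get("nominal_Loose"));

   auto clusterIt = tree->GetClusterIterator(0);
   Long64_t start = 0;
   int nClusters = 0;
   while ((start = clusterIt()) < tree->GetEntries())
      ++nClusters;

   std::cout << nClusters << " clusters" << std::endl;
}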

  • from your code it seems that the explicit event loop is doing less work, is it an apple to apple comparison?

Yes, you are right, and it was actually mentioned in the original post. The only thing which is not performed in the explicit loop is the calculation of the 10 new booleans. I'll quickly implement the same computation (it's not difficult, just a bit of time).


  • could you produce timings for code compiled with optimizations? (g++ -O3)

OK, to be honest, I have always compiled within ROOT and I am not really able to write a makefile. But I can try to ask for help around.


  • what are the timings for RDataFrame without MT?

That's easy to get (I keep the previous timings for the record). Interestingly, without MT the timing gets closer to the explicit event loop (and is much shorter than with MT!):

              Python    C++
EventLoop     -         19.5 s
DF w/ MT      28.5 s    57.7 s
DF w/o MT     23.3 s    22.1 s
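
For the record, the only difference between the two DF configurations is whether implicit multi-threading is enabled before constructing the dataframe, roughly (a sketch):

// Sketch: the only difference between "w/ MT" and "w/o MT"
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void RunTDF(bool useMT = true)
{
   if (useMT)
      ROOT::EnableImplicitMT(); // "w/ MT": RDF parallelizes over the TTree clusters
   // if it is not called, the event loop runs on a single thread ("w/o MT")

   ROOT::RDataFrame df("nominal_Loose", "input.root");
   // ... same Defines / Filter / Snapshot as before ...
}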

  • how hard would it be to try the same with ROOT master branch?

I am not sure since I have never tried. Is there a compiled version somewhere on lxplus? If yes, that can be tried fairly easily I think.

Some additional follow-up

1. Fair comparison with explicit event loop

from your code it seems that the explicit event loop is doing less work, is it an apple to apple comparison?

This is now done - the code is here - and the time goes from 19.5 s to 35.6 s, meaning that DF w/o MT is now faster (22.1 s).

2. Comparison of “small” vs. “large” datasets

This table summarizes the running time for the operations described at the beginning, for 5 kEvents or 880 kEvents, using the C++ interface with ROOT v6.14.

              5 kEvents   880 kEvents
RDF (w/o MT)  5.6 s       22.1 s
EventLoop     0.35 s      35.6 s

Main question: is it expected to have such a long running time for a few thousand events compared with an explicit loop?

You’re using expressions defined as strings, so the long run time for a small number of events might be the JIT overhead.

The warning you get is because someone updated the gcc version in LCG 94 without also rebuilding ROOT.

ROOT master should be available in LCG dev4.

I also noticed in the C++ test you run Count on df_output, which contains filters.

@rmadar I think we are starting to have a bit too many things in flight at the same time :smile:

These are the two things that I would like to look into a bit better:

1. Multi-thread event loop takes more than twice as long as the single-thread event loop or the Python event loop (single- or multi-threaded)

The main problem with your multi-thread timings is that you only have one cluster in input.root, so multi-threading is expected to buy you nothing. It is not expected to cause a large overhead though, certainly not 57s vs 22s. It would be good to check where the executable spends those extra 30s, I have never seen this effect.

2. Results of Count fluctuate

We have several tests for Count (including multi-thread ones) but we never saw this, and it looks scary.

Given a reproducer, we could look into these issues. Otherwise, good first steps would be to:

  • check which of these issues are still present in the master branch
  • check if C++ compiled with -O3 sees the performance problems
  • check whether anything out of the ordinary is visible in the performance profile produced by perf or vtune
  • check whether valgrind (memcheck, helgrind, drd) complains about anything: valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./repro

Cheers,
Enrico

Hello @eguiraud,

Thanks for the feedback, I have put everything you need on my CERNbox so that you can reproduce and understand the two issues you are mentioning.

Another point which I would like to understand is the very long overhead observed when running on a few thousand events. Indeed, this point currently forces me to switch back to the t->SetBranchAddress() formalism for now (not that I am happy about it, but it saves me a lot of time overall).

Cheers,
Romain

Does switching to lambdas improve the overhead for small numbers of events?

It takes 4.6 s to run over 5 kEvts using lambda functions, so it improved, but just a little bit. There is still a factor of 10 with respect to an explicit event loop. The code I used can be found here.
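
For reference, the change is roughly from jitted string expressions to typed lambdas, e.g. (a sketch with an illustrative branch name and type):

// Sketch: jitted string expression vs. typed lambda
#include <ROOT/RDataFrame.hxx>

void DefineExamples()
{
   ROOT::RDataFrame df("nominal_Loose", "input.root");

   // string expression: the corresponding C++ code is generated and compiled at run time (JIT)
   auto df_str = df.Define("pass_leppt", "lep_pt > 25000.");

   // equivalent typed lambda: nothing to jit for this Define
   auto df_lam = df.Define("pass_leppt",
                           [](float lep_pt) { return lep_pt > 25000.f; },
                           {"lep_pt"});
}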

Thanks,
Romain

Hi @beojan,
jitting should have the same overhead with and without multi-threading, so it does not explain why multi-thread runs take more than double the time.

Thanks a lot,
it’s on my to-do list, will write back here as soon as I have something :slight_smile:


Hi,
some initial feedback:

  • initial exploration suggests that the slowdown of the multi-thread runs w.r.t. single-thread is due to Snapshot, and the reason lies in the clustering of the input file: input.root has 883 very small clusters of 1000 entries each, and RDF divides the processing into 883 different tasks which spend most of their time getting to the data they need to read, then process very little. How was input.root produced? Somehow ROOT master seems not to be affected by this problem, so ROOT v6.16 (released soon) won't be either, but I'd like to understand it more clearly
  • the slowdown of RDF w.r.t. a simple event loop is also due to Snapshot, because that call is just-in-time compiled and the jitted call contains many template parameters (one per output column) – the jitting alone takes ~4 seconds, and you can see it even from htop: even with implicit MT enabled, several seconds are spent in single-thread operations (jitting), then there is a short burst of multi-thread processing
  • from my experiments it looks like the instability of df.Count() is not really a problem of Count(): it’s again a problem of Snapshot, which has a race condition because of which sometimes some entries are not written out. This is the most severe issue and the one I’m debugging at the moment, thank you for reporting it!

The only thing I do not understand is how, in your first post, sometimes df.Count() (on the output of a Snapshot, right?) returns more entries than t.GetEntries() (where t is the output of a normal event loop that writes a tree t, right?). Are you applying the same cuts and all of the cuts in the normal event loop case there?

Cheers,
Enrico

Hello Enrico, thanks!


How was input.root produced? Somehow ROOT master seems to not be affected by this problem, so ROOT v6.16 (released soon) won’t either, but I’d like to see more clearly

This was done with a hadd command over ~10 files. Each of them was produced from a derived xAOD (ATLAS format) using AnalysisTop (internal ATLAS software based on AnalysisBase).


The slowdown of RDF w.r.t. a simple event loop is also due to Snapshot, because that call is just-in-time compiled and the jitted call contains many template parameters (one per output column) – the jitting alone takes ~4 seconds, and you can see it even from htop: even with implicit MT enabled, several seconds are spent in single-thread operations (jitting), then there is a short burst of multi-thread processing

OK, I understand, but I am still a bit puzzled by the last test I did without jitted strings, which takes ~4 s for only 5 kEvts. This is maybe a different issue from w/ MT versus w/o MT, but still…


t.GetEntries() where t is the output of a normal event loop that writes a tree t, right?

Not really, actually. The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command. In other words, I open the output file produced by df.Snapshot(), grab the TTree and check the number of entries there.

I am still a bit puzzled then by the last test I have done without jitted strings

What takes a long time to jit is not the strings, but the call to Snapshot!
Snapshot(...., {"a","b","c"}) is just-in-time compiled to Snapshot<A, B, C>(..., {"a", "b", "c"}), and in your case the list of template parameters is pretty long as you are snapshotting tens of branches. The 4 seconds of jitting should not be a problem for large analyses (even if the analysis is 40 minutes long and runs on N cores, it will still only take 4 seconds to do the jitting of that Snapshot) but of course it skews the timings of small benchmarks by a lot.
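
Concretely, the difference is between the jitted and the explicitly templated call, roughly as follows (a sketch with illustrative column names and types):

// Sketch: jitted vs. explicitly templated Snapshot (illustrative columns)
#include <ROOT/RDataFrame.hxx>

void SnapshotExamples(ROOT::RDataFrame &df)
{
   // jitted call: the column types are inferred and the fully templated call
   // is generated and compiled at run time (the ~4 s offset)
   df.Snapshot("nominal_Loose", "output.root", {"lep_pt", "njet", "pass_leppt"});

   // explicitly templated call: the types are spelled out, nothing left to jit
   df.Snapshot<float, int, bool>("nominal_Loose", "output_typed.root",
                                 {"lep_pt", "njet", "pass_leppt"});
}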

The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command

Uh, very interesting, thanks

The output of t.GetEntries() is obtained from the TTree saved by the df.Snapshot() command

Regarding this little puzzle, my guess is that you did not notice a typo because of which the output file produced by Snapshot was called ouput.root instead of output.root. So you were reading the entries written by Snapshot with out_df.Count(), but when calling t.GetEntries() you were always reading another file (the one previously produced by the python script, correctly named output.root). In fact, the output of t.GetEntries() is always 421638, which is the last number of entries you got in your Python tests according to the README.md attached to the reproducer you shared.

Performance
About the performance issues, you were measuring CPU time, not wallclock time (see std::clock docs). Total CPU time scales with the number of threads! :smile:

Using TStopwatch::RealTime I see wallclock time go from ~18s for a single-thread run on input.root to ~9s for a run on 4 cores. The scaling is still bad, but waaaay better than what we thought before :smile: I think the problem is a combination of bad clustering of the TTree and a constant offset of ~4s required to just-in-time compile your Snapshot.

Wrong number of entries in output
The instability in Count() was really a bug in Snapshot (jira ticket ROOT-9770, thanks for reporting and providing a repro, it was a huge help) that is now fixed in master. The fix will be present in ROOT v6.16 and v6.14/08, both coming soon.

This should solve everything as far as I can tell, but of course I’m available for further clarifications and/or discussions.

Cheers,
Enrico

Hi @eguiraud,

Thanks a lot for the feedback, that’s really interesting (and, well, a bit embarrassing too :wink: )!

Cheers,
Romain



but you were always reading another file

I apologize for this, you’re right! I have just corrected this in my git repository now.



About the performance issues, you were measuring CPU time, not wallclock time

Right! Now I understand the difference (I wasn't really familiar with it before, to be honest). I implemented checks based on clock_t, std::chrono and TStopwatch for input.root and I get the following times with v6.14 (C++ code with lambda functions):

time (s)    clock   chrono   twatch_CPU   twatch_Real
w/ MT       56.3    29.7     56.3         29.7
w/o MT      21.7    22.0     21.7         22.0
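
The checks look roughly like this (a sketch of the timing harness, where RunTDF() stands for the RDataFrame workload being timed):

// Sketch: clock_t measures CPU time, chrono measures wall time, TStopwatch gives both
#include <chrono>
#include <ctime>
#include <iostream>
#include <TStopwatch.h>

void TimeRunTDF()
{
   const std::clock_t c0 = std::clock();
   const auto t0 = std::chrono::steady_clock::now();
   TStopwatch sw;
   sw.Start();

   RunTDF(); // the RDataFrame workload being timed

   sw.Stop();
   const auto t1 = std::chrono::steady_clock::now();
   const std::clock_t c1 = std::clock();

   std::cout << "clock:       " << double(c1 - c0) / CLOCKS_PER_SEC << " s (CPU)\n"
             << "chrono:      " << std::chrono::duration<double>(t1 - t0).count() << " s (wall)\n"
             << "twatch_CPU:  " << sw.CpuTime() << " s\n"
             << "twatch_Real: " << sw.RealTime() << " s\n";
}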

So now, the numbers make more sense, but I still don’t get the same scaling factor as you.

I see wallclock time go from ~18s for a single-thread run on input.root to ~9s for a run on 4 cores

What was the ROOT version for these numbers?



The instability in Count() was really a bug in Snapshot (jira ticket ROOT-9770, thanks for reporting and providing a repro, it was a huge help) that is now fixed in master. The fix will be present in ROOT v6.16 and v6.14/08, both coming soon.

Very nice, I am happy I was able to help!

What was the ROOT version for these numbers?

ROOT master branch, compiled with CMAKE_BUILD_TYPE=Release (i.e. an optimized build).
The timings were for your Compare_v14_NoJittedString.C, to which I added a main function that only calls RunTDF(), and that I compiled with -O3 optimizations.

It would be nice if you could try to repeat those measurements for ROOT’s current master branch and for a version of your script compiled with optimizations (g++ -O3).

You can compile your script with optimizations with g++ -O3 -o tdf_bench CompareTDF_v14_NoJittedString.C $(root-config --libs --cflags) (after adding a main).

Cheers,
Enrico
