TDataFrame feedback: performance comparisons, df.Count() instability

Hi @rmadar ,
this is super useful feedback.

TDataFrame in v6.12 was the first prototype. It is now obsolete and, as you saw, also much slower, so I would focus on the numbers you produced for v6.14, in both C++ and Python.

A few questions before we dig deeper into the performance measurements:

  • RDataFrame parallelizes over TTree clusters. How many clusters does your dataset have? You can check with tree->Print("clusters") or with the method described here
  • from your code it seems that the explicit event loop is doing less work; is it an apples-to-apples comparison?
  • could you produce timings for code compiled with optimizations? (g++ -O3)
  • what are the timings for RDataFrame without implicit multi-threading (MT)?
  • how hard would it be to try the same with the ROOT master branch?
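
For the timings with and without MT, something along these lines might be enough. It is only a sketch: `best_of` and `count_events` are hypothetical helper names, not part of ROOT, and the tree and file names you pass in are whatever your dataset uses. The ROOT import is guarded so that the timing helper itself can be tried without ROOT installed.

```python
import time

try:
    import ROOT  # PyROOT; only needed by count_events below
except ImportError:
    ROOT = None


def best_of(func, repeats=3):
    """Run func() `repeats` times; return (fastest wall time in s, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result


def count_events(tree_name, file_name, use_mt):
    """Count entries with RDataFrame, with or without implicit MT (needs ROOT)."""
    # Toggle implicit MT before the data frame is constructed.
    if use_mt:
        ROOT.ROOT.EnableImplicitMT()
    else:
        ROOT.ROOT.DisableImplicitMT()
    df = ROOT.ROOT.RDataFrame(tree_name, file_name)
    return df.Count().GetValue()
```

With ROOT available, `best_of(lambda: count_events("tree", "data.root", False))` and the same call with `True` would give the two numbers to compare ("tree" and "data.root" are placeholders). Taking the fastest of a few repeats reduces the effect of a cold disk cache on the first run.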

Regarding the instability of Count(): it looks like a bug. Would it be possible to provide a standalone reproducer that we can debug?
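
A reproducer could be as small as the sketch below, which repeats the same count and tallies the values returned. `tally_results` and `count_once` are hypothetical names, and I am assuming the instability shows up when a fresh RDataFrame is built for each run; adapt it to whatever configuration you actually used.

```python
from collections import Counter

try:
    import ROOT  # PyROOT; only needed by count_once below
except ImportError:
    ROOT = None


def tally_results(func, runs=20):
    """Call func() `runs` times and tally how often each return value occurs."""
    return Counter(func() for _ in range(runs))


def count_once(tree_name, file_name):
    """Build a fresh RDataFrame and return its entry count (needs ROOT)."""
    df = ROOT.ROOT.RDataFrame(tree_name, file_name)
    return df.Count().GetValue()
```

With ROOT available, `tally_results(lambda: count_once("tree", "data.root"))` should return a tally with a single key if Count() is stable; more than one key would show the problem ("tree" and "data.root" are placeholders). If you saw the instability only with MT on, call `ROOT.ROOT.EnableImplicitMT()` first.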

Many thanks for the super interesting feedback again!
Cheers,
Enrico