RDataFrame: Should I use Cache() when creating multiple Histo3D plots over the same RDF?

Hi ROOT experts,

I’m working on a multi-threaded analysis pipeline using RDataFrame (RDF), and I’d like your advice regarding the possible performance benefits of using .Cache() in our setup.

Here’s what I’m doing:
• We define a number of variables and loop over them.
• For each variable, we define a derived column using a multiplexer function.
• Then we create 3D histograms (using Histo3D) for each variable, with axes: variable, dataset index, and filter index.
• We do this for a single RDF for now, but we may create additional RDF instances in the future (for different purposes).

This means I loop over all variables, define a derived column per variable (once), and create multiple histograms over the same rdf.
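For concreteness, here is a minimal PyROOT sketch of the setup described above. It is an illustration under stated assumptions, not the actual analysis code: the column names (`pt`, `eta`, `dataset_idx`, `filter_idx`), the binning, and the `multiplexer_placeholder` function are all hypothetical stand-ins.

```python
def book_histograms(rdf, variables, n_datasets=2, n_filters=3):
    """Book one Define and one Histo3D per variable on the same RDF.

    Column names, binning and the multiplexer expression are hypothetical.
    """
    histos = {}
    for var in variables:
        derived = f"{var}_mux"
        # Stand-in for the real multiplexer(...) call (hypothetical name).
        node = rdf.Define(derived, f"multiplexer_placeholder({var})")
        # TH3D model: name, ";x;y;z" axis titles, then bins/range per axis.
        model = (f"h_{var}", f";{var};dataset index;filter index",
                 50, 0.0, 1.0,
                 n_datasets, 0.0, float(n_datasets),
                 n_filters, 0.0, float(n_filters))
        # Booking is lazy: the event loop has not run at this point.
        histos[var] = node.Histo3D(model, derived, "dataset_idx", "filter_idx")
    return histos


def main():
    """Toy driver; needs a ROOT installation with PyROOT."""
    import ROOT

    ROOT.EnableImplicitMT()
    # Hypothetical placeholder for the multiplexer function.
    ROOT.gInterpreter.Declare(
        "double multiplexer_placeholder(double x) { return 0.5 * x; }")
    # Toy in-memory dataset standing in for the real input.
    rdf = (ROOT.RDataFrame(1000)
           .Define("pt", "(rdfentry_ % 100) / 100.0")
           .Define("eta", "(rdfentry_ % 50) / 50.0")
           .Define("dataset_idx", "double(rdfentry_ % 2)")
           .Define("filter_idx", "double(rdfentry_ % 3)"))
    histos = book_histograms(rdf, ["pt", "eta"])
    # Accessing the first result triggers the event loop.
    for var, h in histos.items():
        print(var, h.GetEntries())
```

The question is whether a `.Cache()` call would belong inside the loop after `rdf.Define(...)`, or once on `rdf` before `book_histograms` is called. Running the toy driver means calling `main()` in an environment where `import ROOT` works.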

Would using .Cache() after the .Define(…) step be beneficial in this case? Specifically, would it improve performance (e.g., by avoiding re-evaluation of the multiplexer(…) function for each histogram)? Or does RDF already optimize this internally when only one Define and one Histo3D are booked per variable?

Also, would .Cache() be more appropriate at the full dataframe level (rdf.Cache()) before the loop, or per derived variable?

Thanks in advance for your insights!

Best regards,
Alexandros

I think @vpadulan can help here

Hi all,

any feedback on this? That would be greatly appreciated.

Cheers,
Alexandros