RDataFrame: Should I use Cache() when creating multiple Histo3D plots over the same RDF?

Hi ROOT experts,

I’m working on a multi-threaded analysis pipeline using RDataFrame (RDF), and I’d like your advice regarding the possible performance benefits of using .Cache() in our setup.

Here’s what I’m doing:
• We define a number of variables and loop over them.
• For each variable, we define a derived column using a multiplexer function.
• Then we create 3D histograms (using Histo3D) for each variable, with axes: variable, dataset index, and filter index.
• We do this for a single rdf, though we may create additional rdf instances in the future (for different purposes).

This means I loop over all variables, define a derived column per variable (once), and create multiple histograms over the same rdf.

Would using .Cache() after the .Define(…) step be beneficial in this case? Specifically, would it improve performance (e.g., avoiding re-evaluating the multiplexer(…) function for each histogram)? Or does RDF already optimize this internally when only one Define and one Histo3D is made per variable?

Also, would .Cache() be more appropriate at the full dataframe level (rdf.Cache()) before the loop, or per derived variable?

Thanks in advance for your insights!

Best regards,
Alexandros

I think @vpadulan can help here

Hi all,

any feedback on this? That would be greatly appreciated.

Cheers,
Alexandros

Hi all,

any feedback would be greatly appreciated. Perhaps @vpadulan as already suggested?

Cheers,
Alexandros

Hello @attikis,

I’m not an RDF expert, but judging from the documentation and from what you said, specifically:

"Or does RDF already optimize this internally when only one Define and one Histo3D is made per variable?"

it would seem to me that Cache() would not help you here. The RDF doc says:

"Use Cache if you know you will only need a subset of the (Filtered) data that fits in memory and that will be accessed many times."

so if you are creating only one histogram per variable, I would not expect the data to be accessed many times. In that case Cache() would only make things slower by making an additional in-memory copy of the data. However, as is always the case with performance optimization, the only way to know for sure is to measure, so I would try that if you’re unsure.
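For the measurement itself, a minimal timing harness along these lines might be enough. It is only a sketch and uses no ROOT calls: time_variant, compare, run_without_cache and run_with_cache are hypothetical names, and the two workloads are stand-ins. In a real test, each function would build the dataframe, book all the Histo3D results, and trigger the event loop, once with and once without Cache(). Since RDF is lazy, the timed function has to include the step that actually reads the histograms.

```python
import statistics
import time


def time_variant(run, repeats=5):
    """Return the wall-clock durations (in seconds) of `repeats` calls to `run`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


def compare(variants, repeats=5):
    """Time each named variant and return {name: median duration in seconds}."""
    return {
        name: statistics.median(time_variant(run, repeats))
        for name, run in variants.items()
    }


# Placeholder workloads (hypothetical, NOT ROOT code). Replace each body with
# the full pipeline: Define the columns, book the histograms, trigger the loop.
def run_without_cache():
    # Compute values on the fly and consume them once.
    sum(i * i for i in range(50_000))


def run_with_cache():
    # First materialise an in-memory copy, then consume it once.
    data = [i * i for i in range(50_000)]
    sum(data)


if __name__ == "__main__":
    results = compare({"no cache": run_without_cache, "cache": run_with_cache})
    for name, seconds in results.items():
        print(f"{name}: median {seconds * 1e3:.2f} ms")
```

Taking the median over a few repeats reduces the effect of one-off noise such as a cold disk cache on the first run. For a multi-threaded job it is worth repeating the comparison with the same thread count you use in production.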