ROOT Version: 6.14.04 Platform: not relevant Compiler: not relevant
For me the RDataFrame has large similarities with a memory-resident ntuple.
With an ntuple, if created in procedure X, it can easily be retrieved through the gDirectory in procedure Y.
How do I do that with an RDataFrame? Making a snapshot in X and retrieving it in Y by accessing the created TTree on disk through the appropriate constructor is
wasteful in I/O. The data was created in memory anyway and would be better reused from there.
I think the functionality covered by Cache could help to satisfy this use case.
The idea is to transfer to memory the content of a data frame, irrespective of how it was created (from a data source, from scratch, from a TFile…).
Here you have the relevant example and here a minimal snippet:
ROOT::RDataFrame df("myTreeName", "myFileName.root");
auto cached_df = df.Cache(); // <-- now cached_df wraps a dataset in memory
Note that the cache is actually filled the first time cached_df is used, in keeping with the general philosophy of data frames.
Hi Eddy,
to add to Danilo’s reply, note that dataframe variables are quite lightweight and you can pass them around (both by value and by reference is fine). All dataframe variables are convertible to the same common type ROOT::RDF::RNode, which makes it possible to write functions that manipulate dataframes without resorting to templates:
So you can have different functions performing operations on dataframes and you can pass them around like
usual variables. “ROOT7”/modern ROOT interfaces do not rely on global state as much, and dataframes are not registered in gDirectory.
and RDF will take the type of whatever dataframe node you pass to the function.
We are moving further from the original topic of the question though
My point to Eddy’s question was simply that RDF does not rely on gDirectory and ROOT’s global state and you have to treat dataframes as standard C++ variables: if you want to create a dataframe in a function and use it from another function you “simply” pass the variable around (in v6.14 this is made less trivial by the fact that you cannot cast every dataframe variable to the same common type).
I believe that the answers did not address my issue yet, I guess I did not state it clearly.
There is a lot of code that does something along the following lines:
that is, a tree is created somewhere and retrieved later through the gDirectory or a global variable. I do not know how to accomplish this with RDataFrame. Sure, it is easy to
pass an RDataFrame “down” to a procedure by reference, but how do I pass it “up”?
Let me give some incorrect(!) concrete examples as discussion points.
Hi Eddy,
as I said we do not support writing dataframes to TDirectories.
You can certainly use a global variable: if you make it a std::unique_ptr<ROOT::RDF::RNode> you get reasonable lifetime management and the possibility to store any dataframe node.
We do not encourage this programming model though: global state is bad.
As @sbinet suggests, it would probably take a bit of refactoring, but it’s certainly possible to change the code so that dataframes are passed around rather than stored and retrieved. I realize this is a change of paradigm and possibly not a welcome one, but I have seen main functions of RDF analyses look pretty much like this:
auto filtered_df = ApplySelections(MakeRDF());
auto histos = BookHistograms(DefineQuantities(filtered_df));
auto cutReports = filtered_df.Report();
cutReports->Print();
and as far as I can tell this is just as modular and flexible as the approach with a global dataframe variable, without the downsides of global state.
Also consider that building an RDF computation graph is a fairly lightweight operation, so it doesn’t save much time to store an RDF to a TFile on disk for later usage: re-building the computation graph at every execution costs nothing (w.r.t. the actual computation that RDF performs).
I’m sorry I don’t have an answer more in line with the question
Cheers,
Enrico
It does not work because the necessary constructors have been (luckily) disabled.
Enrico, please give a little skeleton code explaining what ApplySelections (returning an RDataFrame) in "auto filtered_df = ApplySelections(MakeRDF());" would look like.
In C++11 you can use ROOT::RDF::RNode as the return type with a small performance hit (I would be curious to see if the performance hit of using a ROOT::RDF::RNode is measurable at all in your application, I bet it isn’t).
Your example was very enlightening! The necessary constructors are
sitting at a different place …
Up to this point I have only been playing with the (very good) tutorials dealing
with RDataFrame. It would be helpful to add a few examples showing how to
carry around these containers in the code.