How to create a global (persistent) RDataFrame

Eddy_Offermann · September 30, 2018, 12:52am

ROOT Version: 6.14.04
Platform: not relevant
Compiler: not relevant

To me, RDataFrame closely resembles a memory-resident ntuple.
An ntuple created in procedure X can easily be retrieved through gDirectory in procedure Y.

How do I do that with an RDataFrame? Making a snapshot in X and retrieving it in Y by accessing the created TTree on disk through the appropriate constructor
wastes I/O: the data were already created in memory and would be better reused from there.

-Eddy

Danilo · October 1, 2018, 6:40am

Hi Eddy,

I think the functionality covered by Cache could help to satisfy this use case.
The idea is to transfer the content of a data frame to memory, irrespective of how it was created (from a data source, from scratch, from a TFile…).
Here you have the relevant example and here a minimal snippet:

ROOT::RDataFrame df("myTreeName", "myFileName.root");
auto cached_df = df.Cache(); // <-- now cached_df wraps a dataset in memory

Note that the cache is actually filled the first time cached_df is used, in line with the general lazy philosophy of data frame.

Cheers,
D

eguiraud · October 1, 2018, 8:08am

Hi Eddy,
to add to Danilo’s reply, note that dataframe variables are quite lightweight and you can pass them around (both by value and by reference is fine). All dataframe variables are convertible to the same common type ROOT::RDF::RNode, which makes it possible to write functions that manipulate dataframes without resorting to templates:

void PlotDF(ROOT::RDF::RNode df) { ... }
auto df = ROOT::RDataFrame(10).Define("x", "rand()").Define("y", "x*x");
PlotDF(df);

So you can have different functions performing operations on dataframes and you can pass them around like
usual variables. “ROOT7”/modern ROOT interfaces do not rely on global state as much, and dataframes are not registered in gDirectory.

Cheers,
Enrico

Suyong_Choi · October 1, 2018, 10:04am

Hi Enrico,

I think ROOT::RDF::RNode is not defined in ROOT 6.14. How could I do it in 6.14?

Regards,
Suyong

eguiraud · October 1, 2018, 10:20am

Hi Suyong,
sorry, do what exactly?

If you want to define a function that can take in a generic dataframe variable, you can make it a template function in v6.14:

template <typename RDF>
void PlotDF(RDF df) { ... }

and RDF will take the type of whatever dataframe node you pass to the function.

We are moving further from the original topic of the question, though.

My point to Eddy’s question was simply that RDF does not rely on gDirectory and ROOT’s global state and you have to treat dataframes as standard C++ variables: if you want to create a dataframe in a function and use it from another function you “simply” pass the variable around (in v6.14 this is made less trivial by the fact that you cannot cast every dataframe variable to the same common type).

Cheers,
Enrico

Eddy_Offermann · October 2, 2018, 1:49am

Hi,

I believe the answers have not addressed my issue yet; I guess I did not state it clearly.
There is lots of code that does something along the following lines:

void createTree()
{ 
   gDirectory->cd(0);
   TTree *t = new TTree("mytree","sloppy");
}

void useTree()
{
   createTree();
   TTree *t = static_cast<TTree *>(gDirectory->Get("mytree"));
}

so a tree is created somewhere and retrieved later through gDirectory or a global variable. I do not know how to accomplish this with RDataFrame. Sure, it is easy to
pass an RDataFrame “down” to a procedure by reference, but how do I pass it “up”?

Let me give some incorrect(!) concrete examples as discussion points.

void createFrame(ROOT::RDataFrame &d_in,ROOT::RDataFrame &d_out)
{ 
   d_out = d_in.Define("b1",[]() { return 1; });
}

void useFrame()
{
   ROOT::RDataFrame d1(10);
   ROOT::RDataFrame d2;
   createFrame(d1,d2);
}

or what about something with a global variable

ROOT::RDataFrame *gd = nullptr;
  
void createFrame(ROOT::RDataFrame &d_in)
{
   gd = new ROOT::RDataFrame(d_in.Define("b1",[]() { return 1; }));
}

void useFrame()
{
   ROOT::RDataFrame d1(10);
   createFrame(d1);
}

How would I do something like that in this new paradigm?

-Eddy

sbinet · October 2, 2018, 5:20am

Have createFrame return the data frame directly instead of void.

This will involve some amount of refactoring.
But it will also remove a bunch of magical side effects which are hard to grok at scale.
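
A minimal sketch of that refactoring, written in plain C++ so that it compiles without ROOT. ToyFrame, CreateFrame and UseFrame are hypothetical stand-ins invented for illustration, not ROOT classes or ROOT's implementation; the only point is that the factory hands a cheap, lazily evaluated handle "up" by return value instead of parking it in global state.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dataframe node: a cheap, copyable handle
// to a shared, lazily evaluated recipe. This is NOT how ROOT implements RDF.
class ToyFrame {
public:
   using Recipe = std::function<std::vector<int>()>;

   // A frame whose rows are 0, 1, ..., nRows-1.
   explicit ToyFrame(int nRows)
      : fRows(std::make_shared<Recipe>([nRows] {
           assert(nRows >= 0);
           std::vector<int> rows(nRows);
           for (int i = 0; i < nRows; ++i) rows[i] = i;
           return rows;
        })) {}

   // Returns a new handle; nothing is computed until Count() is called.
   ToyFrame Filter(std::function<bool(int)> keep) const {
      auto parent = fRows;
      return ToyFrame(std::make_shared<Recipe>([parent, keep] {
         std::vector<int> out;
         for (int row : (*parent)())
            if (keep(row)) out.push_back(row);
         return out;
      }));
   }

   // Triggers the evaluation and returns the number of surviving rows.
   std::size_t Count() const { return (*fRows)().size(); }

private:
   explicit ToyFrame(std::shared_ptr<Recipe> rows) : fRows(std::move(rows)) {}
   std::shared_ptr<Recipe> fRows;
};

// Factory: builds a derived frame and returns it to the caller.
ToyFrame CreateFrame(const ToyFrame &in) {
   return in.Filter([](int row) { return row % 2 == 0; });
}

// Caller: no global lookup, just an ordinary local variable.
std::size_t UseFrame() {
   ToyFrame d1(10);
   ToyFrame d2 = CreateFrame(d1);
   return d2.Count();
}
```

Because the handle only shares ownership of the recipe, returning it by value copies a pointer, not the data.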

eguiraud · October 2, 2018, 7:46am

Hi Eddy,
as I said we do not support writing dataframes to TDirectories.

You can certainly use a global variable: if you make it a std::unique_ptr<ROOT::RDF::RNode> you get reasonable lifetime management and the possibility to store any dataframe node.
We do not encourage this programming model though: global state is bad.

As @sbinet suggests, it would probably take a bit of refactoring, but it’s certainly possible to change the code so that dataframes are passed around rather than stored and retrieved. I realize this is a change of paradigm and possibly not a welcome one, but I have seen main functions of RDF analyses look pretty much like this:

auto filtered_df = ApplySelections(MakeRDF());
auto histos = BookHistograms(DefineQuantities(filtered_df));
auto cutReports = filtered_df.Report();
cutReports->Print();

and as far as I can tell this is just as modular and flexible as the approach with a global dataframe variable, without the downsides of global state.

Also consider that building an RDF computation graph is a fairly lightweight operation, so it doesn’t save much time to store an RDF to a TFile on disk for later usage: re-building the computation graph at every execution costs nothing (w.r.t. the actual computation that RDF performs).

I’m sorry I don’t have an answer more in line with the question.
Cheers,
Enrico

Eddy_Offermann · October 2, 2018, 12:35pm

Hi sbinet,

I think you suggest something along these lines:

ROOT::RDataFrame createFrame(ROOT::RDataFrame &d_in)
{ 
  return d_in.Define("b1",[]() { return 1; });
}

void useFrame()
{
  ROOT::RDataFrame d1(10);
  ROOT::RDataFrame d2 = createFrame(d1);
}

It does not work because the necessary constructors have been (luckily) disabled.

Enrico, please give a little skeleton code showing what ApplySelections (returning an RDataFrame) in
"auto filtered_df = ApplySelections(MakeRDF());" would look like.

Thanks, Eddy

eguiraud · October 2, 2018, 1:19pm

Hi Eddy,
with C++14’s automatic return type deduction you could write it as

auto ApplySelections(ROOT::RDF::RNode df)
{
   return df.Filter(cut1, {"col1"}).Filter(cut2, {"col2"});
}

or (slightly more efficient, but requires a template and people usually try to stay away from templates)

template <typename RDF>
auto ApplySelections(RDF df)
{
  return df.Filter(cut1, {"col1"}).Filter(cut2, {"col2"});
}

In C++11 you can use ROOT::RDF::RNode as the return type with a small performance hit (I would be curious to see if the performance hit of using a ROOT::RDF::RNode is measurable at all in your application, I bet it isn’t).
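
To make the trade-off concrete, here is a toy sketch in plain C++ that compiles without ROOT. Source, Filtered, AnyNode and CountAccepted are hypothetical names made up for this illustration and say nothing about ROOT's internals; AnyNode merely plays the role of a common type that every node converts to. The template version keeps the concrete node type, while the common-type version pays one indirect call per evaluation.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical toy nodes (NOT ROOT classes). Every Filter() call yields a
// node of a new concrete type.
struct Source {
   bool Accept(int) const { return true; }
};

template <typename Parent, typename Pred>
struct Filtered {
   Parent parent;
   Pred pred;
   bool Accept(int row) const { return parent.Accept(row) && pred(row); }
};

template <typename Node, typename Pred>
Filtered<Node, Pred> Filter(Node node, Pred pred) {
   return {std::move(node), std::move(pred)};
}

// Common type that any node converts to; calls go through std::function.
class AnyNode {
public:
   template <typename Node>
   AnyNode(Node node) : fAccept([node](int row) { return node.Accept(row); }) {}
   bool Accept(int row) const { return fAccept(row); }

private:
   std::function<bool(int)> fAccept;
};

// Template version: the concrete type is preserved (needs C++14 for auto).
template <typename Node>
auto ApplySelectionsT(Node node) {
   return Filter(Filter(std::move(node), [](int r) { return r > 2; }),
                 [](int r) { return r % 2 == 0; });
}

// Common-type version: valid C++11, explicit return type, type-erased result.
AnyNode ApplySelections(AnyNode node) {
   return Filter(Filter(std::move(node), [](int r) { return r > 2; }),
                 [](int r) { return r % 2 == 0; });
}

// Counts how many of the rows 0..nRows-1 pass all filters of a node.
template <typename Node>
int CountAccepted(const Node &node, int nRows) {
   assert(nRows >= 0);
   int n = 0;
   for (int row = 0; row < nRows; ++row)
      if (node.Accept(row)) ++n;
   return n;
}
```

Both versions select the same rows; they differ only in whether the caller sees the concrete type or the erased one.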

Cheers,
Enrico

Eddy_Offermann · October 2, 2018, 2:21pm

Hi Enrico,

Your example was very enlightening! The necessary constructors are
sitting at a different place …

Up to this point I have only been playing with the (very good) tutorials dealing
with RDataFrame. It would be helpful to add a few examples showing how to
carry around these containers in the code.

For me this issue is closed.

Thanks, Eddy

eguiraud · October 2, 2018, 3:10pm

Hi Eddy,
good! Yes if you mean ROOT::RDF::RNode, it’s on my to-do list to advertise this better with a tutorial and a clearer mention in the docs.

Cheers,
Enrico

Eddy_Offermann · October 2, 2018, 3:27pm

Hi Enrico,

great!

Yes, some tutorials/doc explaining the relation between these different classes
(RDataFrame <=> RDF <=> …) would be helpful.

Remember: you code, we follow …

Going forward unafraid
