RDataFrame: using Snapshot() after Foreach()

bfontana · July 31, 2019, 1:16pm

I have been trying to call the Snapshot() method after Foreach() while using the RDataFrame class, but I get the following error:

error: invalid use of 'void'

Pseudo-code that corresponds to what I wrote:

ROOT::EnableImplicitMT(10);
ROOT::RDataFrame d("tree_name", "file_name.root");
d.Define("some_variable", "new_variable")
.Foreach(lambda_function, {"var1", "var2", ...})
.Snapshot("new_data", "new_file.root", {"var1"});

It gives the same error when using ForeachSlot() instead.
The lambda function was defined as follows:

auto func = [&](float var1, float var2, ...) { /*content*/ };

I also tried to explicitly specify its return type with -> void to no avail.

ROOT Version: 6.14/09
Compiler: gcc 7.4.1

eguiraud · July 31, 2019, 1:30pm

Hi,
as per the docs, Foreach is an “instant action” that triggers the event loop on the spot, and does not return anything (i.e. Foreach returns void and you can’t call Snapshot on void).
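To see where the compiler message comes from, here is a minimal sketch in plain C++. It does not use ROOT at all: `ToyFrame` and `foreach_returns_void` are hypothetical names made up for this illustration, and only the `void` return type mirrors what the docs say about Foreach.

```cpp
#include <type_traits>

// Hypothetical stand-in for a dataframe node; this is not a ROOT class.
struct ToyFrame {
    int loops_run = 0;

    // Like Foreach as described above: runs the loop on the spot and returns nothing.
    template <typename F>
    void Foreach(F&& f) {
        ++loops_run;
        for (int entry = 0; entry < 3; ++entry) f(entry);
    }
};

// The call expression has type void, so there is no object on which a
// further method could be called. Writing frame.Foreach(noop).Snapshot(...)
// would be rejected by the compiler for exactly that reason.
inline bool foreach_returns_void() {
    ToyFrame frame;
    auto noop = [](int) {};
    return std::is_void<decltype(frame.Foreach(noop))>::value;
}
```

Specifying `-> void` on the lambda does not help because the `void` in the error is the return type of the method being chained on, not of the lambda passed to it.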

You can run

d.Foreach(lambda_function, {"var1", "var2", ...});
d.Snapshot("new_data", "new_file.root", {"var1"});

instead, which is more explicit about the fact that there are two event loops being executed, since both Foreach and Snapshot are instant actions.

To run everything in one event loop, you can make Snapshot lazy:

ROOT::RDF::RSnapshotOptions opts;
opts.fLazy = true;
auto snapshot = d.Snapshot("new_data", "new_file.root", {"var1"}, opts); // event loop not run here; keep the result in scope until it has run
d.Foreach(lambda_function, {"var1", "var2", ...}); // event loop always run on a Foreach

Cheers,
Enrico

bfontana · July 31, 2019, 2:35pm

Thanks for such a quick answer.
Just out of curiosity, is there any reason why Foreach() cannot be made lazy as well?

eguiraud · July 31, 2019, 2:41pm

RDataFrame actions return (smart pointers to) results, and the event loop is triggered upon first access to any of the results. Foreach does not produce any result, so there is no natural trigger for the event loop other than the call to the method itself. We could have had Foreach return a dummy result object that contains nothing, but we felt it would have been more awkward than the actual solution. Arguable decision, admittedly, but since it’s easy to make things work as you want (just make everything else lazy and call Foreach last) I don’t think it’s a big issue.
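As a rough illustration of that trigger logic, here is a toy model in plain C++. It is only a sketch under my own assumptions and is not how RDataFrame is implemented: `ToyLoop`, `ToySum`, `Book`, `Run` and `LoopsRun` are all hypothetical names. A lazy action books work and hands back a result that runs the loop on first access, while `Foreach` has no result to hand back and therefore runs the loop immediately, taking any previously booked work along in the same pass.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not ROOT's implementation.
class ToyLoop {
public:
    explicit ToyLoop(int n_entries) : n_entries_(n_entries) {}

    // Register a per-entry callback to be executed by the next loop.
    void Book(std::function<void(int)> action) { booked_.push_back(std::move(action)); }

    // One pass over all entries, executing every booked callback.
    void Run() {
        for (int entry = 0; entry < n_entries_; ++entry)
            for (auto& action : booked_) action(entry);
        booked_.clear();
        ++loops_run_;
    }

    // "Instant" action: it returns nothing, so the call itself is the trigger.
    template <typename F>
    void Foreach(F f) {
        Book(std::move(f));
        Run();
    }

    int LoopsRun() const { return loops_run_; }

private:
    int n_entries_;
    int loops_run_ = 0;
    std::vector<std::function<void(int)>> booked_;
};

// "Lazy" action: booking happens in the constructor, the loop only runs on
// first access to the value (unless some other trigger already ran it).
class ToySum {
public:
    explicit ToySum(ToyLoop& loop)
        : loop_(&loop), sum_(std::make_shared<long>(0)), booked_at_(loop.LoopsRun()) {
        auto sum = sum_;
        loop.Book([sum](int entry) { *sum += entry; });
    }

    long GetValue() {
        if (loop_->LoopsRun() == booked_at_) loop_->Run(); // first access triggers the loop
        return *sum_;
    }

private:
    ToyLoop* loop_;
    std::shared_ptr<long> sum_;
    int booked_at_;
};
```

In this toy, constructing a `ToySum` and then calling `Foreach` fills both in a single pass, and a later `GetValue()` does not start a second one. That is the same pattern as "make everything else lazy and call Foreach last".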

Cheers,
Enrico

bfontana · July 31, 2019, 3:08pm

There is maybe an issue with that approach: how would you work with Snapshot(), Foreach() and some transformations?

In case one wants to snapshot a new variable processed by Foreach(), one can try:

d.Define("var", "/*return statement*/").Snapshot(..., {"var"});
d.Foreach(..., {"var"});

but this will fail because Foreach() is called on d, which does not know about the column defined in the previous line. If, instead, Snapshot() is called after Foreach(), the former will be the one without access to “var”.
Is it possible to chain some “lazy” transformations, or does a better approach exist?

eguiraud · July 31, 2019, 3:11pm

Hi,
of course it’s possible. See the user guide I linked above as well as the tutorials.

For example:

auto df_with_define = df.Define("some_var", some_function);
df_with_define.Snapshot(...);
df_with_define.Foreach(...);

Cheers,
Enrico
