RDataFrame argument passing cont'd (lazy evaluation)

Hi,

I still seem to have issues passing RDataFrame arguments.
I have attached a condensed example with data file.

I read a csv file into a RDataFrame (MakeCvsDataFrame) followed by defining new columns (MakeDataFrame). Finally the dataframe is filtered (FilterDataFrame), but there is a small twist. The contents is “unpacked” through “Take” statements into vectors. (All this to implement Pandas groupby functionality), processed and put back into a RDataFrame. In each routine the “fCount” column is successfully summed. However, if this summation is done in the main body on the final RDataFrame object, a crash occurs . What went wrong ?!

-Eddy

void mytest()
{
  std::string fname_csv  = "ticket.txt";
  auto d_csv = ROOT::RDF::MakeCsvDataFrame(fname_csv);

  auto d = MakeDataFrame(d_csv);

  std::string select = "fCountry==\"CH\"&&fSYear==2016";
  auto d_select = FilterDataFrame(d,select);

  Long64_t count_sum = 0;
  d_select.Foreach([&count_sum](Long64_t count){count_sum += count;},{"fCount"} );
  std::cout << "sum counts in d_select: " << count_sum << std::endl;
}

mytest.C (4.4 KB)
ticket.txt (974 Bytes)
_ROOT Version: 6.20.02
_Platform: MacOSX
Compiler: Not Provided


Hi,
one problem I see is that data is local to FilterDataFrame and you capture it by reference, so outside of FilterDataFrame you have a dangling reference used within RDF.

As an aside, you can use Sum("fCount") or Sum<Long64_t>("fCount") instead of the Foreach to do the count (unrelated to the crash).

Cheers,
Enrico

Hi Enrico,

Thank you for your answer. My tests are of course a bit contrived pieces
of code trying to show my issue. Normally I would use “Sum”.

Coming to the issue. The variable data in the routine “FilterDataFrame”
is local and passed in the lambda function as a reference:

std::vector<TData> data;
.
.
auto tdf_cons =
    tdf_empty.DefineSlotEntry("data",[&data](unsigned int /*slot*/,ULong64_t entry) {return data[entry]; } )

Indeed, when I take the reference away the code works. I am a bit at a loss
here, why would passing an argument as a reference cause a dangle issue ?

-Eddy

why would passing an argument as a reference cause a dangle issue

The lambda that RDataFrame uses has a reference to data, and accesses data by reference. If data goes out of scope, that lambda will still try to access data via that reference, but it will now try to read an already deleted object (which is undefined behavior, but in your case, as it often happens, it leads to a segfault).

I understand, but the object data will go out of scope anyhow, so way is the fact that I reference it relevant :flushed: Is that something specific about the implementation in RDataFrame or just in general true for lambda functions.

I was hoping that my “Foreach” action was forcing the execution of the
lambda code.

How about you use: [=data]

I was hoping that my “Foreach” action was forcing the execution of the lambda code.

It does, and then the lambda is also executed when the Foreach in mytest runs.

RDataFrame is not doing anything special other than executing the code you pass to it when you trigger an event loop (e.g. by calling Foreach).

why is the fact that I reference it relevant

Accessing a reference to an object that is out of scope is undefined behavior in C++. I don’t know if that’s the only problem, but it definitely looks like a bug. This is a simplified version of what’s happening:

ROOT::RDF::RNode DefineAThing(ROOT::RDF::RNode& df) {
  int x = 42;
  return df.Define("x", [&x] { return x; });
}

int main() {
  ROOT::RDataFrame df(1);
  auto df_with_x = DefineAThing(df);
  df_with_x.Foreach([](int x) { std::cout << x << std::endl; }, {"x"});
  return 0;
}

That Foreach requests “x”, so the lambda that produces it is invoked, and that lambda will try to access the x variable that has gone out of scope in the meanwhile.

Hope this clarifies things.
Cheers,
Enrico

Hi Wile, that indeed avoids the problem but has the undesirable side-effect that it copies the object, which can be very large. -Eddy

So, maybe you could also use: std::vector<TData> *data = new ...;

Indeed, but than I probably have a memory leak

How about some “std::shared_ptr” or “std::unique_ptr” then (that’s just an idea, not that I have anything that you could use right now, sorry)?

I would just split the function in two, return data to the main scope from the first split function, then pass it to the second split function. Returning a local variable is an implicit std::move, so no copies.

Hi Enrico,

Great I learned something and quite frankly I am a bit in shock.
I thought that RDataFrame had a lazy evaluation implementation
that only would execute its action once(!) when necessary. Turns out
that it executes it each time.

So I thought that enforcing an action at the end of the routine would
fill out all the columns for once and for all. I start to realize why my
large code lights up like a Christmas tree :crazy_face:

ROOT::RDF::RNode DefineAThing(ROOT::RDF::RNode df) {
  int x = 42;
  auto df1 = df.Define("x", [&x] { std:cout << "Hello lambda" << std::endl; return x; });
  df1.Foreach([](int x) { std::cout << x << std::endl; }, {"x"});
  return df1;
}

int mytest10() {
  ROOT::RDataFrame df(1);
  auto df_with_x = DefineAThing(df);
  for (int i = 0; i < 5; i++)
    df_with_x.Foreach([](int x) { std::cout << x << std::endl; }, {"x"});
  return 0;
}

So lesson learned if you have a RDataFrame that is expensive to calculate
make a snapshot.

-Eddy

Thank you Wile,
I completely misunderstood the RDataFrame implementation, just learned
from the tutorials that show only limited ways of using it.

I thought that RDataFrame had a lazy evaluation implementation that only would execute its action once(!) when necessary

I’m not sure how that could work in general, the memory requirement for caching the result of all Filters and all Defines in an event loop could potentially be enormous. RDF could in principle have in internal caching mechanism for a small enough amount of Define/Filter evaluations, but it does not (nor it claims to).

Typically when tutorials say RDF is lazy and only runs one event loop, when necessary, what is meant is this:

RDataFrame df(...);
auto h1 = df.Histo1D("x"); // lazy, nothing is actually run
auto h2 = df.Histo1d("y"); // lazy too
h2->Draw(); // event loop necessary, so run, once, filling both histos

About Defines: they are executed at most once per entry per event loop. Their evaluation is also lazy, and cached per event: if, e.g. due to a Filter, the result of a Define is not required in a given event, it is not evaluated; if the result of a Define is required multiple times when processing a given event, it is not re-evaluated.

Any suggestion for improvements to tutorials and docs is warmly welcome. Actual pull requests doubly so :smile:

Cheers,
Enrico

Hi Enrico,

For some reason, I thought that lazy meant that the columns would be
calculated once needed and then stored and an internal status flag would
indicate that. I do now understand the current implementation and will
program accordingly.

In my code the storage element data is now carried around with the
lambda function through the whole code so that at an action point the
RDataFrame can be calculated. Now remember that I started with
having references to RDataFrame objects in order to avoid this. The
compiler could not handle it.

It seems to me that given the way RDataFrame is implemented I should
make judiciously Snapshots at certain point of the graph to avoid repeated
expensive calculations.

If I reach the finish, I will try to make suggestions for the tutorials.
Pull requests are a stretch, you want ROOT to show best practices :wink:.

-Eddy

Hi,
Cache might be what you are looking for.

Cheers,
Enrico

Hi Enrico,

I wanted to test/show something with Cache but encounter the following problem.
I take from the tutorials the usual routine to fill the data frame. However, I do
not seem to have access to its columns in this routine ?

void fill_tree(const char *treeName, const char *fileName)
{ 
   ROOT::RDataFrame d1(10);
   int i(0);
   d1.Define("b1", [&i]() { return (double)i; })
      .Define("b2",
              [&i]() {
                 auto j = i * i;
                 ++i;
                 return j;
              })
      .Snapshot(treeName,fileName);

   auto result = d1.Count();

   auto d2 = d1;
   std::cout << "in fill_tree" << std::endl;
   std::vector<std::string> colNames = d2.GetColumnNames();
   for (auto &&name: colNames) {std::cout << name <<" ";} std::cout <<std::endl;
}

void mytest()
{
     // We prepare an input tree to run on
   auto fileName = "df001_introduction.root";
   auto treeName = "myTree";
   fill_tree(treeName, fileName);

   ROOT::RDataFrame d(treeName, fileName, {"b1"});

   std::cout << "in mytest" << std::endl;
   std::vector<std::string> colNames = d.GetColumnNames();
   for (auto &&name: colNames) {std::cout << name <<" ";} std::cout <<std::endl;
}

Hi Eddy,
d1 is the empty RDF. auto d2 = d1 copies that. You probably want

auto d2 = d1.Define(...).Define(...);
d2.Snapshot(...);

Cheers,
Enrico

Hi Enrico,

I guess at some point I will grasp this “forward node” thing.

Coming to the Cache-ing . It solves my issues with dangling references,
moving objects around. For the reader here a small example:

auto fill_tree()
{ 
   std::cout << "fill_tree: " << std::endl;

   ROOT::RDataFrame d1(4);

   int i(0);
   auto d2 = d1.Define("b1", [&i]() {
                 if (i<=0) std::cout << "  Lambda function" << std::endl;
                 return (double)i++; });

   auto d3 = d2.Cache();

   auto d4 = d3.Define("b2",
              [](double b1) {
                 auto j = b1 * b1;
                 return j;
              },{"b1"});
   double sum = 0;
   d4.Foreach([&sum](double b){ sum += b; },{"b1"});
   std::cout << "  sum: " << sum << std::endl;

   return d3;
}

void mytest()
{
   auto d1 = fill_tree();

   std::cout << "mytest: " << std::endl;

   auto d1_print = d1.Display();
   d1_print->Print();

   auto d2 = d1.Define("b2",
              [](double b1) {
                 auto j = b1 * b1;
                 return j;
              },{"b1"});
   double sum = 0;
   d2.Foreach([&sum](double b){ sum += b; },{"b1"});
   std::cout << "  sum: " << sum << std::endl;
}

The routine fill_tree creates a frame (the Lambda function prints a line when
invoked) and a reference is used.

The result is cached and passed to the main routine, where the lambda function
is not invoked anymore (and therefore the dangling reference is not an issue).

-Eddy