Chaining RDataFrame::Define does not work in some cases

Hi,

I was experimenting with RDataFrame and it seems that it can’t process multiple Defines at once. Whenever I chain multiple Defines and then save a snapshot, some branches are not in the resulting file.

For example (I’m trying to perform feature engineering in ROOT):

original_df = ROOT::RDF::MakeRootDataFrame("data", "source.root");
// Basic statistics..
df = original_df.Define(...);
//
// performs feature computing ...
//...
//
// batch_maxs, etc. are vectors accessible by cling
df = df.Define("batch_mean", "batch_means[batch]").
    Define("batch_maxs", "batch_maxs[batch]").
    Define("batch_mins", "batch_mins[batch]").
    Define("batch_stds", "batch_stds[batch]");

df.GetColumnNames()
// { "time", "index", "batch", "batch_index", "time", "signal", "open_channels" }
// columns like "batch_stds" are missing

Here’s the full source code.

Am I doing something wrong? Thanks



ROOT Version: 6.20/01
Platform: Linux
Compiler: GCC 9.2.0


Hi Martin,
could you also share a couple of entries, so I can actually run the code?

My guess without running it: cling’s brand new variable shadowing is not working quite as expected in that notebook. Can you try to change the code to:

auto original_df = ROOT::RDF::MakeRootDataFrame("data", "source.root");
// Basic statistics..
auto df = original_df.Define(...);
//
// performs feature computing ...
//...
//
// batch_maxs, etc. are vectors accessible by cling
auto df2 = df.Define("batch_mean", "batch_means[batch]").
    Define("batch_maxs", "batch_maxs[batch]").
    Define("batch_mins", "batch_mins[batch]").
    Define("batch_stds", "batch_stds[batch]");

df2.GetColumnNames()

Note the autos to clearly indicate the definition of new variables and the usage of different names for different dataframe objects.

Cheers,
Enrico

Thanks for the quick reply! The data is available on Kaggle. I have also uploaded the converted .root file

I think something is wrong. Changing to auto df2... causes a segfault.

Thanks! I’ll get back to you as soon as possible.

Hi @marty1885,
there are a few things that do not look OK in the code you shared:

  • the last Defines use column names that have never been defined: batch_maxs, batch_mins and batch_stds are not columns in the dataset
  • the second Foreach is missing the last argument, the columns it should apply the function to
  • you should not use ROOT::RDF::MakeRootDataFrame but simply construct a df object: ROOT::RDataFrame(...). This one is completely our fault for the lack of documentation and must be fixed. It is now a bug report
  • in a number of places you define variables à la Python (e.g. var = value). This is a cling extension, and proper C++ would require at least an auto var = value to distinguish definition from assignment. I am not 100% sure this works as intended in 100% of cases

To find all errors I made the notebook a full-blown program and compiled it (with g++ -o multiple_rdfs multiple_rdfs.cxx $(root-config --libs --cflags) as usual). The following code works as expected as far as I can tell – you can work backwards and find what was the exact culprit in your original notebook if you want:

#include <ROOT/RDataFrame.hxx>
#include <algorithm> // std::max, std::min
#include <cmath>     // std::pow
#include <iostream>
#include <vector>
using namespace std;

int main()
{
   auto train_df = ROOT::RDataFrame("data", "train.root");
   
   train_df.GetColumnNames();
   
   auto df = train_df
       .Define("index", [](double time)->int{return int(time*10000)-1;}, {"time"})
       .Define("batch", [](int index)->int{return index/50000;}, {"index"})
       .Define("batch_index", [](int index, int batch)->int{return index - batch*50000;}, {"index", "batch"});
   
   auto n_batches = int(*df.Max("batch"))+1;
   const auto n_sample_per_batch = 50000;
   
   // HACK: storing intermediate values in variables outside the dataframe
   auto batch_means = vector<double>(n_batches);
   auto batch_maxs = vector<double>(n_batches, -1000);
   auto batch_mins = vector<double>(n_batches, 1000);
   auto batch_stds = vector<double>(n_batches);
   
   df.Foreach([&](int batch, double value){
       batch_means[batch] += value;
       batch_maxs[batch] = max(batch_maxs[batch], value);
       batch_mins[batch] = min(batch_mins[batch], value);
   }, {"batch", "signal"});
   
   for(auto& val : batch_means)
       val /= n_sample_per_batch;
   
   // With the mean available, we can now compute the stddev
   df.Foreach([&](int batch, double value){
       batch_stds[batch] += pow(value - batch_means[batch], 2);
   }, {"batch", "signal"});
   for(auto& val : batch_stds)
       val /= n_sample_per_batch;
   
   auto df_with_features =
      df.Define("batch_mean", [&] (int batch) { return batch_means[batch];}, {"batch"}).
         Define("batch_maxs", [&] (int batch) { return batch_maxs[batch];}, {"batch"}).
         Define("batch_mins", [&] (int batch) { return batch_mins[batch];}, {"batch"}).
         Define("batch_stds", [&] (int batch) { return batch_stds[batch];}, {"batch"});
   
   for (auto &c : df_with_features.GetColumnNames())
      std::cout << c << std::endl;
   
   df_with_features.Snapshot("data", "data_with_featutres.root");

   return 0;
}

Cheers,
Enrico
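For readers who want to check the per-batch statistics independently of ROOT, here is a minimal sketch of the same two-pass idea used in the program above: one pass for sums and extrema, a second pass for squared deviations from the mean. The names BatchStats and compute_batch_stats are hypothetical helpers, not ROOT API, and the sketch assumes every batch index is in the range [0, n_batches). Two differences from the program above are deliberate: it counts the entries per batch instead of assuming a fixed 50000, and it takes the square root at the end, whereas the program above stores the mean squared deviation (the variance) under the name batch_stds.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper type for illustration only; not part of ROOT.
struct BatchStats {
    std::vector<double> mean, max, min, stddev;
};

// batch[i] is the batch index of entry i, signal[i] is its value.
// Assumes batch.size() == signal.size() and 0 <= batch[i] < n_batches.
BatchStats compute_batch_stats(const std::vector<int>& batch,
                               const std::vector<double>& signal,
                               std::size_t n_batches)
{
    BatchStats s;
    s.mean.assign(n_batches, 0.0);
    s.max.assign(n_batches, std::numeric_limits<double>::lowest());
    s.min.assign(n_batches, std::numeric_limits<double>::max());
    s.stddev.assign(n_batches, 0.0);
    std::vector<std::size_t> count(n_batches, 0);

    // First pass: accumulate sums, extrema and entry counts per batch.
    for (std::size_t i = 0; i < signal.size(); ++i) {
        const auto b = static_cast<std::size_t>(batch[i]);
        s.mean[b] += signal[i];
        s.max[b] = std::max(s.max[b], signal[i]);
        s.min[b] = std::min(s.min[b], signal[i]);
        ++count[b];
    }
    for (std::size_t b = 0; b < n_batches; ++b)
        if (count[b] > 0)
            s.mean[b] /= static_cast<double>(count[b]);

    // Second pass: accumulate squared deviations from the batch mean.
    for (std::size_t i = 0; i < signal.size(); ++i) {
        const auto b = static_cast<std::size_t>(batch[i]);
        const double d = signal[i] - s.mean[b];
        s.stddev[b] += d * d;
    }
    // Population standard deviation: sqrt of the mean squared deviation.
    for (std::size_t b = 0; b < n_batches; ++b)
        if (count[b] > 0)
            s.stddev[b] = std::sqrt(s.stddev[b] / static_cast<double>(count[b]));

    return s;
}
```

Calling compute_batch_stats on a handful of entries copied from the tree gives reference values to compare against the batch_mean, batch_maxs, batch_mins and batch_stds columns in the snapshot.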

Ahh, thanks.

It seems the notebook kernel is failing silently and shows no error message even though there’s a syntax error.

I used this feature to avoid needing to restart the entire notebook to run a cell again, since assignment becomes declaration if the variable does not exist. Is standard C++ style code encouraged under ROOT 6.20, or is CINT/cling style code fine? Could you point me to somewhere describing the behavior of 6.20’s variable shadowing? Does it release the shadowed object, etc. (so I can minimize memory leaks)?

I’m not sure if this is intended behavior. From experiments I deduced that Defines can see global variables that aren’t columns. So I built the vectors outside and let Define read them back.

Yes, indeed, the underlying issue is that the Jupyter notebook does not report compilation errors faithfully. Maybe @etejedor can comment on why that is.

I listed the changes I made to make it compile as a standard C++ program and to make it work properly. It is indeed probable that several of my changes are not actually needed when running through ROOT’s interpreter and/or in a Jupyter notebook. You can try to find which changes are actually needed in the notebook; it should be quick trial and error. It could be that the only serious issue was MakeRootDataFrame.

As per your other questions:

  • I do not have experience using C++ inside jupyter notebooks and therefore I am not 100% sure of what non-standard C++ works just fine and what might cause issues. When in doubt, I always fall back to compiled C++ to get proper errors, at least as a way of debugging. But that’s just me :smile:
  • if I remember correctly, variable shadowing is really just that, a shadowing of the variable name, and the shadowed object stays in memory. @Axel might be able to point you to the appropriate docs
  • indeed the interpreter might be able to access global variables when just-in-time compiling your Define expressions. That was most probably not what caused the weirdness you saw

I hope I have clarified most things. Let us know if you encounter any other problem.
Cheers,
Enrico
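As a rough analogy for the point about shadowing, here is a minimal sketch in standard C++ only. It does not demonstrate cling's redefinition extension, whose exact behavior is not confirmed in this thread. It only shows that in ordinary C++, shadowing a name in a nested scope hides the outer object without destroying it. Tracked, g_live and live_while_shadowed are hypothetical names made up for this illustration.

```cpp
#include <string>
#include <utility>

// Counts how many Tracked objects are currently alive.
int g_live = 0;

// Hypothetical stand-in for a heavy object such as a dataframe; not a ROOT type.
struct Tracked {
    std::string label;
    explicit Tracked(std::string l) : label(std::move(l)) { ++g_live; }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
    ~Tracked() { --g_live; }
};

// Returns the number of live objects while the outer name is shadowed.
int live_while_shadowed()
{
    int n = 0;
    Tracked df("first");       // definition of df
    {
        Tracked df("second");  // a new definition that shadows the outer df
        n = g_live;            // both objects are alive here
    }                          // only the inner df is destroyed at this point
    return n;
}
```

In this sketch the outer object is released only when its own scope ends. Whether and when cling releases an object shadowed at the notebook's top level is a separate question for the cling documentation.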

Hi,
The issue with the notebook not showing a complete error is being followed here:
https://sft.its.cern.ch/jira/browse/ROOT-10589
