RDataFrame iterative snapshots

Dear all,
I would like to ask what the best way is to use RDataFrame to perform iterative snapshots.

A basic example I currently have:

vector<TCut> _Selections = { ....  };
TChain *chain = ...(something already built beforehand);
ROOT::RDataFrame df(*chain);
int i = 0;
for (auto &&set : _Selections) {
    auto Cut    = TString(set);
    TString sel = TString("SELECTION") + to_string(i);
    // Define a boolean column holding the result of this selection
    auto df_with_selection_column = df.Define(sel.Data(), Cut.Data());
    // Keep only the entries that pass it
    auto df_filtered = df_with_selection_column.Filter([](bool wcut) { return wcut == true; }, {sel.Data()});
    TString filename = "file_" + to_string(i) + ".root";
    df_filtered.Snapshot("tuple", filename.Data());
    i++;
}

Written this way, the snapshots currently run sequentially, one after another.
Do you see any improvement one could make here?

Thanks
Renato


ROOT Version: 6.14/04
Platform: x86-slc6-gcc62-opt
Compiler: gcc62


Hi,
iterative as in “in a loop”? You can pass RSnapshotOptions with the fLazy flag set to true to the Snapshot call, so it does not trigger an event loop. And you can trigger an event loop that performs all of the Snapshots in one go at a later time.

Hope this helps,
Enrico

Hi @eguiraud,

So basically what I tried is:

bool last_loop = false;
i = 0;
for (auto &&set : _Selections) {
    if (i == _Selections.size() - 1) last_loop = true;
    auto Cut    = TString(set);
    TString sel = TString("SELECTION") + to_string(i);
    auto df_with_selection_column = df.Define(sel.Data(), Cut.Data());
    auto df_filtered = df_with_selection_column.Filter([](bool wcut) { return wcut == true; }, {sel.Data()});
    TString filename = "file_" + to_string(i) + ".root";
    ROOT::RDF::RSnapshotOptions opts;
    opts.fLazy = true;
    if (last_loop == true) {
        opts.fLazy = false;
    }
    df_filtered.Snapshot("tuple", filename.Data(), {}, opts);
    i++;
}

But it still does the snapshots in a sequential way.

I guess the point is that my selections are “kind of” exclusive, and I use that to split my initial sample.

So it’s more like:

1 DataFrame -> N Defines (for the selection expressions) -> N Snapshots of N Filters.

Since I want to write all those N Snapshots to the same file with different names, I am wondering whether this is something RDataFrame does not support.

In other words, I have 10 Snapshots from 10 different Filters, and those Filters are not chained one after another but are all applied to the initial DataFrame.
What I was wondering is whether the 10 different, independent "Define + Filter + Snapshot" chains on the original DataFrame can be done in a single event loop, or whether I have to do them sequentially anyway.

You can write 10 TTrees at the same time (i.e. in the same loop over your input).
If you write them to the same TFile, though, you will have problems. The first problem is that each Snapshot recreates the TFile; you can change that behaviour to update the TFile incrementally with RSnapshotOptions. But I’m still quite sure ROOT does not support several TFiles writing to the same file on disk in an interleaved fashion.

You would be much better off writing 10 TFiles and merging them with hadd afterwards.

EDIT: but wait, your code is writing 10 different files. What do you mean by “it does the snapshot in a sequential way”? It should do them all in one event loop.

Hi @eguiraud, yes, that’s what I am currently trying to do. The issue is that I am not 100% sure how to restructure the loop to achieve what you were saying.

What I did before launching the loop is, in pseudo-code:

auto allColumns = df.GetColumnNames();
for (auto &sel : Selections) {
    // ---- change name to selection alias ...
    df.Define(sel_i, selection_i).Filter( .... ).Snapshot(TreeI, FileI, allColumns, Opts.Lazy=true);
}
// how can I trigger all the Snapshots here?

For example, you can trigger the event loop with *df.Count().
But I’m not sure what’s wrong with the snippet in your post number 3. That should work.

I tried what you suggested, but I am getting:


 *** Break *** segmentation violation
[/usr/lib/system/libsystem_platform.dylib] _sigtramp (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[/Users/lpnhe/root/build_root/lib/libCling.so] cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) (no debug info)
[/Users/lpnhe/root/build_root/lib/libCling.so] cling::Interpreter::EvaluateInternal(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) (no debug info)
[/Users/lpnhe/root/build_root/lib/libCling.so] TCling::Calc(char const*, TInterpreter::EErrorCode*) /Users/lpnhe/root/core/metacling/src/TCling.cxx:3194
[/Users/lpnhe/root/build_root/lib/libROOTDataFrame.so] ROOT::Detail::RDF::RLoopManager::BuildJittedNodes() /Users/lpnhe/root/tree/dataframe/src/RLoopManager.cxx:423
[/Users/lpnhe/root/build_root/lib/libROOTDataFrame.so] ROOT::Detail::RDF::RLoopManager::Run() /Users/lpnhe/root/tree/dataframe/src/RLoopManager.cxx:459
[/Users/lpnhe/Desktop/RKstar/ewp-rkstz/analysis/./build/Darwin/sameTuple_differentCuts.out] main /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4467
[/usr/lib/system/libdyld.dylib] start (no debug info)

OK, multiple lazy Snapshots in a loop are harder than I thought.

  • as you noticed, passing {} to Snapshot stores no columns. What you want is "" or ".*" to indicate all columns
  • you need to keep around the RResultPtr returned by Snapshot until the event loop, otherwise the action you booked will be forgotten. This is a bit nasty: in general it’s what you want, but probably not for Snapshot. I will look into whether we can change this behavior
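
To make the second point concrete, here is a toy model in plain C++ of why dropping the returned handle makes a booked action disappear. This is not ROOT code and not a description of RDataFrame internals: `ToyLoop`, `ToyAction`, `Book` and `Run` are hypothetical names invented for this sketch, and the only assumption is the behavior described above (an action is forgotten once nothing holds its result handle).

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a booked action (NOT a ROOT class): it just counts
// how many entries it has been shown.
struct ToyAction {
   std::size_t nSeen = 0;
};

// Toy stand-in for a loop manager (NOT a ROOT class): it only observes
// booked actions through weak_ptr, so it never keeps them alive itself.
class ToyLoop {
   std::vector<std::weak_ptr<ToyAction>> fBooked;

public:
   // Book an action. The caller owns the returned handle; if the handle is
   // dropped before Run(), the action no longer exists.
   std::shared_ptr<ToyAction> Book()
   {
      auto action = std::make_shared<ToyAction>();
      fBooked.push_back(action);
      return action;
   }

   // One pass over nEntries entries, feeding every action that is still
   // alive. Returns how many actions actually took part in the loop.
   std::size_t Run(std::size_t nEntries)
   {
      std::size_t nAlive = 0;
      for (auto &weak : fBooked) {
         if (auto action = weak.lock()) {
            action->nSeen += nEntries;
            ++nAlive;
         }
      }
      return nAlive;
   }
};
```

In this toy, calling `loop.Book();` in a loop and discarding the return value leaves `Run` with nothing to do, while pushing each returned handle into a `std::vector` first makes every booked action take part in the single pass. That mirrors why the example below stores the Snapshot results in `rets`.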

Minimal working example:

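#include <ROOT/RDataFrame.hxx>
#include <iostream>
#include <string>
#include <vector>

int main()
{
   // Prints once, when the first entry is processed, to show how many event loops run
   auto logStart = [](ULong64_t e) { if (e == 0) std::cout << "event loop started!!" << std::endl; return true; };
   auto df = ROOT::RDataFrame(10).Filter(logStart, {"rdfentry_"}).Define("x", [] { return 42; });

   ROOT::RDF::RSnapshotOptions opts;
   opts.fLazy = true; // book the Snapshot without triggering an event loop
   using SnapRet_t = ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>>;
   // Keep the returned RResultPtrs alive until the event loop has run
   std::vector<SnapRet_t> rets;
   for (auto i = 0; i < 5; ++i)
      rets.emplace_back(df.Snapshot("t", "f" + std::to_string(i) + ".root", ".*", opts));

   // Trigger the single event loop that performs all booked Snapshots
   *df.Count();

   return 0;
}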

Cheers,
Enrico

Thanks a lot!
I combined this with a “recursive-define” + Filter per loop iteration and it looks like it’s working.
Still testing a bit, but the different files are successfully written in parallel.

For the record: on Lxplus, this approach is 50 times faster than the commonly used routine, tree-copyTuple(cut) for a set of cuts (and it scales impressively with the number of selections to apply and with the number of events). On top of that, the associated condor job can be submitted to a faster queue. Overall, a factor of 100 speedup w.r.t. an old-style workflow.
