RDataFrame .Count() and .Report() re-looping over whole DataFrame?

Hi,
I am running the very short macro below to retrieve the number of events passing the .Filter().Define() commands from a large TNtupleD with many columns:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdio>
#include "TStopwatch.h"
#include "ROOT/RDataFrame.hxx"

int ntuple2pos_rdf_report(){
  
  TStopwatch timer;
  timer.Start();

  std::ofstream partview;
  partview.open( "toto.pos" );

  
  ROOT::EnableImplicitMT();
  
  // Input rootfile (24 GB) and TNtupleD names
  auto fileName = "23Mg1p_N100000_R5.0_E0.25_MC10_NBNN_P1_T300.0_Euler_NBodyNN_He_1.000000.root";
  auto treeName = "ParticleData";

  // RDataFrame from tree with default variable name for the time
  ROOT::RDataFrame rdf(treeName, fileName, {"t"});

  
  // Sets cut and new variable definition
  auto rdf2 = rdf.Filter("t > -10 && t < 10", "Cut") // <- no column name specified here, "t" taken as default!
                 .Define("v", "sqrt(vx*vx+vy*vy+vz*vz)");

  // Loops over rows and store in ASCII file
  rdf2.Foreach( [&partview] (double xi, double yi, double zi, double vi)
                 { partview << "SP(" << xi << ", " << yi << ", " << zi << "){" << vi << "};" << std::endl;},
                 {"x", "y", "z", "v"} );
    

  // auto nevts = rdf2.Count(); std::cout << "Particles left : " << *nevts << std::endl;
  
  // auto allCutsReport = rdf.Report();
  // // We can now loop on the cuts
  // std::cout << "Name\tAll\tPass\tEfficiency" << std::endl;
  // for (auto &&cutInfo : allCutsReport) {
  //    std::cout << cutInfo.GetName() << "\t" << cutInfo.GetAll() << "\t" << cutInfo.GetPass() << "\t"
  //              << cutInfo.GetEff() << " %" << std::endl;
  //    auto nevts = cutInfo.GetPass();
  //    partview << "Particles left : " << nevts << std::endl;
  // }

  partview.close();

 
  timer.Stop();
  printf("RT = %7.3f s     CPU = %7.3f s\n", timer.RealTime(), timer.CpuTime());

  return 0;
}


When I uncomment either the .Count() line or the .Report() block that retrieves the cut information, the computing time doubles.
I would have expected the .Foreach() call to store this information somewhere in the RDataFrame structure, since it has already processed the whole dataset and applied the filter/cut.
Perhaps I am giving the commands in a somewhat odd order?

Thanks for any help.



ROOT Version: 6.24
Platform: Linux Ubuntu 20.04.3 LTS
Compiler: gcc 9.3.0


Hi @quemener ,
Foreach is a bit special in that it is an “instant action”: unlike most other RDF actions, it does not return anything, so it is executed on the spot, at the point where you call it (it is not lazy).

Then, after the Foreach event loop has run, you ask for a Count() and print it. At that point RDF has to run another event loop, because it did not know it had to produce a Count until then.

Then you ask for a Report and print it, same story.

This will require only one event loop instead:

auto nevts = rdf2.Count();
auto allCutsReport = rdf.Report();
rdf2.Foreach(...); // loop runs here and also produces the count and the report
std::cout << "Particles left: " << *nevts << '\n';
for (auto &&cutInfo : allCutsReport) {
  ...
}

In other words, book everything first, then trigger the event loop (with a Foreach, in this case).
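
To illustrate why the order matters, here is a toy model in plain C++ with no ROOT involved. It is only a sketch of the idea, not how RDataFrame is implemented: ToyFrame, CountResult, LoopStats, inTimeWindow and the two helper functions are hypothetical names made up for this example. A booked Count is filled by the next loop that runs, and reading a result that has not been filled yet forces a new loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy model of lazy vs. instant actions; NOT ROOT's implementation.
struct CountResult {
    std::size_t value = 0;  // rows that passed the cut
    bool ready = false;     // set once a loop has filled `value`
};

class ToyFrame {
public:
    ToyFrame(std::vector<double> rows, std::function<bool(double)> cut)
        : rows_(std::move(rows)), cut_(std::move(cut)) {}

    // Lazy action: books a counter, runs nothing yet.
    std::shared_ptr<CountResult> Count() {
        auto result = std::make_shared<CountResult>();
        booked_.push_back(result);
        return result;
    }

    // Instant action: runs the loop now and fills every counter booked so far.
    void Foreach(const std::function<void(double)>& fn) { RunLoop(&fn); }

    // Reading a result that no loop has filled yet triggers a new loop.
    std::size_t Get(const std::shared_ptr<CountResult>& result) {
        assert(result != nullptr);
        if (!result->ready) RunLoop(nullptr);
        return result->value;
    }

    int LoopsRun() const { return loops_; }

private:
    void RunLoop(const std::function<void(double)>* fn) {
        ++loops_;
        for (double row : rows_) {
            if (!cut_(row)) continue;        // apply the cut
            if (fn) (*fn)(row);              // user callback, if any
            for (auto& result : booked_) ++result->value;
        }
        for (auto& result : booked_) result->ready = true;
        booked_.clear();
    }

    std::vector<double> rows_;
    std::function<bool(double)> cut_;
    std::vector<std::shared_ptr<CountResult>> booked_;
    int loops_ = 0;
};

struct LoopStats {
    int loops;           // how many times the rows were traversed
    std::size_t passed;  // rows that passed the cut
};

bool inTimeWindow(double t) { return t > -10 && t < 10; }

// Count booked after the instant action: a second loop is needed.
LoopStats bookedAfterForeach() {
    ToyFrame frame({-20.0, -5.0, 0.0, 5.0, 20.0}, inTimeWindow);
    double sum = 0.0;
    frame.Foreach([&sum](double t) { sum += t; });  // loop 1
    auto count = frame.Count();                     // booked too late
    std::size_t passed = frame.Get(count);          // loop 2
    return {frame.LoopsRun(), passed};
}

// Count booked first: the single Foreach loop fills it as well.
LoopStats bookedBeforeForeach() {
    ToyFrame frame({-20.0, -5.0, 0.0, 5.0, 20.0}, inTimeWindow);
    auto count = frame.Count();                     // booked, nothing runs
    double sum = 0.0;
    frame.Foreach([&sum](double t) { sum += t; });  // loop 1, fills count
    std::size_t passed = frame.Get(count);          // already filled
    return {frame.LoopsRun(), passed};
}
```

In this toy, bookedAfterForeach() traverses the rows twice while bookedBeforeForeach() traverses them once, and both count the same three rows inside the time window.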

I hope this helps.
Cheers,
Enrico

Many thanks, it cuts down the time!
