RDataframe event and track loop

Hii. I was trying to read some events with RDataFrame. The following is my code. It is taking too much time. I mean, if i do the same with TChain, that is much faster. I am making a mistake but can’t figure out where exactly?

 ROOT::EnableImplicitMT();
  ROOT::RDataFrame d("events", "out.root");
  auto entries = d.Filter("np>0").Count();
  auto max = d.Filter("np>0").Max("np");
  int n = *max;
  // function to filter centrality
  auto iscent = [](float x) { return x < 3.12; };

  // vector of events to loop over
  std::vector<int> idv(*entries);
  std::iota(idv.begin(), idv.end(), 0);

  // function to filter multiplicity
  auto ismult = [](int x) { return x > 0.; };

  // function to check if status is 1
  auto istatus = [](ROOT::RVecI &x) { return x == 0; };
  int mult;
  ROOT::RVec<int> stat;
  ROOT::RVecF pxvec(n), pyvec(n), pzvec(n), evec(n);
  float imp;
  TVector3 part;
  TH1 *h = new TH1F("h", "h", 100, 0, 5e3);
  for (const auto &i : idv) {
    imp = d.Take<float>("bim").GetValue()[i];
    if (!iscent(imp))
      continue;
    mult = d.Take<int>("np").GetValue()[i];
    if (!ismult(mult))
      continue;

    for (int j = 0; j < mult; ++j) {
      stat = d.Take<ROOT::RVecI>("ist").GetValue()[j];
 

        if (stat.at(j) != 0)
             continue;
           pxvec = d.Take<ROOT::RVecF>("px").GetValue().at(j);
           pyvec = d.Take<ROOT::RVecF>("py").GetValue().at(j);
           pzvec = d.Take<ROOT::RVecF>("pz").GetValue().at(j);
           evec = d.Take<ROOT::RVecF>("e").GetValue().at(j);

           if (pxvec.at(j) == 0 || pyvec.at(j) == 0 || pzvec.at(j) == 0)
             continue;
           part.SetXYZ(pxvec.at(j), pyvec.at(j), pzvec.at(j));
           cout << "px: " << pxvec.at(j) << " py: " << pyvec.at(j)
                << " pz: " << pzvec.at(j) << " eta: " << part.Eta()
                << " pt: " << part.Pt() << endl; 
    }


  }
}

I guess @eguiraud or @vpadulan can help.

1 Like

Hi @Marooz_Malik ,

calling GetValue immediately after booking the action forces RDataFrame to immediately start an event loop. So you are performing 5 event loops for each iteration of the for loop!!!

You should book all desired actions first, use the results later. In the future by activating RDF logs you should be able to diagnose these kind of issues.

Cheers,
Enrico

1 Like

Hello @eguiraud . I hope you are good? Thank you so much for the reply. So, I should first do the filters and then store values for whatever calculation i need to perform? For filters, i tried to:

  // function to filter multiplicity
  auto ismult = [](int x) { return x > 0.; };

  // function to check if status is 1
  auto select_final_had = [](const RVecI &status) { return status == 1; };

  // function to check if px and py are >0
  auto select_good_tracks = [](const RVecF &px, const RVecF &py) {
    return (px > 0.f) && (py > 0.f);
  };
  auto df = d.Filter(iscent, {"bim"})
                .Filter(ismult, {"np"})
                .Filter(select_good_tracks, {"px", "py"})
                .Filter(select_final_had, {"ist"});

it doesn’t work. because px py ist are from track loop, i get it. but then how should i put a cut on them, and store the rest of values. I am definitely missing something here.

Not quite. Calling GetValue for a result that has not been produced yet forces RDataFrame to start an event loop.

So if you do this:

pxvec = d.Take<ROOT::RVecF>("px").GetValue().at(j);
pyvec = d.Take<ROOT::RVecF>("py").GetValue().at(j);

RDF is forced to run 2 event loops.

You should refactor your code so that you do all the Take calls first, and only afterwards you do all the GetValue calls.

Cheers,
Enrico

Thank you so much. I wrote something. Can you have a look at it? i mean, can this still be optimised?

  ROOT::EnableImplicitMT();
  ROOT::RDataFrame d("tree", "input.root");
  TFile f("out.root", "RECREATE");

  // function to filter centrality
  auto iscent = [](float x) { return x < 3; };

  // function to filter charged tracks
  auto ischarged = [](int x) {
    int chrged_had_id[] = {120, 130, 240, 340, 150, 121, 131, 241, 341};
    for (int i = 0; i < 9; ++i) {
      if (x == chrged_had_id[i])
        return true;
    }
    return false;
  };

  auto df = d.Filter(iscent, {"impact"}).Filter(ismult, {"np"});
  auto entries = df.Filter("np>0").Count();
  auto max = df.Filter("np>0").Max("np");
  int n = *max;

  // vector of events to loop over
  std::vector<int> idv(*entries);
  std::iota(idv.begin(), idv.end(), 0);

  // vector of Mbins and ptbins
  std::iota(mbinvec.begin(), mbinvec.end(), 0);
  std::iota(nptbinsvec.begin(), nptbinsvec.end(), 0);

  TLorentzVector part;
  auto evimpact = df.Take<float>("impact");
  auto trackpx = df.Take<RVecF>("px");
  auto trackpy = df.Take<RVecF>("py");
  auto trackpz = df.Take<RVecF>("pz");
  auto tracke = df.Take<RVecF>("energy");
  auto trackid = df.Take<RVecI>("particleid");

  for (const auto &i : idv) {
    auto multi = df.Take<int>("np");
    std::vector<int> idv2(multi.GetValue()[i]);
    hmultiEv->Fill(idv2.size());
    std::iota(idv2.begin(), idv2.end(), 0);
    reset();
    for (const auto &j : idv2) {
      if (!ischarged(trackid.GetValue()[i][j]))
        continue;
      hpid->Fill(trackid.GetValue()[i][j]);
      part.SetPxPyPzE(trackpx.GetValue()[i][j], trackpy.GetValue()[i][j],
                      trackpz.GetValue()[i][j], tracke.GetValue()[i][j]);
      double trackpt = part.Pt();
      double tracketa = part.Eta();
      double trackphi = part.Phi() + TMath::Pi();
      hphi->Fill(trackphi);
      heta->Fill(tracketa);
      ptnocount.clear();
 
    }
    if (i % 1000 == 0)
      std::cout << "Event " << i << " done" << std::endl;
  }
  f.Write();
  f.Close();

Hi @Marooz_Malik ,

yes there are a few things that look like they can be optimized.

Here:

  auto df = d.Filter(iscent, {"impact"}).Filter(ismult, {"np"});
  auto entries = df.Filter("np>0").Count();
  auto max = df.Filter("np>0").Max("np");
  int n = *max; // event loop for Count and Max runs here

You are running one event loop just for the Count+Max calculation, but you don’t actually use the result of the Max and you don’t actually need the result of the Count. The Count is used here:

  // vector of events to loop over
  std::vector<int> idv(*entries);
  std::iota(idv.begin(), idv.end(), 0);

but rather than allocating a potentially very large vector that just counts from 0 to nEntries you should instead simply use a counter, so below:

  auto trackid = df.Take<RVecI>("particleid");

  for (ULong64_t i = 0; i < trackid->size(); ++i) {

Finally, you are still running one event loop per iteration of the for loop, which is terribly slow, because the multi array needs to be re-evaluated for every loop iteration:

  for (const auto &i : idv) {
    auto multi = df.Take<int>("np"); // this registers a new RDF operation for every loop iteration!
    std::vector<int> idv2(multi.GetValue()[i]); // so you run a full event loop every time at this line!

You should try to rearrange the logic so that you first book all the Takes you need and then access the results, so RDF can produce them all in one go.

You can check how many event loops RDF is running with df.GetNRuns() or by activating the RDF logs (see here).

Cheers,
Enrico

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.