Hii. I was trying to read some events with RDataFrame. The following is my code. It is taking too much time. I mean, if i do the same with TChain, that is much faster. I am making a mistake but can’t figure out where exactly?
ROOT::EnableImplicitMT();
ROOT::RDataFrame d("events", "out.root");
auto entries = d.Filter("np>0").Count();
auto max = d.Filter("np>0").Max("np");
int n = *max;
// function to filter centrality
auto iscent = [](float x) { return x < 3.12; };
// vector of events to loop over
std::vector<int> idv(*entries);
std::iota(idv.begin(), idv.end(), 0);
// function to filter multiplicity
auto ismult = [](int x) { return x > 0.; };
// function to check if status is 1
auto istatus = [](ROOT::RVecI &x) { return x == 0; };
int mult;
ROOT::RVec<int> stat;
ROOT::RVecF pxvec(n), pyvec(n), pzvec(n), evec(n);
float imp;
TVector3 part;
TH1 *h = new TH1F("h", "h", 100, 0, 5e3);
for (const auto &i : idv) {
imp = d.Take<float>("bim").GetValue()[i];
if (!iscent(imp))
continue;
mult = d.Take<int>("np").GetValue()[i];
if (!ismult(mult))
continue;
for (int j = 0; j < mult; ++j) {
stat = d.Take<ROOT::RVecI>("ist").GetValue()[j];
if (stat.at(j) != 0)
continue;
pxvec = d.Take<ROOT::RVecF>("px").GetValue().at(j);
pyvec = d.Take<ROOT::RVecF>("py").GetValue().at(j);
pzvec = d.Take<ROOT::RVecF>("pz").GetValue().at(j);
evec = d.Take<ROOT::RVecF>("e").GetValue().at(j);
if (pxvec.at(j) == 0 || pyvec.at(j) == 0 || pzvec.at(j) == 0)
continue;
part.SetXYZ(pxvec.at(j), pyvec.at(j), pzvec.at(j));
cout << "px: " << pxvec.at(j) << " py: " << pyvec.at(j)
<< " pz: " << pzvec.at(j) << " eta: " << part.Eta()
<< " pt: " << part.Pt() << endl;
}
}
}
calling GetValue immediately after booking the action forces RDataFrame to immediately start an event loop. So you are performing 5 event loops for each iteration of the for loop!!!
You should book all desired actions first, use the results later. In the future by activating RDF logs you should be able to diagnose these kind of issues.
Hello @eguiraud . I hope you are good? Thank you so much for the reply. So, I should first do the filters and then store values for whatever calculation i need to perform? For filters, i tried to:
// function to filter multiplicity
auto ismult = [](int x) { return x > 0.; };
// function to check if status is 1
auto select_final_had = [](const RVecI &status) { return status == 1; };
// function to check if px and py are >0
auto select_good_tracks = [](const RVecF &px, const RVecF &py) {
return (px > 0.f) && (py > 0.f);
};
auto df = d.Filter(iscent, {"bim"})
.Filter(ismult, {"np"})
.Filter(select_good_tracks, {"px", "py"})
.Filter(select_final_had, {"ist"});
it doesn’t work. because px py ist are from track loop, i get it. but then how should i put a cut on them, and store the rest of values. I am definitely missing something here.
yes there are a few things that look like they can be optimized.
Here:
auto df = d.Filter(iscent, {"impact"}).Filter(ismult, {"np"});
auto entries = df.Filter("np>0").Count();
auto max = df.Filter("np>0").Max("np");
int n = *max; // event loop for Count and Max runs here
You are running one event loop just for the Count+Max calculation, but you don’t actually use the result of the Max and you don’t actually need the result of the Count. The Count is used here:
// vector of events to loop over
std::vector<int> idv(*entries);
std::iota(idv.begin(), idv.end(), 0);
but rather than allocating a potentially very large vector that just counts from 0 to nEntries you should instead simply use a counter, so below:
auto trackid = df.Take<RVecI>("particleid");
for (ULong64_t i = 0; i < trackid->size(); ++i) {
Finally, you are still running one event loop per iteration of the for loop, which is terribly slow, because the multi array needs to be re-evaluated for every loop iteration:
for (const auto &i : idv) {
auto multi = df.Take<int>("np"); // this registers a new RDF operation for every loop iteration!
std::vector<int> idv2(multi.GetValue()[i]); // so you run a full event loop every time at this line!
You should try to rearrange the logic so that you first book all the Takes you need and then access the results, so RDF can produce them all in one go.
You can check how many event loops RDF is running with df.GetNRuns() or by activating the RDF logs (see here).