Filters Implemented with RDataFrame taking too long

yburkard · November 16, 2022, 2:07pm

Dear Users,

I am creating columns on RDataFrame object df containing arrays of Lorentz-vectors (namely ROOT::VecOps::RVec<PxPyPzEVector> objects) and then applying filters on them, but this is taking very long (for an RDataFrame object of 100’000 columns it takes about 30 minutes to apply the filters). After doing some inspection I realised that the new columns are generated every time a filter is applied, which I suppose increases computation time significantly. My question is if there is something I can alter (perhaps in the way I am generating those branches) to reduce the running time.

The function used to generate the columns is:

ROOT.gInterpreter.Declare('''
using Vec_t = const ROOT::RVec<float>&;
using namespace ROOT::Math;
ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vecs(string variable, Vec_t Particle_PID, Vec_t Particle_E, Vec_t Particle_Px, Vec_t Particle_Py, Vec_t Particle_Pz, Vec_t Particle_Charge, Vec_t Particle_Status, Vec_t Particle_M1, Vec_t Particle_PT, Vec_t Particle_Eta) {
    
    ROOT::VecOps::RVec<PxPyPzEVector> TrueLep;
    ROOT::VecOps::RVec<PxPyPzEVector> TrueEle;
    ROOT::VecOps::RVec<PxPyPzEVector> TrueMuon;
    ROOT::VecOps::RVec<float> TrueLepPT;
    ROOT::VecOps::RVec<float> TrueElePT;
    ROOT::VecOps::RVec<float> TrueMuonPT;
    Long64_t nne = Particle_PID.size();
    
    //filling the vectors with 4-momentum and pT values
    for (int j=0; j<nne; j++){
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==11 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.47
               ){
                    TrueEle.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueLep.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueElePT.emplace_back(Particle_PT[j]);
                    TrueLepPT.emplace_back(Particle_PT[j]);
                    }
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==13 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.40
               ){
                    TrueMuon.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueLep.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueMuonPT.emplace_back(Particle_PT[j]);
                    TrueLepPT.emplace_back(Particle_PT[j]);
                    }
            }
                   
    //arranging all vectors in descending pT order
    auto TrueLep_sorted = Reverse(Take(TrueLep, Argsort(TrueLepPT)));
    auto TrueEle_sorted = Reverse(Take(TrueEle, Argsort(TrueElePT)));
    auto TrueMuon_sorted = Reverse(Take(TrueMuon, Argsort(TrueMuonPT))); 
    
    if(variable=="TrueLep"){return TrueLep_sorted;}
    if(variable=="TrueEle"){return TrueEle_sorted;}
    if(variable=="TrueMuon"){return TrueMuon_sorted;}
    else {ROOT::VecOps::RVec<PxPyPzEVector> Return; return Return;}
    
    }
''')

I then generate columns like:

df = df.Define('TrueLep', '''TrueLep_Vecs("TrueLep", Particle.PID, Particle.E, Particle.Px, Particle.Py, Particle.Pz, Particle.Charge, Particle.Status, Particle.M1, Particle.PT, Particle.Eta)''')

And finally apply filter like:

cut_pT = 'TrueLep[0].Pt()>25 && TrueLep[1].Pt()>20 && TrueLep[2].Pt()>15'
df = df.Filter(cut_pT)

Any help would be very much apperciated!

eguiraud · November 16, 2022, 2:32pm

Hi @yburkard ,

indeed there are some inefficiencies there.

Those Take and Reverse calls perform a lot of copies of the PxPyPzEVector objects. You can remove one copy by calling this instead: Take(TrueMuon, Reverse(Argsort(TrueMuonPT))).

TrueEle.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]))

also performs a copy, you probably want

TrueEle.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j])

instead.

you are performing these operations (creation, Reverse, Take) on TrueLep, TrueEle and TrueMuon, but it looks like in the end you throw away 2/3rds of the work and only return one of the 3. You can refactor the function so that, if you return TrueLep_sorted in the end, you only build TrueLep_sorted.
Code passed to ROOT.gInterpreter.Declare is compiled without optimizations until ROOT v6.26 and at optimization level O1 in v6.26, but here you probably want O2. The simplest way to compile that code with optimizations is to put it in a C++ source file, e.g. helpers.cpp, and then load it from the Python application with ROOT.gSystem.CompileMacro("helpers.cpp", "kO") (where the k option is to keep the result of the compilation so you don’t have to recompile every time, and O is to ask for optimized code)

If the code is still slow after these 4 optimizations, feel free to share a minimal self-contained reproducer (including an example input file) that we can run in order to investigate further.

Cheers,
Enrico

yburkard · November 16, 2022, 4:32pm

Many thanks for your quick reply @eguiraud! So I’ve implemented the first 3 optimisation and it reduced the execution time by about 14%. Now I wrote an additional file helper_functions.cpp simply including the definition of the function I had declared before, i.e. just with:

using Vec_t = const ROOT::RVec<float>&;
using namespace ROOT::Math;
ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vecs(string variable, Vec_t Particle_PID, Vec_t Particle_E, Vec_t Particle_Px, Vec_t Particle_Py, Vec_t Particle_Pz, Vec_t Particle_Charge, Vec_t Particle_Status, Vec_t Particle_M1, Vec_t Particle_PT, Vec_t Particle_Eta) {
    
    ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vec;
    ROOT::VecOps::RVec<float> TrueLepPT_Vec;
    Long64_t nne = Particle_PID.size();
    
    //filling the vectors with 4-momentum, charge and pT values
    for (int j=0; j<nne; j++){
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==11 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.47
               ){
                    if(variable=="TrueLep"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    if(variable=="TrueEle"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    }
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==13 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.40
               ){
                    if(variable=="TrueLep"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    if(variable=="TrueMuon"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    }
            }
                   
    //arranging all vectors in descending pT order
    auto TrueLep_Vec_sorted = Take(TrueLep_Vec, Reverse(Argsort(TrueLepPT_Vec)));
    
    return TrueLep_Vec_sorted;
    
    }

and I’ve added the line ROOT.gSystem.CompileMacro("helper_functions.cpp", "kO") as suggested, but this yields the following error:

In file included from input_line_9:6:
./helper_functions.cpp:7:29: error: no member named 'Math' in namespace 'ROOT'
float ComputeMT(T p1, ROOT::Math::PtEtaP...
                      ~~~~~~^
./helper_functions.cpp:15:27: error: no template named 'RVec' in namespace 'ROOT'
using Vec_t = const ROOT::RVec<float>&;
                    ~~~~~~^
[...]

This means that the Root objects are not being recognised; do I need to import or declare anything additional in the Macro?

eguiraud · November 16, 2022, 4:39pm

Reducing the amount of work done by that function by more than 2/3rds only reduced the total runtime by 14%? maybe that’s not the hotspot then. We can take a look at where the program spends its time with tools like perf.

You might have to add some #include, e.g. #include <ROOT/RVec.hxx> and #include <Math/Vector4D.h>.

RENATO_QUAGLIANI · November 16, 2022, 5:39pm

Just random suggestion, dont you better use const reference for all Vec_t arguments to avoid copies ?
Edit: i missed your First line or code where you do it.
On another point, maybe some extra reserve/resize of your vectors can help?

yburkard · November 16, 2022, 5:44pm

I’ve added the #include you suggested, and now I am having a problem with repeated function declarations (despite having deleted all of the declarations in the original code and doing it only in the helper_function.cpp macro). I am sending you a simple working code (without the macro implementation) as well as as a small example file, in case you want to have a look: LeptonCuts.py (6.7 KB)
exampleFile.root (951.0 KB)

eguiraud · November 17, 2022, 12:29pm

Hi @yburkard ,

thank you for the reproducer. The first important thing is that the following code performs 5 loops over the full dataset, one for each GetValue():

cut_2l = '(TrueEle.size()+TrueMuon.size()>=2)'
df = df.Filter(cut_2l)
print(df.Count().GetValue())

cut_pT = 'TrueLep[0].Pt()>15 && TrueLep[1].Pt()>10'
df = df.Filter(cut_pT)
print(df.Count().GetValue())

cut_charges = '(TrueEle.size()==2 && abs(TrueEleCharge[0]+TrueEleCharge[1])==0) || '+\
    '(TrueEle.size()==1 && abs(TrueMuonCharge[0])==1) || '+\
    '(TrueEle.size()==0 && abs(TrueMuonCharge[0]+TrueMuonCharge[1])==0)'
df = df.Filter(cut_charges)
print(df.Count().GetValue())

cut_mll = '(TrueLep[0]+TrueLep[1]).M()>12'
df = df.Filter(cut_mll)
print(df.Count().GetValue())

cut_btag = 'Jet.PT[Jet.PT>10 && Jet.BTag==1].size()==0'
df = df.Filter(cut_btag)
print(df.Count().GetValue())

Instead, you want to write your code so that you first register all operations to be performed, and then you access the results, see also the Introduction and the crash course in the user guide:

cut_2l = '(TrueEle.size()+TrueMuon.size()>=2)'
df = df.Filter(cut_2l)
c1 = df.Count()

cut_pT = 'TrueLep[0].Pt()>15 && TrueLep[1].Pt()>10'
df = df.Filter(cut_pT)
c2 = df.Count()

cut_charges = '(TrueEle.size()==2 && abs(TrueEleCharge[0]+TrueEleCharge[1])==0) || '+\
    '(TrueEle.size()==1 && abs(TrueMuonCharge[0])==1) || '+\
    '(TrueEle.size()==0 && abs(TrueMuonCharge[0]+TrueMuonCharge[1])==0)'
df = df.Filter(cut_charges)
c3 = df.Count()

cut_mll = '(TrueLep[0]+TrueLep[1]).M()>12'
df = df.Filter(cut_mll)
c4 = df.Count()

cut_btag = 'Jet.PT[Jet.PT>10 && Jet.BTag==1].size()==0'
df = df.Filter(cut_btag)
c5 = df.Count()

print([c.GetValue() for c in [c1, c2, c3, c4, c5]])
print("total execution time: ", time.time()-start_time)

With that fix, and CompileMacro instead of Declare, your program would look like this:

LeptonCuts.py (2.8 KB)
helpers.cpp (4.2 KB)

For this small input, runtime is completely dominated by setup time (in particular just-in-time compilation of the expressions that need to be executed during the RDF event loop). Running on a chain made of 1000 copies of that input file then gives me a processing rate of ~27 kEvts/s on a single thread on my laptop. The number will go up by adding ROOT.EnableImplicitMT() to run a multi-thread event loop (on my laptop, reading data from the filesystem cache, I got up to ~160 kEvts/s at 8 threads with 2k times the input file). If at this point things are still not fast enough we could do some simple performance profiling to check for hotspots.

Cheers,
Enrico

P.S.
you can get the actual event loop time directly from RDataFrame like it’s described in this section of the user guide. That way you can distinguish RDF setup time (constant w.r.t. the dataset size) from actual event loop time (which will scale roughly linearly with the number of events).

yburkard · November 24, 2022, 10:07am

Hi @eguiraud, apologies for the late reply. I’ve implemented the changes you suggested and they have reduced the execution time enormously (about 85%); what mainly contributed to this reduction was calling GetValue() only once – I was not aware that this command would loop over the full dataset; thank your very much for your help!

system · December 8, 2022, 10:07am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.