Filters Implemented with RDataFrame taking too long

Dear Users,

I am creating columns on RDataFrame object df containing arrays of Lorentz-vectors (namely ROOT::VecOps::RVec<PxPyPzEVector> objects) and then applying filters on them, but this is taking very long (for an RDataFrame object of 100’000 columns it takes about 30 minutes to apply the filters). After doing some inspection I realised that the new columns are generated every time a filter is applied, which I suppose increases computation time significantly. My question is if there is something I can alter (perhaps in the way I am generating those branches) to reduce the running time.

The function used to generate the columns is:

ROOT.gInterpreter.Declare('''
using Vec_t = const ROOT::RVec<float>&;
using namespace ROOT::Math;
ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vecs(string variable, Vec_t Particle_PID, Vec_t Particle_E, Vec_t Particle_Px, Vec_t Particle_Py, Vec_t Particle_Pz, Vec_t Particle_Charge, Vec_t Particle_Status, Vec_t Particle_M1, Vec_t Particle_PT, Vec_t Particle_Eta) {
    
    ROOT::VecOps::RVec<PxPyPzEVector> TrueLep;
    ROOT::VecOps::RVec<PxPyPzEVector> TrueEle;
    ROOT::VecOps::RVec<PxPyPzEVector> TrueMuon;
    ROOT::VecOps::RVec<float> TrueLepPT;
    ROOT::VecOps::RVec<float> TrueElePT;
    ROOT::VecOps::RVec<float> TrueMuonPT;
    Long64_t nne = Particle_PID.size();
    
    //filling the vectors with 4-momentum and pT values
    for (int j=0; j<nne; j++){
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==11 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.47
               ){
                    TrueEle.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueLep.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueElePT.emplace_back(Particle_PT[j]);
                    TrueLepPT.emplace_back(Particle_PT[j]);
                    }
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==13 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.40
               ){
                    TrueMuon.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueLep.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]));
                    TrueMuonPT.emplace_back(Particle_PT[j]);
                    TrueLepPT.emplace_back(Particle_PT[j]);
                    }
            }
                   
    //arranging all vectors in descending pT order
    auto TrueLep_sorted = Reverse(Take(TrueLep, Argsort(TrueLepPT)));
    auto TrueEle_sorted = Reverse(Take(TrueEle, Argsort(TrueElePT)));
    auto TrueMuon_sorted = Reverse(Take(TrueMuon, Argsort(TrueMuonPT))); 
    
    if(variable=="TrueLep"){return TrueLep_sorted;}
    if(variable=="TrueEle"){return TrueEle_sorted;}
    if(variable=="TrueMuon"){return TrueMuon_sorted;}
    else {ROOT::VecOps::RVec<PxPyPzEVector> Return; return Return;}
    
    }
''')

I then generate columns like:

df = df.Define('TrueLep', '''TrueLep_Vecs("TrueLep", Particle.PID, Particle.E, Particle.Px, Particle.Py, Particle.Pz, Particle.Charge, Particle.Status, Particle.M1, Particle.PT, Particle.Eta)''')

And finally apply filter like:

cut_pT = 'TrueLep[0].Pt()>25 && TrueLep[1].Pt()>20 && TrueLep[2].Pt()>15'
df = df.Filter(cut_pT)

Any help would be very much apperciated!

Hi @yburkard ,

indeed there are some inefficiencies there.

  1. Those Take and Reverse calls perform a lot of copies of the PxPyPzEVector objects. You can remove one copy by calling this instead: Take(TrueMuon, Reverse(Argsort(TrueMuonPT))).

TrueEle.emplace_back(PxPyPzEVector(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]))

also performs a copy, you probably want

TrueEle.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j])

instead.

  1. you are performing these operations (creation, Reverse, Take) on TrueLep, TrueEle and TrueMuon, but it looks like in the end you throw away 2/3rds of the work and only return one of the 3. You can refactor the function so that, if you return TrueLep_sorted in the end, you only build TrueLep_sorted.

  2. Code passed to ROOT.gInterpreter.Declare is compiled without optimizations until ROOT v6.26 and at optimization level O1 in v6.26, but here you probably want O2. The simplest way to compile that code with optimizations is to put it in a C++ source file, e.g. helpers.cpp, and then load it from the Python application with ROOT.gSystem.CompileMacro("helpers.cpp", "kO") (where the k option is to keep the result of the compilation so you don’t have to recompile every time, and O is to ask for optimized code)

If the code is still slow after these 4 optimizations, feel free to share a minimal self-contained reproducer (including an example input file) that we can run in order to investigate further.

Cheers,
Enrico

1 Like

Many thanks for your quick reply @eguiraud! So I’ve implemented the first 3 optimisation and it reduced the execution time by about 14%. Now I wrote an additional file helper_functions.cpp simply including the definition of the function I had declared before, i.e. just with:

using Vec_t = const ROOT::RVec<float>&;
using namespace ROOT::Math;
ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vecs(string variable, Vec_t Particle_PID, Vec_t Particle_E, Vec_t Particle_Px, Vec_t Particle_Py, Vec_t Particle_Pz, Vec_t Particle_Charge, Vec_t Particle_Status, Vec_t Particle_M1, Vec_t Particle_PT, Vec_t Particle_Eta) {
    
    ROOT::VecOps::RVec<PxPyPzEVector> TrueLep_Vec;
    ROOT::VecOps::RVec<float> TrueLepPT_Vec;
    Long64_t nne = Particle_PID.size();
    
    //filling the vectors with 4-momentum, charge and pT values
    for (int j=0; j<nne; j++){
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==11 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.47
               ){
                    if(variable=="TrueLep"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    if(variable=="TrueEle"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    }
            if(Particle_Status[j]==1 && fabs(Particle_PID[j])==13 && 
               (fabs(Particle_PID[Particle_M1[j]])==24 || fabs(Particle_PID[Particle_M1[j]])==23 || fabs(Particle_PID[Particle_M1[j]])==9900012 || fabs(Particle_PID[Particle_M1[j]])==15) &&
               Particle_PT[j]>10.0 
               && fabs(Particle_Eta[j])<2.40
               ){
                    if(variable=="TrueLep"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    if(variable=="TrueMuon"){TrueLep_Vec.emplace_back(Particle_Px[j], Particle_Py[j], Particle_Pz[j], Particle_E[j]); TrueLepPT_Vec.emplace_back(Particle_PT[j]);}
                    }
            }
                   
    //arranging all vectors in descending pT order
    auto TrueLep_Vec_sorted = Take(TrueLep_Vec, Reverse(Argsort(TrueLepPT_Vec)));
    
    return TrueLep_Vec_sorted;
    
    }

and I’ve added the line ROOT.gSystem.CompileMacro("helper_functions.cpp", "kO") as suggested, but this yields the following error:

In file included from input_line_9:6:
./helper_functions.cpp:7:29: error: no member named 'Math' in namespace 'ROOT'
float ComputeMT(T p1, ROOT::Math::PtEtaP...
                      ~~~~~~^
./helper_functions.cpp:15:27: error: no template named 'RVec' in namespace 'ROOT'
using Vec_t = const ROOT::RVec<float>&;
                    ~~~~~~^
[...]

This means that the Root objects are not being recognised; do I need to import or declare anything additional in the Macro?

Reducing the amount of work done by that function by more than 2/3rds only reduced the total runtime by 14%? :thinking: maybe that’s not the hotspot then. We can take a look at where the program spends its time with tools like perf.

You might have to add some #include, e.g. #include <ROOT/RVec.hxx> and #include <Math/Vector4D.h>.

1 Like

Just random suggestion, dont you better use const reference for all Vec_t arguments to avoid copies ?
Edit: i missed your First line or code where you do it.
On another point, maybe some extra reserve/resize of your vectors can help?

1 Like

I’ve added the #include you suggested, and now I am having a problem with repeated function declarations (despite having deleted all of the declarations in the original code and doing it only in the helper_function.cpp macro). I am sending you a simple working code (without the macro implementation) as well as as a small example file, in case you want to have a look: LeptonCuts.py (6.7 KB)
exampleFile.root (951.0 KB)

Hi @yburkard ,

thank you for the reproducer. The first important thing is that the following code performs 5 loops over the full dataset, one for each GetValue():

cut_2l = '(TrueEle.size()+TrueMuon.size()>=2)'
df = df.Filter(cut_2l)
print(df.Count().GetValue())

cut_pT = 'TrueLep[0].Pt()>15 && TrueLep[1].Pt()>10'
df = df.Filter(cut_pT)
print(df.Count().GetValue())

cut_charges = '(TrueEle.size()==2 && abs(TrueEleCharge[0]+TrueEleCharge[1])==0) || '+\
    '(TrueEle.size()==1 && abs(TrueMuonCharge[0])==1) || '+\
    '(TrueEle.size()==0 && abs(TrueMuonCharge[0]+TrueMuonCharge[1])==0)'
df = df.Filter(cut_charges)
print(df.Count().GetValue())

cut_mll = '(TrueLep[0]+TrueLep[1]).M()>12'
df = df.Filter(cut_mll)
print(df.Count().GetValue())

cut_btag = 'Jet.PT[Jet.PT>10 && Jet.BTag==1].size()==0'
df = df.Filter(cut_btag)
print(df.Count().GetValue())

Instead, you want to write your code so that you first register all operations to be performed, and then you access the results, see also the Introduction and the crash course in the user guide:

cut_2l = '(TrueEle.size()+TrueMuon.size()>=2)'
df = df.Filter(cut_2l)
c1 = df.Count()

cut_pT = 'TrueLep[0].Pt()>15 && TrueLep[1].Pt()>10'
df = df.Filter(cut_pT)
c2 = df.Count()

cut_charges = '(TrueEle.size()==2 && abs(TrueEleCharge[0]+TrueEleCharge[1])==0) || '+\
    '(TrueEle.size()==1 && abs(TrueMuonCharge[0])==1) || '+\
    '(TrueEle.size()==0 && abs(TrueMuonCharge[0]+TrueMuonCharge[1])==0)'
df = df.Filter(cut_charges)
c3 = df.Count()

cut_mll = '(TrueLep[0]+TrueLep[1]).M()>12'
df = df.Filter(cut_mll)
c4 = df.Count()

cut_btag = 'Jet.PT[Jet.PT>10 && Jet.BTag==1].size()==0'
df = df.Filter(cut_btag)
c5 = df.Count()

print([c.GetValue() for c in [c1, c2, c3, c4, c5]])
print("total execution time: ", time.time()-start_time)

With that fix, and CompileMacro instead of Declare, your program would look like this:

LeptonCuts.py (2.8 KB)
helpers.cpp (4.2 KB)

For this small input, runtime is completely dominated by setup time (in particular just-in-time compilation of the expressions that need to be executed during the RDF event loop). Running on a chain made of 1000 copies of that input file then gives me a processing rate of ~27 kEvts/s on a single thread on my laptop. The number will go up by adding ROOT.EnableImplicitMT() to run a multi-thread event loop (on my laptop, reading data from the filesystem cache, I got up to ~160 kEvts/s at 8 threads with 2k times the input file). If at this point things are still not fast enough we could do some simple performance profiling to check for hotspots.

Cheers,
Enrico

P.S.
you can get the actual event loop time directly from RDataFrame like it’s described in this section of the user guide. That way you can distinguish RDF setup time (constant w.r.t. the dataset size) from actual event loop time (which will scale roughly linearly with the number of events).

1 Like

Hi @eguiraud, apologies for the late reply. I’ve implemented the changes you suggested and they have reduced the execution time enormously (about 85%); what mainly contributed to this reduction was calling GetValue() only once – I was not aware that this command would loop over the full dataset; thank your very much for your help! :slight_smile: