How to slim the branch array in RDataFrame

I know that RDataFrame lets me filter events by defining a function. However, I would like to know whether there is also a way to slim the branch arrays themselves.

Let me use the nMuon branch as an example:

The dataset Events is a TTree (to first order, a “table”) with the following branches (also referred to as “columns”):

Branch name | Data type | Description
nMuon | unsigned int | Number of muons in this event
Muon_pt | float[nMuon] | Transverse momentum of the muons, stored as an array of size nMuon
Muon_eta | float[nMuon] | Pseudo-rapidity of the muons, stored as an array of size nMuon
Muon_phi | float[nMuon] | Azimuth of the muons, stored as an array of size nMuon
Muon_charge | int[nMuon] | Charge of the muons (either -1 or 1), stored as an array of size nMuon
Muon_mass | float[nMuon] | Mass of the muons, stored as an array of size nMuon

For instance, what should I do if I only want to keep the array elements with ‘Muon_pt[x] == 10’ and discard all the other elements of the Muon_pt[nMuon] array (and do the same for the other Muon_xx[nMuon] arrays)?
I am not sure whether I have described my question clearly. In my real case, the branch arrays are rather big and I want to slim them by applying some selections/filters.


@eguiraud Do you have a good suggestion?

Hi!

You have to define a new column (= branch) that holds the slimmed version of the collection. See the following snippet!

import ROOT

# Make dataframe from part of the original file
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame("Events", filename).Range(1000)

# Slim down the branches
df2 = df.Define("mask", "Muon_pt > 10")\
        .Define("nSlimmedMuon", "Sum(mask)")\
        .Define("SlimmedMuon_pt", "Muon_pt[mask]")\
        .Define("SlimmedMuon_eta", "Muon_eta[mask]")

# Snapshot the branches to a new file
columns = ROOT.std.vector("string")()
for c in ("nMuon", "Muon_pt", "Muon_eta", "nSlimmedMuon", "SlimmedMuon_pt", "SlimmedMuon_eta"):
    columns.push_back(c)
df2.Snapshot("Events", "slimmed.root", columns)
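
To apply the same mask to the remaining Muon_xx arrays without writing each Define by hand, the calls can be generated in a loop. The following is only a minimal sketch: the helper name define_slimmed, its parameters and the "Slimmed" prefix are hypothetical, and it relies on nothing beyond the Define calls used in the snippet above.

```python
def define_slimmed(df, mask_expr, columns, prefix="Slimmed", count_name="nSlimmedMuon"):
    """Book a mask, a counter and one masked copy per array column.

    `df` is expected to be an RDataFrame node, i.e. an object whose Define
    method returns the next node. All names introduced here are illustrative.
    Returns the last node and the list of newly defined column names.
    """
    df = df.Define("mask", mask_expr)
    # Sum(mask) adds up the ones in the mask, i.e. the number of kept elements.
    df = df.Define(count_name, "Sum(mask)")
    new_names = [count_name]
    for col in columns:
        new_name = prefix + col  # e.g. Muon_pt -> SlimmedMuon_pt
        df = df.Define(new_name, col + "[mask]")
        new_names.append(new_name)
    return df, new_names
```

With the dataframe from the snippet, a call such as define_slimmed(df, "Muon_pt > 10", ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge", "Muon_mass"]) would return the new node together with the column names to pass on to Snapshot.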

Is this the solution you are looking for?

Best
Stefan

Hi Stefan, thank you very much. I tried your method this afternoon and it seems to work for my case.
Just one more thing: is it possible to keep the original branch names instead of defining new ones?

Good question! Unfortunately, I don’t see a way to keep them. Perhaps @eguiraud knows a trick to do so?

I just want to clarify why this works:

The string Muon_pt > 10 in the first Define call creates the mask, which is basically a vector of ones and zeros (an RVec<int>) with one entry per muon. This is possible because RDataFrame reads array branches as ROOT::RVec, which adds these numpy-like features on top of the interface of a std::vector.

See the docs for RVec here.
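
To make the mask mechanics concrete, here is a plain-Python analogy for a single event. It only mimics what the RVec expressions in the snippet do and does not use ROOT; the function names and the sample values are made up for illustration.

```python
def make_mask(values, threshold):
    """Mimic `Muon_pt > threshold`: one 0/1 entry per array element."""
    return [1 if v > threshold else 0 for v in values]


def apply_mask(values, mask):
    """Mimic `Muon_eta[mask]`: keep the elements whose mask entry is non-zero."""
    return [v for v, keep in zip(values, mask) if keep]


# Made-up values for one event with three muons.
muon_pt = [25.0, 4.5, 12.0]
muon_eta = [0.3, -1.2, 2.1]

mask = make_mask(muon_pt, 10)              # [1, 0, 1]
n_slimmed = sum(mask)                      # 2, the analogue of Sum(mask)
slimmed_pt = apply_mask(muon_pt, mask)     # [25.0, 12.0]
slimmed_eta = apply_mask(muon_eta, mask)   # [0.3, 2.1]
print(mask, n_slimmed, slimmed_pt, slimmed_eta)
```

The same mask is reused for every array, which is why the slimmed columns stay aligned element by element.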

Hi,
this is not possible at the moment, but we want to introduce Redefine in the future to hide an existing column behind a new definition. You can follow the ticket here, although at the moment it is not very high-priority.

Cheers,
Enrico
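
Later ROOT releases (6.26 onwards, to my knowledge) do provide Redefine, so on a recent version the original branch names can probably be kept. The following is a minimal sketch under that assumption: the helper name slim_in_place and its parameters are hypothetical, and the mask is booked before any array is overwritten so that it is computed from the original values.

```python
def slim_in_place(df, mask_expr, array_columns, count_column=None, mask_name="mask"):
    """Overwrite each array column with its masked version, keeping its name.

    Assumes `df` is an RDataFrame node from a ROOT version that provides
    Redefine. The helper and the default mask column name are illustrative.
    """
    # Book the mask first, while the columns still hold the original arrays.
    df = df.Define(mask_name, mask_expr)
    for col in array_columns:
        df = df.Redefine(col, col + "[" + mask_name + "]")
    if count_column is not None:
        # Keep the size column consistent with the slimmed arrays.
        df = df.Redefine(count_column, "Sum(" + mask_name + ")")
    return df
```

For example, slim_in_place(df, "Muon_pt > 10", ["Muon_pt", "Muon_eta"], count_column="nMuon") would leave Muon_pt and Muon_eta holding only the selected muons, and the resulting node can be passed to Snapshot with the original column names.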
