I know there is a way to filter events in RDataFrame by defining a function. However, I would like to know whether there is also a way to slim the branch arrays themselves.
Let me use the nMuon branch as an example.
The dataset Events is a TTree (to first order, a “table”) and has the following branches (also referred to as “columns”):
| Branch name | Data type | Description |
|---|---|---|
| nMuon | unsigned int | Number of muons in this event |
| Muon_pt | float[nMuon] | Transverse momentum of the muons stored as an array of size nMuon |
| Muon_eta | float[nMuon] | Pseudo-rapidity of the muons stored as an array of size nMuon |
| Muon_phi | float[nMuon] | Azimuth of the muons stored as an array of size nMuon |
| Muon_charge | int[nMuon] | Charge of the muons stored as an array of size nMuon and either -1 or 1 |
| Muon_mass | float[nMuon] | Mass of the muons stored as an array of size nMuon |
For instance, what should I do if I only want to keep the array elements for which ‘Muon_pt[x] == 10’ and throw away all the other elements of the Muon_pt[nMuon] array (and do the same for the other Muon_xx[nMuon] arrays)?
I am not sure whether I have described my question clearly. In my real case, the branch arrays are rather big and I want to slim them by applying some selections/filters.
You have to define a new column (=branch) with the slimmed version of the collection. See the following snippet!
import ROOT
# Make dataframe from part of the original file
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame("Events", filename).Range(1000)
# Slim down the branches
df2 = df.Define("mask", "Muon_pt > 10")\
        .Define("nSlimmedMuon", "Sum(mask)")\
        .Define("SlimmedMuon_pt", "Muon_pt[mask]")\
        .Define("SlimmedMuon_eta", "Muon_eta[mask]")
# Snapshot the branches to a new file
columns = ROOT.std.vector("string")()
for c in ("nMuon", "Muon_pt", "Muon_eta", "nSlimmedMuon", "SlimmedMuon_pt", "SlimmedMuon_eta"):
    columns.push_back(c)
df2.Snapshot("Events", "slimmed.root", columns)
Hi Stefan, thank you very much. I tried your method this afternoon and it seems to work for my case.
Just one more thing: is it possible to keep the original branch names instead of defining new names?
The string Muon_pt > 10 in the first Define call creates the mask, which is basically a vector<int> with ones and zeros. This is possible because RDataFrame exposes array branches as ROOT::RVec, which adds these numpy-like features on top of the interface of a std::vector.
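To make the mask logic concrete, here is a small plain-Python sketch of what happens for a single event. It does not use ROOT at all: the helper slim_event and the example values are made up for illustration, and lists stand in for the RVec columns. It only mirrors the idea of the Define calls in the snippet (build a 0/1 mask, count it, apply it to every per-muon array).

```python
# Plain-Python illustration of per-event masking (no ROOT needed).
# slim_event and the numbers below are hypothetical, for illustration only.

def slim_event(event, threshold=10.0):
    """Return the 0/1 mask and the slimmed per-muon arrays for one event."""
    # Counterpart of "Muon_pt > 10": one 0/1 entry per muon
    mask = [1 if pt > threshold else 0 for pt in event["Muon_pt"]]
    # Counterpart of "Sum(mask)": number of muons that pass the selection
    slimmed = {"nSlimmedMuon": sum(mask)}
    # Apply the same mask to every per-muon array so they stay aligned
    for name in ("Muon_pt", "Muon_eta"):
        slimmed["Slimmed" + name] = [
            value for value, keep in zip(event[name], mask) if keep
        ]
    return mask, slimmed


event = {"nMuon": 3, "Muon_pt": [25.3, 4.1, 12.7], "Muon_eta": [0.5, -1.2, 2.1]}
mask, slimmed = slim_event(event)
print(mask)     # [1, 0, 1]
print(slimmed)  # the second muon (pt = 4.1) is dropped from both arrays
```

The important point is that one mask is reused for all Muon_xx arrays, so the slimmed pt and eta entries still refer to the same muons.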
Hi,
not possible at the moment, but we want to introduce Redefine in the future so that a new definition can hide an existing column of the same name. You can follow the ticket here, although at the moment it’s not very high-priority.