How to slim the branch array in RDataFrame

yanzhepg · January 30, 2020, 8:10am

I know that RDataFrame can filter events by defining a function. However, I would like to know whether there is also a way to slim the branch arrays.

Let me use the nMuon branch as an example.

The dataset Events is a TTree (to first order, a “table”) and has the following branches (also referred to as “columns”):

| Branch name | Data type | Description |
| --- | --- | --- |
| `nMuon` | `unsigned int` | Number of muons in this event |
| `Muon_pt` | `float[nMuon]` | Transverse momentum of the muons stored as an array of size `nMuon` |
| `Muon_eta` | `float[nMuon]` | Pseudo-rapidity of the muons stored as an array of size `nMuon` |
| `Muon_phi` | `float[nMuon]` | Azimuth of the muons stored as an array of size `nMuon` |
| `Muon_charge` | `int[nMuon]` | Charge of the muons stored as an array of size `nMuon` and either -1 or 1 |
| `Muon_mass` | `float[nMuon]` | Mass of the muons stored as an array of size `nMuon` |

For instance, what should I do if I only want to keep the array elements for which ‘Muon_pt[x] == 10’ and throw away all the other elements of the Muon_pt[nMuon] array (and do the same for the other Muon_xx[nMuon] arrays)?
I am not sure whether I have described my question clearly. In my real case, the branch arrays are rather big and I want to slim them by applying some selections/filters.

jblomer · January 30, 2020, 10:21am

@eguiraud Do you have a good suggestion?

swunsch · January 30, 2020, 12:06pm

Hi!

You have to define a new column (= branch) that holds the slimmed version of the collection. See the following snippet:

```python
import ROOT

# Make dataframe from part of the original file
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame("Events", filename).Range(1000)

# Slim down the branches
df2 = df.Define("mask", "Muon_pt > 10")\
        .Define("nSlimmedMuon", "Sum(mask)")\
        .Define("SlimmedMuon_pt", "Muon_pt[mask]")\
        .Define("SlimmedMuon_eta", "Muon_eta[mask]")

# Snapshot the branches to a new file
columns = ROOT.std.vector("string")()
for c in ("nMuon", "Muon_pt", "Muon_eta", "nSlimmedMuon", "SlimmedMuon_pt", "SlimmedMuon_eta"):
    columns.push_back(c)
df2.Snapshot("Events", "slimmed.root", columns)
```
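
Since the question mentions applying the same selection to the other Muon_xx arrays, the Define calls could also be generated in a loop rather than written out by hand. The following is only a sketch: `build_slim_defines` and `apply_defines` are hypothetical helper names that are not part of ROOT, and the `Slimmed` prefix simply follows the naming used in the snippet above. The only RDataFrame method assumed is `Define(name, expression)`, as already used there.

```python
def build_slim_defines(columns, mask="mask", prefix="Slimmed"):
    """Return (new_column_name, expression) pairs, one per input column.

    Hypothetical helper: it only builds strings, it does not touch ROOT.
    """
    return [(f"{prefix}{name}", f"{name}[{mask}]") for name in columns]


def apply_defines(df, defines):
    """Chain df.Define(name, expr) for every pair and return the last node.

    `df` is expected to be an RDataFrame node (anything with a Define method).
    """
    for name, expr in defines:
        df = df.Define(name, expr)
    return df


# Array branches listed in the question's table.
muon_columns = ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge", "Muon_mass"]
defines = build_slim_defines(muon_columns)

# With the dataframe from the snippet above, this would be used roughly as:
#   df2 = apply_defines(df.Define("mask", "Muon_pt > 10"), defines)
```

The new column names in `defines` would then also have to be added to the list passed to Snapshot, together with a counter column such as `nSlimmedMuon`.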

Is this the solution you are looking for?

Best
Stefan

yanzhepg · January 30, 2020, 6:46pm

Hi Stefan, thank you very much. I tried your method this afternoon and it seems to work for my case.
Just one more thing: is it possible to keep the original branch names instead of defining new names?

swunsch · January 30, 2020, 9:09pm

Good question! But I don’t see a way to keep them unfortunately. Probably @eguiraud knows a trick to do so?

swunsch · January 30, 2020, 9:13pm

I just want to clarify why this works:

The string Muon_pt > 10 in the first Define call creates the mask, which is basically a vector<int> of ones and zeros with one entry per muon. This is possible because RDataFrame exposes such array columns as ROOT::RVec, which adds numpy-like features (such as element-wise comparison and indexing with a mask) on top of the interface of a std::vector.

See the docs for RVec here.
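
To make the per-event logic concrete, here is a plain-Python sketch of what the three kinds of expressions do for a single event. It deliberately does not use ROOT or RVec; `make_mask` and `apply_mask` are hypothetical helpers that only mimic the semantics, and the muon values are made up for illustration.

```python
def make_mask(values, threshold):
    """Element-wise comparison: 1 where value > threshold, else 0."""
    return [int(v > threshold) for v in values]


def apply_mask(values, mask):
    """Keep only the elements whose mask entry is non-zero."""
    return [v for v, keep in zip(values, mask) if keep]


# One hypothetical event with three muons (made-up numbers).
muon_pt = [25.3, 4.1, 12.7]
muon_eta = [0.5, -1.2, 2.1]

mask = make_mask(muon_pt, 10)             # mimics "Muon_pt > 10"
n_slimmed = sum(mask)                     # mimics "Sum(mask)"
slimmed_pt = apply_mask(muon_pt, mask)    # mimics "Muon_pt[mask]"
slimmed_eta = apply_mask(muon_eta, mask)  # mimics "Muon_eta[mask]"
```

Because the same mask is applied to every array, the slimmed arrays stay aligned: entry i of `slimmed_pt` and entry i of `slimmed_eta` still belong to the same muon.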

eguiraud · January 30, 2020, 9:34pm

Hi,
this is not possible at the moment, but we want to introduce Redefine in the future, which would hide an existing column behind a new definition. You can follow the ticket here, although at the moment it is not very high-priority.

Cheers,
Enrico
