These days, I do almost all of my analysis using RDataFrame in PyROOT, which is, overall, a very robust and easy-to-use framework. After many, many hours with it, my biggest complaint is the lack of groupby capability, which forces me to introduce a pandas dependency and temporarily leave ROOT using the (awkward and slow) AsNumpy and MakeNumpyDataFrame functions.
@eguiraud’s excellent talk to the LHCb collaboration today on RDataFrame makes clear the need for such a feature. Many of the examples he gave in the talk assume that each event is assigned a single row in the data source, but this is not how standard LHCb software produces TTrees. In LHCb, each candidate particle is assigned a row, so if you want to compare candidates within a given event, something like groupby (or TTreeIndex) is essential to enable nifty features like index sorting and so forth.
What I would like to see is something like:
df = RDataFrame(...)
gdf = df.groupby(["runNumber", "eventNumber"], "ncands")
h_ncands = gdf.Histo1D("ncands") # histogram the number of candidates per event
h_pt = gdf.Histo1D("muon_pt") # identical output to df.Histo1D("muon_pt")
h_pt_first = gdf.Histo1D("muon_pt[0]") # histogram the pt of the first muon in each event
where groupby creates a new RDataFrame gdf with a number of rows equal to the number of unique runNumber and eventNumber combinations. The columns in gdf have the same names as the columns in df but are now RVecs of length ncands, which is a new column of integers representing the number of rows in df with a given combination of runNumber and eventNumber. Only the runNumber and eventNumber columns are not replaced with RVecs; they retain their types and values from df.
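To make the intended semantics concrete, here is a minimal pure-Python sketch of the grouping I have in mind. It is not ROOT code: the helper group_rows and the toy data are hypothetical, plain dicts of lists stand in for the dataframe, and Python lists stand in for RVecs.

```python
from collections import defaultdict


def group_rows(columns, keys, count_name="ncands"):
    """Group a dict of equal-length columns by the given key columns.

    Hypothetical helper, not part of ROOT. Returns a new dict of columns with
    one row per unique key combination, in first-seen order.
    """
    n_rows = len(next(iter(columns.values())))

    # Map each unique key tuple to the indices of the rows that carry it.
    groups = defaultdict(list)
    for i in range(n_rows):
        groups[tuple(columns[k][i] for k in keys)].append(i)

    out = {name: [] for name in columns}
    out[count_name] = []
    for indices in groups.values():
        for name, values in columns.items():
            if name in keys:
                # Key columns keep their scalar type and value.
                out[name].append(values[indices[0]])
            else:
                # Every other column becomes a per-group list (RVec stand-in).
                out[name].append([values[i] for i in indices])
        # New integer column: number of original rows in this group.
        out[count_name].append(len(indices))
    return out


# Toy input with one row per candidate, as in LHCb-style TTrees.
candidates = {
    "runNumber": [1, 1, 1, 2, 2],
    "eventNumber": [10, 10, 11, 10, 10],
    "muon_pt": [5.0, 3.2, 7.1, 2.4, 9.9],
}
events = group_rows(candidates, ["runNumber", "eventNumber"])

# Rough analogue of gdf.Histo1D("muon_pt[0]"): first candidate's pt per event.
first_pt = [pts[0] for pts in events["muon_pt"]]
```

Here events has three rows, one per unique (runNumber, eventNumber) pair, and events["ncands"] gives the candidate multiplicity that the first histogram above would fill.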
This has come up in a couple of other posts (here and here), but since @eguiraud solicited feedback in today’s talk, I thought it would be useful to focus the discussion in a new post.