Groupby in RDataFrame

mwilkins · June 7, 2022, 2:29pm

These days, I do almost all of my analysis using RDataFrame in PyROOT, which is, overall, a very robust and easy-to-use framework. After many, many hours with it, my biggest complaint is the lack of a groupby capability, which forces me to introduce a pandas dependency and temporarily leave ROOT using the (awkward and slow) AsNumpy and MakeNumpyDataFrame functions.

@eguiraud’s excellent talk to the LHCb collaboration today on RDataFrame makes clear the need for such a feature. Many of the examples he gave in the talk assume that each event is assigned a single row in the datasource, but this is not how standard LHCb software produces TTrees. In LHCb, each candidate particle is assigned a row, so if you want to compare candidates within a given event, a feature like groupby (or TTreeIndex) is essential to enable nifty features like index sorting and so forth.

What I would like to see is something like

df = RDataFrame(...)
gdf = df.groupby(["runNumber", "eventNumber"], "ncands")
h_ncands = gdf.Histo1D("ncands")  # histogram the number of candidates per event
h_pt = gdf.Histo1D("muon_pt")  # identical output to df.Histo1D("muon_pt")
h_pt_first = gdf.Histo1D("muon_pt[0]")  # histogram the pt of the first muon in each event

where groupby creates a new RDataFrame gdf with one row per unique combination of runNumber and eventNumber. The second argument names a new integer column, ncands, which holds the number of rows in df that share a given combination of runNumber and eventNumber. The other columns in gdf have the same names as the columns in df but are now RVecs of length ncands. Only the runNumber and eventNumber columns are not replaced with RVecs; they retain their types and values from df.
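To make the intended semantics concrete, here is a minimal pure-Python sketch of the grouping described above. It is only an illustration: group_rows is a hypothetical helper, not an existing ROOT or RDataFrame function, plain lists stand in for RVecs, and the whole dataset is assumed to fit in memory.

```python
def group_rows(rows, keys, count_name):
    """Collapse per-candidate rows into one row per unique key combination.

    Hypothetical helper for illustration only; not part of ROOT.
    """
    grouped = {}  # dicts preserve insertion order, so groups appear in first-seen order
    for row in rows:
        key = tuple(row[k] for k in keys)
        # Start a new output row holding the scalar key columns and a zero count.
        out = grouped.setdefault(key, {**{k: row[k] for k in keys}, count_name: 0})
        out[count_name] += 1
        # Every non-key column becomes a list (standing in for an RVec).
        for name, value in row.items():
            if name not in keys:
                out.setdefault(name, []).append(value)
    return list(grouped.values())


# One row per candidate, as in LHCb-style ntuples; the values are made up.
candidates = [
    {"runNumber": 1, "eventNumber": 10, "muon_pt": 5.0},
    {"runNumber": 1, "eventNumber": 11, "muon_pt": 7.5},
    {"runNumber": 1, "eventNumber": 10, "muon_pt": 3.2},
]

events = group_rows(candidates, ["runNumber", "eventNumber"], "ncands")
first_muon_pt = [event["muon_pt"][0] for event in events]  # analogue of "muon_pt[0]"
```

With this toy input, events holds two rows: the first has ncands equal to 2 and muon_pt equal to [5.0, 3.2], the second has ncands equal to 1 and muon_pt equal to [7.5].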

This has come up in a couple other posts (here and here), but since @eguiraud solicited feedback in today’s talk, I thought it would be useful to focus the discussion in a new post.

wiso · June 7, 2022, 4:24pm

I just want to confirm that this would be a very nice feature. I understand it may be complicated to implement.
I worked in LHCb more than 10 years ago; good to see that 1 row = 1 particle is still used.

eguiraud · June 8, 2022, 7:54am

Hi @mwilkins , @wiso ,

thank you for the feedback, much appreciated. I especially appreciate that you put actual thought into what the feature would look like and how it would work!

I hear you, and I agree it would be nice to have this. The two main challenges are implementing it in a way that does not require the whole dataset to fit in memory, and then figuring out how to actually make it work with RDF’s internal implementation, which was not designed for something like this. Point taken, though!

Cheers,
Enrico

wiso · June 8, 2022, 8:16am

Right, I expected this to be complicated, and I think you cannot assume the dataset to be sorted by the grouping key (also because the grouping key could be an expression).
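A small sketch of why the sorting assumption matters, using only the Python standard library (the key values are made up). itertools.groupby only merges adjacent equal keys, so it can stream over sorted input with constant memory but splits groups when the input is unsorted; a hash-based count handles any order, but has to keep one entry per group in memory.

```python
from collections import Counter
from itertools import groupby

# (runNumber, eventNumber) keys in arrival order; note (1, 10) is not contiguous.
keys = [(1, 10), (1, 11), (1, 10)]

# Streaming over adjacent runs: correct only if the input is sorted by key.
contiguous = [(key, sum(1 for _ in run)) for key, run in groupby(keys)]

# Hash-based counting: order-independent, memory grows with the number of groups.
hashed = Counter(keys)
```

Here contiguous contains three runs of length one, so the event (1, 10) is wrongly split in two, while hashed reports two candidates for (1, 10) and one for (1, 11).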

By the way, dask can do that, and it should work both out of memory and distributed: see 02-groupby.ipynb in the dask/dask-examples repository on GitHub.

Axel · June 9, 2022, 7:52am

The performance (memory + CPU) of this depends very much on the aggregate. The dask documentation says:

"These are generally fairly efficient, assuming that the number of groups is small (less than a million)."

That won’t be enough for many HEP use cases… but we can certainly take some inspiration from what they do!