Canonical Way to group-by in RDataFrame

Hello,
Happy New Year!

I am implementing group-by-like counting for my analyses using ROOT’s RDataFrame, but my current approach has significant performance problems. I wanted to check whether the method I am using is the canonical way to do this, or whether I am missing optimizations that could make it more efficient.

I have included a minimal reproducible example of my current implementation:

# %%
import ROOT
import numpy as np
# %%
df = ROOT.RDataFrame(100)
df = df.Define("angle", "gRandom->Uniform(0, 3.14)")
df = df.Define("nTracks", "gRandom->Integer(3)")
# %%
# The idea is to bin the angles and count the events with 0, 1, or 2 tracks in each bin
bins = np.linspace(0, 3.14, 5, dtype=np.double)
model = ROOT.RDF.TH1DModel("angle", "angle", bins.size - 1, bins)
angle_hist = df.Histo1D(model, "angle")

# %%
count_two_tracks = np.zeros(bins.size - 1, dtype=np.double)
count_one_track = np.zeros(bins.size - 1, dtype=np.double)
count_zero_track = np.zeros(bins.size - 1, dtype=np.double)
for bin_idx in range(bins.size - 1):
    bin_low = bins[bin_idx]
    bin_high = bins[bin_idx+1]
    print(f"Processing bin ({bin_idx}):\t{bin_low:.3f} - {bin_high:.3f}")
    rdf_bin = df.Filter(f"angle > {bin_low} && angle < {bin_high}", f"{bin_low} < angle < {bin_high}")
    count_two_tracks[bin_idx] = rdf_bin.Filter("nTracks == 2").Count().GetValue()
    count_one_track[bin_idx] = rdf_bin.Filter("nTracks == 1").Count().GetValue()
    count_zero_track[bin_idx] = rdf_bin.Filter("nTracks == 0").Count().GetValue()
# %%
# Make histograms of counts
two_track_hist = ROOT.TH1D("two_track_hist", "two_track_hist", bins.size - 1, bins)
for idx, bin_count in enumerate(count_two_tracks):
    two_track_hist.SetBinContent(idx+1, bin_count)

The goal is to bin the angles and count the events with 0, 1, or 2 tracks in each bin (in this example gRandom->Integer(3) only produces the values 0, 1, and 2). The logic works as expected, but it is slow, particularly when scaling to larger datasets.
Are there any suggestions for improving the implementation?

Thanks.


ROOT Version: 6.35.01
Platform: linuxx8664gcc
Compiler: Not Provided


Happy New Year to you too!
I guess @vpadulan can help, but maybe with some delay…

Edit: I moved the GetValue() calls to after the for loop:

# count_two_tracks is now a list of the result objects returned by Count(); GetValue() is only called after the for loop.
count_two_tracks = [count.GetValue() for count in count_two_tracks]

Performance is still not great.

Hi,

Thanks for the post.
What I think can be improved are these lines:

count_two_tracks[bin_idx] = rdf_bin.Filter("nTracks == 2").Count().GetValue()
count_one_track[bin_idx] = rdf_bin.Filter("nTracks == 1").Count().GetValue()
count_zero_track[bin_idx] = rdf_bin.Filter("nTracks == 0").Count().GetValue()

Each of these lines sits inside the for loop, and each GetValue() call triggers a full event loop over your dataset.
What I’d suggest is to store the result pointers in collections and to access their values only after the for loop, so that all the counts are filled in a single event loop.
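
To illustrate the difference, here is a toy model in plain Python. ToyFrame and ToyResult are hypothetical stand-ins written for this sketch, not ROOT classes; they only mimic the idea that booking a count is cheap and that asking for a value runs one pass over the data for everything booked so far.

```python
class ToyResult:
    """Lazy handle to a count (hypothetical; only mimics a result pointer)."""

    def __init__(self, frame, predicate):
        self.frame = frame
        self.predicate = predicate
        self.value = None  # filled in by ToyFrame.run_event_loop()

    def get_value(self):
        # Trigger an event loop only if this result has not been computed yet.
        if self.value is None:
            self.frame.run_event_loop()
        return self.value


class ToyFrame:
    """Toy lazy dataframe (hypothetical; not ROOT's API)."""

    def __init__(self, rows):
        self.rows = rows
        self.booked = []  # every result booked on this frame
        self.n_loops = 0  # number of passes over the data so far

    def count(self, predicate):
        # Booking only registers the request; no data is read here.
        result = ToyResult(self, predicate)
        self.booked.append(result)
        return result

    def run_event_loop(self):
        # One pass over the data fills every result that is still pending.
        self.n_loops += 1
        pending = [r for r in self.booked if r.value is None]
        for r in pending:
            r.value = 0
        for row in self.rows:
            for r in pending:
                if r.predicate(row):
                    r.value += 1


rows = [(0.1, 0), (0.5, 1), (1.2, 1), (2.9, 2)]  # (angle, nTracks)

# Asking for the value right after booking: one pass per count.
eager = ToyFrame(rows)
eager_counts = [eager.count(lambda row, n=n: row[1] == n).get_value() for n in range(3)]

# Booking everything first, then asking: a single pass for all counts.
lazy = ToyFrame(rows)
booked = [lazy.count(lambda row, n=n: row[1] == n) for n in range(3)]
lazy_counts = [result.get_value() for result in booked]

print(eager.n_loops, lazy.n_loops)
```

Both variants give the same counts; only the number of passes over the data differs.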

Full example below: if I made mistakes understanding the code, please bear with me!

Cheers,
D

# %%
import ROOT
import numpy as np
# %%
df = ROOT.RDataFrame(100)
df = df.Define("angle", "gRandom->Uniform(0, 3.14)")
df = df.Define("nTracks", "gRandom->Integer(3)")
# %%
# The idea is to bin the angles and count the events with 0, 1, or 2 tracks in each bin
bins = np.linspace(0, 3.14, 5, dtype=np.double)
model = ROOT.RDF.TH1DModel("angle", "angle", bins.size - 1, bins)
angle_hist = df.Histo1D(model, "angle")

# %%
count_two_tracks = []
count_one_track = []
count_zero_track = []
for bin_idx in range(bins.size - 1):
    bin_low = bins[bin_idx]
    bin_high = bins[bin_idx+1]
    print(f"Processing bin ({bin_idx}):\t{bin_low:.3f} - {bin_high:.3f}")
    rdf_bin = df.Filter(f"angle > {bin_low} && angle < {bin_high}", f"{bin_low} < angle < {bin_high}")
    count_two_tracks.append(rdf_bin.Filter("nTracks == 2").Count())
    count_one_track.append(rdf_bin.Filter("nTracks == 1").Count())
    count_zero_track.append(rdf_bin.Filter("nTracks == 0").Count())

# %%
# Make histograms of counts
two_track_hist = ROOT.TH1D("two_track_hist", "two_track_hist", bins.size - 1, bins)
for idx, bin_count in enumerate(count_two_tracks):
    two_track_hist.SetBinContent(idx+1, bin_count.GetValue())
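
As a cross-check of the counting logic, the same group-by can be written as one two-dimensional histogram over (angle, nTracks), which avoids the per-bin filters altogether. The sketch below uses only NumPy on randomly generated arrays, so the variable names and the half-integer track edges are assumptions made for this illustration. In RDataFrame the analogous step would presumably be a single Histo2D booking on the same two columns, but that variant is not shown or tested here.

```python
import numpy as np

# Stand-in data, generated with NumPy instead of gRandom (assumption for this sketch).
rng = np.random.default_rng(seed=1)
n_events = 100
angle = rng.uniform(0.0, 3.14, size=n_events)
n_tracks = rng.integers(0, 3, size=n_events)  # values 0, 1, 2

# Same angle binning as in the posts above.
angle_edges = np.linspace(0.0, 3.14, 5)
# Half-integer edges so that each integer track count gets its own bin.
track_edges = np.array([-0.5, 0.5, 1.5, 2.5])

# One pass over the data: counts[i, j] is the number of events
# in angle bin i with j tracks.
counts, _, _ = np.histogram2d(angle, n_tracks, bins=[angle_edges, track_edges])
counts = counts.astype(int)

# Columns correspond to nTracks == 0, 1, 2.
count_zero_track, count_one_track, count_two_tracks = counts.T

print(counts)
```

Each column of counts can then be copied into a TH1D with SetBinContent, as in the example above.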