Canonical Way to group-by in RDataFrame

lost_soul_519 · January 6, 2025, 11:08am

Hello,
Happy New Year!

I am working on implementing group-by-like counting for my analyses using ROOT’s RDataFrame. However, I have encountered significant performance problems with my current approach. I wanted to check whether the method I am using is the canonical way to achieve this, or whether there are optimizations I am missing that could improve its efficiency.

I have included a minimal reproducible example of my current implementation:

# %%
import ROOT
import numpy as np
# %%
df = ROOT.RDataFrame(100)
df = df.Define("angle", "gRandom->Uniform(0, 3.14)")
df = df.Define("nTracks", "gRandom->Integer(3)")
# %%
#  The idea is to bin the angles and find the number of events with 0, 1, 2, 3 tracks in each bin
bins = np.linspace(0, 3.14, 5, dtype=np.double)
model = ROOT.RDF.TH1DModel("angle", "angle", bins.size - 1, bins)
angle_hist = df.Histo1D(model, "angle")

# %%
count_two_tracks = np.zeros(bins.size - 1, dtype=np.double)
count_one_track = np.zeros(bins.size - 1, dtype=np.double)
count_zero_track = np.zeros(bins.size - 1, dtype=np.double)
for bin_idx in range(bins.size - 1):
    bin_low = bins[bin_idx]
    bin_high = bins[bin_idx+1]
    print(f"Processing bin ({bin_idx}):\t{bin_low:.3f} - {bin_high:.3f}")
    rdf_bin = df.Filter(f"angle > {bin_low} && angle < {bin_high}", f"{bin_low} < angle < {bin_high}")
    count_two_tracks[bin_idx] = rdf_bin.Filter("nTracks == 2").Count().GetValue()
    count_one_track[bin_idx] = rdf_bin.Filter("nTracks == 1").Count().GetValue()
    count_zero_track[bin_idx] = rdf_bin.Filter("nTracks == 0").Count().GetValue()
# %%
# Make histograms of counts
two_track_hist = ROOT.TH1D("two_track_hist", "two_track_hist", bins.size - 1, bins)
for idx, bin_count in enumerate(count_two_tracks):
    two_track_hist.SetBinContent(idx+1, bin_count)

The goal is to bin angles and count events with 0, 1, 2, and 3 tracks in each bin. While the logic works as expected, the performance is suboptimal, particularly when scaling to larger datasets.
Are there any suggestions for improving the implementation?

Thanks.

ROOT Version: 6.35.01
Platform: linuxx8664gcc
Compiler: Not Provided

bellenot · January 6, 2025, 4:05pm

Happy New Year to you too!
I guess @vpadulan can help, but maybe with some delay…

lost_soul_519 · January 13, 2025, 11:18am

Edit: I moved the GetValue() call to after the for loop:

# count_two_tracks is now a list of result pointers (RResultPtr), so GetValue() is only invoked after the for loop.
count_two_tracks = [count.GetValue() for count in count_two_tracks]

Performance is still not great.

Danilo · January 13, 2025, 8:35pm

Hi,

Thanks for the post.
What I think can be improved are these lines:

count_two_tracks[bin_idx] = rdf_bin.Filter("nTracks == 2").Count().GetValue()
count_one_track[bin_idx] = rdf_bin.Filter("nTracks == 1").Count().GetValue()
count_zero_track[bin_idx] = rdf_bin.Filter("nTracks == 0").Count().GetValue()

Each of these lines sits inside a loop, and each GetValue() call triggers a full event loop over your dataset.
What I’d suggest is to save the result pointers in collections and access their values only after the loop.

Full example below: if I made mistakes understanding the code, please bear with me!

Cheers,
D

# %%
import ROOT
import numpy as np
# %%
df = ROOT.RDataFrame(100)
df = df.Define("angle", "gRandom->Uniform(0, 3.14)")
df = df.Define("nTracks", "gRandom->Integer(3)")
# %%
#  The idea is to bin the angles and find the number of events with 0, 1, 2, 3 tracks in each bin
bins = np.linspace(0, 3.14, 5, dtype=np.double)
model = ROOT.RDF.TH1DModel("angle", "angle", bins.size - 1, bins)
angle_hist = df.Histo1D(model, "angle")

# %%
count_two_tracks = []
count_one_track = []
count_zero_track = []
for bin_idx in range(bins.size - 1):
    bin_low = bins[bin_idx]
    bin_high = bins[bin_idx+1]
    print(f"Processing bin ({bin_idx}):\t{bin_low:.3f} - {bin_high:.3f}")
    rdf_bin = df.Filter(f"angle > {bin_low} && angle < {bin_high}", f"{bin_low} < angle < {bin_high}")
    count_two_tracks.append(rdf_bin.Filter("nTracks == 2").Count())
    count_one_track.append(rdf_bin.Filter("nTracks == 1").Count())
    count_zero_track.append(rdf_bin.Filter("nTracks == 0").Count())

# %%
# Make histograms of counts
two_track_hist = ROOT.TH1D("two_track_hist", "two_track_hist", bins.size - 1, bins)
for idx, bin_count in enumerate(count_two_tracks):
    two_track_hist.SetBinContent(idx+1, bin_count.GetValue())
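
The difference between calling GetValue() inside the loop and after it can be illustrated with a toy model. This is only a sketch in plain Python: ToyFrame and ToyResult are hypothetical names invented for illustration, and the code is not how RDataFrame is implemented. It only mimics the idea that booked results are filled together the first time any of them is requested.

```python
# Toy model of lazy result booking. ToyFrame and ToyResult are hypothetical
# illustration classes, not part of ROOT.
class ToyResult:
    def __init__(self, frame):
        self.frame = frame
        self.value = None  # filled by ToyFrame.run()

    def get_value(self):
        # Requesting a value that is not yet computed triggers one data pass.
        if self.value is None:
            self.frame.run()
        return self.value


class ToyFrame:
    def __init__(self, events):
        self.events = events
        self.pending = []   # booked (predicate, result) pairs not yet computed
        self.n_loops = 0    # number of passes over the data so far

    def count(self, predicate):
        result = ToyResult(self)
        self.pending.append((predicate, result))
        return result

    def run(self):
        # A single pass over the data fills every pending result at once.
        self.n_loops += 1
        counts = [0] * len(self.pending)
        for event in self.events:
            for i, (predicate, _) in enumerate(self.pending):
                if predicate(event):
                    counts[i] += 1
        for (_, result), n in zip(self.pending, counts):
            result.value = n
        self.pending = []


def make_cut(low, high, n_tracks):
    # Select events in the angle bin [low, high) with exactly n_tracks tracks.
    return lambda ev: low <= ev[0] < high and ev[1] == n_tracks


# Deterministic fake events: (angle, nTracks)
events = [((0.1 * i) % 3.14, i % 3) for i in range(1000)]
edges = [0.0, 0.785, 1.57, 2.355, 3.14]
cuts = [(lo, hi, n) for lo, hi in zip(edges, edges[1:]) for n in range(3)]

# Eager: ask for each value right after booking it.
eager = ToyFrame(events)
eager_counts = [eager.count(make_cut(*c)).get_value() for c in cuts]

# Lazy: book everything first, read the values afterwards.
lazy = ToyFrame(events)
handles = [lazy.count(make_cut(*c)) for c in cuts]
lazy_counts = [h.get_value() for h in handles]

print("passes over the data, eager:", eager.n_loops)
print("passes over the data, lazy: ", lazy.n_loops)
```

In this toy model both variants return the same counts, but the eager one traverses the data once per booked count while the lazy one traverses it once in total.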

lost_soul_519 · January 17, 2025, 12:37pm

Hi Danilo,

Thanks for catching the event loop issue. I did make that modification, but I think the performance concern is still valid, at least for the way I am trying to do it.

I tried making a version of the code using TTrees, and the performance hit for the RDF version appears significant. Let me know if I am testing it wrong.

❯ sh run_py.sh 
Time for Tree:
real	0m32.917s
user	0m39.615s
sys	0m0.380s

Time for RDF:
real	2m40.286s
user	2m45.966s
sys	0m1.618s

The JIT step in the RDF version is a point of contention, since it takes about 30 s:

Info in <[ROOT.RDF] Info /user/XXX/HEPTools/ROOT/root/tree/dataframe/src/RLoopManager.cxx:867 in void ROOT::Detail::RDF::RLoopManager::Jit()>: Just-in-time compilation phase completed in 30.630728 seconds.

But I am unsure whether I can do anything about that, given that the code is in Python. Furthermore, the performance gap can be made even more apparent by adding a break statement in the tree code, which cuts its run time further.

Let me know if you have any thoughts, and I am attaching the files below.

group_by_rdf.py (1.7 KB)
group_by_tree.py (1.5 KB)

P.S. I cannot seem to upload a bash file, so here is the code for run_py.sh:

#!/bin/bash
echo "Time for Tree:"
time python group_by_tree.py

echo "Time for RDF:"
time python group_by_rdf.py
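
For reference, the counting described in this thread (events per angle bin, split by number of tracks) can also be expressed as a single two-dimensional count instead of one filter per bin and per track multiplicity. The sketch below uses only NumPy on randomly generated data, so the arrays and the seed are assumptions made for illustration and not the poster's dataset. In RDataFrame the analogous approach would presumably be one Histo2D booking on the "angle" and "nTracks" columns, but that variant was not proposed or benchmarked in this thread.

```python
import numpy as np

# Fake data standing in for the "angle" and "nTracks" columns (assumption:
# the seed and the sample size are arbitrary choices for this sketch).
rng = np.random.default_rng(42)
n_events = 100_000
angle = rng.uniform(0.0, 3.14, n_events)
n_tracks = rng.integers(0, 3, n_events)  # 0, 1 or 2, like gRandom->Integer(3)

# Same angle binning as in the thread, plus one bin per track multiplicity.
angle_edges = np.linspace(0.0, 3.14, 5)
track_edges = np.array([-0.5, 0.5, 1.5, 2.5])

# counts[i, k] = number of events in angle bin i with k tracks.
counts, _, _ = np.histogram2d(angle, n_tracks, bins=[angle_edges, track_edges])

count_zero_track = counts[:, 0]
count_one_track = counts[:, 1]
count_two_tracks = counts[:, 2]

for i in range(len(angle_edges) - 1):
    print(f"bin {i}: {angle_edges[i]:.3f} - {angle_edges[i + 1]:.3f}", counts[i])
```

Here counts[:, 2] plays the role of count_two_tracks in the scripts above, and all bins are filled in one pass over the arrays.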

system · January 31, 2025, 12:37pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.