Process RDataFrame by event category

leadreyfus · February 26, 2025, 5:19pm

ROOT version: 6.32.10

Dear experts,

I am writing a function that processes an RDataFrame by category, applies some filtering, and saves the output to a new ROOT file. For context, my ROOT file contains duplicates of the same physical event, and the goal is to process each “chunk” of duplicates to select the appropriate candidate for the event.

The goal is to replicate the functionality of pandas’s groupby().
To be explicit, I would like to iterate over the unique EVENTNUMBER values in my file, take all rows sharing the same EVENTNUMBER, and keep only the row with the highest value of a given column (or discard all of them if that value is not high enough).
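
To make the intended selection concrete, here is a minimal sketch of that rule in plain C++, without RDataFrame. The helper name `select_best_rows` and the `threshold` parameter are made up for illustration and are not ROOT API. The sketch assumes both columns are already available as in-memory vectors of equal length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not ROOT API): returns the sorted indices of the rows to keep.
// For each EVENTNUMBER, the row with the largest VALUE is kept, provided that
// value is at least `threshold`; otherwise the whole group is discarded.
std::vector<std::size_t> select_best_rows(const std::vector<int>& event_numbers,
                                          const std::vector<double>& values,
                                          double threshold) {
    assert(event_numbers.size() == values.size());

    // EVENTNUMBER -> index of the row holding the largest VALUE seen so far
    std::unordered_map<int, std::size_t> best_row;
    for (std::size_t i = 0; i < values.size(); ++i) {
        // emplace() leaves an existing entry untouched and reports that via .second
        auto res = best_row.emplace(event_numbers[i], i);
        if (!res.second && values[i] > values[res.first->second]) {
            res.first->second = i;
        }
    }

    // Apply the threshold to each group's best row
    std::vector<std::size_t> kept;
    for (const auto& entry : best_row) {
        if (values[entry.second] >= threshold) {
            kept.push_back(entry.second);
        }
    }
    std::sort(kept.begin(), kept.end());
    return kept;
}
```

For example, with event numbers {0, 0, 0, 1, 1, 2}, values {0.96, 0.81, 0.16, 0.88, 0.38, 0.95} and a threshold of 0.9, this keeps rows 0 and 5. Event 1 is dropped entirely because its best value, 0.88, is below the threshold. On ties, the first row encountered wins.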

This operation is easy and takes just a couple of lines in pandas and most Python libraries, but RDataFrame seems less flexible here, or I am not finding the right strategy.
I would greatly appreciate some help.

Here is a code snippet to generate a dummy dataframe:

#include <ROOT/RDataFrame.hxx>

#include <vector>
#include <iostream>
#include <random>

int test() {
    // Number of events
    const int n_events = 100;

    std::vector<int> event_numbers;
    std::vector<double> values;

    // Random number generator
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(0.0, 1.0);

    // Generate event numbers with repetitions
    for (int i = 0; i < n_events; ++i) {
        event_numbers.push_back(i / 5); // Repeating event numbers every 5 rows
        values.push_back(dis(gen));
    }

    // Create a ROOT DataFrame
    ROOT::RDataFrame rdf(n_events);
    auto df = rdf.Define("EVENTNUMBER", [event_numbers](ULong64_t i) { return event_numbers[i]; }, {"rdfentry_"})
                 .Define("VALUE", [values](ULong64_t i) { return values[i]; }, {"rdfentry_"});


    // Display the first few rows
    df.Display({"EVENTNUMBER", "VALUE"}, 15)->Print();

    return 0;
}

The result of Display() gives:

+-----+-------------+----------+
| Row | EVENTNUMBER | VALUE    | 
+-----+-------------+----------+
| 0   | 0           | 0.963746 | 
+-----+-------------+----------+
| 1   | 0           | 0.812817 | 
+-----+-------------+----------+
| 2   | 0           | 0.163468 | 
+-----+-------------+----------+
| 3   | 0           | 0.221299 | 
+-----+-------------+----------+
| 4   | 0           | 0.586073 | 
+-----+-------------+----------+
| 5   | 1           | 0.886833 | 
+-----+-------------+----------+
| 6   | 1           | 0.389196 | 
+-----+-------------+----------+
| 7   | 1           | 0.415763 | 
+-----+-------------+----------+
| 8   | 1           | 0.843191 | 
+-----+-------------+----------+
| 9   | 1           | 0.690809 | 
+-----+-------------+----------+
| 10  | 2           | 0.373107 | 
+-----+-------------+----------+
| 11  | 2           | 0.019321 | 
+-----+-------------+----------+
| 12  | 2           | 0.777981 | 
+-----+-------------+----------+
| 13  | 2           | 0.087130 | 
+-----+-------------+----------+
| 14  | 2           | 0.853490 | 
+-----+-------------+----------+

I would like to keep only the row with the highest VALUE among rows sharing the same EVENTNUMBER, e.g. rows 0, 5, 14, etc. in this example.
If possible, I would like to run multithreaded, which rules out relying on rdfentry_ (the row number).

So far I have tried the following:

    // Needs <format> (C++20), <string> and <unordered_set>

    // Retrieve the event numbers (the result wraps a std::vector<int>)
    auto col_evtnum = df.Take<int>("EVENTNUMBER");

    // Convert the vector to an unordered set to get the unique event numbers
    std::unordered_set<int> evt_numbers(col_evtnum->begin(), col_evtnum->end());

    // Loop over the unique event numbers
    for (auto EVTNUM : evt_numbers) {
        // Group the duplicates sharing the same EVENTNUMBER
        std::string filter_evt_num = std::format("EVENTNUMBER == {}", EVTNUM);
        auto df2 = df.Filter(filter_evt_num);

        // Retrieve the highest VALUE among the candidates (GetValue() runs an event loop)
        auto max_val = df2.Max("VALUE").GetValue();

        // Capture max_val by value: df3 is lazy and may run after max_val goes out of scope
        auto df3 = df2.Filter(
            [max_val](double val) {
                if (val < 0.9) { return false; } // below the threshold: discard everything
                else { return val == max_val; }  // otherwise keep only the highest value
            }, {"VALUE"}
        );
    }

The problem is that I can do what I want for each subset, but I can’t bring everything back together into a single dataframe. Thanks for helping me find a solution!

mczurylo · February 27, 2025, 9:21am

Hi @leadreyfus,

Unfortunately, we don’t have a native way of doing GroupBy in RDataFrame yet. Your workaround is probably the best direction to take; you can make a custom action out of it. See a similar issue and discussion here.
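
One way to picture such a custom action, independently of the ROOT interface, is a two-pass scheme: each processing slot fills its own partial map of EVENTNUMBER to the largest VALUE it has seen, the partial maps are merged once all rows are processed, and a second pass keeps only the rows matching the merged result. The sketch below is plain C++ with hypothetical names (`BestMap`, `fill`, `merge`, `keep_row`); none of them are ROOT API. It simulates the slots as separate maps rather than real threads, and it assumes the selection does not need the row number.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical names for illustration only; none of this is ROOT API.
using BestMap = std::unordered_map<int, double>; // EVENTNUMBER -> largest VALUE

// First pass: called once per row by the slot that processes it.
// It only touches that slot's own map, so no locking is needed.
void fill(BestMap& slot_map, int event_number, double value) {
    auto res = slot_map.emplace(event_number, value);
    if (!res.second && value > res.first->second) {
        res.first->second = value;
    }
}

// After the first pass: combine the per-slot maps into one.
BestMap merge(const std::vector<BestMap>& slot_maps) {
    assert(!slot_maps.empty());
    BestMap merged;
    for (const auto& slot_map : slot_maps) {
        for (const auto& kv : slot_map) {
            fill(merged, kv.first, kv.second);
        }
    }
    return merged;
}

// Second pass: row predicate. A row survives if it holds its group's
// largest VALUE and that value reaches the threshold.
bool keep_row(const BestMap& best, int event_number, double value, double threshold) {
    auto it = best.find(event_number);
    return it != best.end() && value >= threshold && value == it->second;
}
```

In an RDataFrame setting, the first pass would presumably be driven per slot (for instance from a custom action or ForeachSlot) and the second pass would be an ordinary Filter on EVENTNUMBER and VALUE using the merged map. Note that rows tying for the largest VALUE within a group would all pass `keep_row`.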

Having said that, I would like to add that this feature has been requested by users on various occasions and is on our to-do list. More user requests will raise the priority of this item, so thank you for reporting your use case!

Cheers,
Marta
