ROOT version: 6.32.10
Dear experts,
I am writing a function that processes an RDataFrame by category, applies some filtering, and saves the output to a new ROOT file. For context, my ROOT file contains duplicates of the same physical event, and the goal is to process each “chunk” of duplicates to select the appropriate candidate for the event.
The goal would be to replicate the functionality of pandas’s groupby().
To be explicit, I would like to iterate over the unique EVENTNUMBER values in my file, take all the entries sharing the same EVENTNUMBER, then keep only the entry with the highest value of a given column (or discard the whole group if that value is not high enough).
This operation seems quite easy and takes just a couple of lines in pandas or most Python libraries; however, RDataFrame seems a bit less flexible here, or I am not finding the right strategy.
I would greatly appreciate some help.
Here is a code snippet to generate a dummy dataframe:
#include <ROOT/RDataFrame.hxx>
#include <vector>
#include <iostream>
#include <random>

int test() {
    // Number of rows (each physical event is repeated 5 times)
    const int n_events = 100;
    std::vector<int> event_numbers;
    std::vector<double> values;

    // Random number generator
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(0.0, 1.0);

    // Generate event numbers with repetitions
    for (int i = 0; i < n_events; ++i) {
        event_numbers.push_back(i / 5); // Same event number for 5 consecutive rows
        values.push_back(dis(gen));
    }

    // Create a ROOT DataFrame with n_events empty rows, then define the two columns
    ROOT::RDataFrame rdf(n_events);
    auto df = rdf.Define("EVENTNUMBER", [event_numbers](ULong64_t i) { return event_numbers[i]; }, {"rdfentry_"})
                 .Define("VALUE", [values](ULong64_t i) { return values[i]; }, {"rdfentry_"});

    // Display the first few rows
    df.Display({"EVENTNUMBER", "VALUE"}, 15)->Print();
    return 0;
}
The result of Display() gives:
+-----+-------------+----------+
| Row | EVENTNUMBER | VALUE |
+-----+-------------+----------+
| 0 | 0 | 0.963746 |
+-----+-------------+----------+
| 1 | 0 | 0.812817 |
+-----+-------------+----------+
| 2 | 0 | 0.163468 |
+-----+-------------+----------+
| 3 | 0 | 0.221299 |
+-----+-------------+----------+
| 4 | 0 | 0.586073 |
+-----+-------------+----------+
| 5 | 1 | 0.886833 |
+-----+-------------+----------+
| 6 | 1 | 0.389196 |
+-----+-------------+----------+
| 7 | 1 | 0.415763 |
+-----+-------------+----------+
| 8 | 1 | 0.843191 |
+-----+-------------+----------+
| 9 | 1 | 0.690809 |
+-----+-------------+----------+
| 10 | 2 | 0.373107 |
+-----+-------------+----------+
| 11 | 2 | 0.019321 |
+-----+-------------+----------+
| 12 | 2 | 0.777981 |
+-----+-------------+----------+
| 13 | 2 | 0.087130 |
+-----+-------------+----------+
| 14 | 2 | 0.853490 |
+-----+-------------+----------+
I would like to keep only the row with the highest VALUE among those sharing the same EVENTNUMBER, e.g. rows 0, 5, 14, etc. in this example.
If possible, I would like to run multithreaded, which means I cannot rely on rdfentry_ (the row number).
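To pin down the selection I am after, here is a minimal sketch in plain C++ on in-memory vectors. It is not RDataFrame code. The helper name `select_best_rows` and its `threshold` parameter are hypothetical, introduced only for illustration; the "keep if not below the threshold" rule is taken from my attempt further down.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of ROOT): returns the sorted indices of the rows to keep.
std::vector<std::size_t> select_best_rows(const std::vector<int>& event_numbers,
                                          const std::vector<double>& values,
                                          double threshold)
{
    assert(event_numbers.size() == values.size());

    // For each EVENTNUMBER, remember the row index holding the highest VALUE seen so far.
    std::unordered_map<int, std::size_t> best_row;
    for (std::size_t i = 0; i < values.size(); ++i) {
        auto [it, inserted] = best_row.try_emplace(event_numbers[i], i);
        if (!inserted && values[i] > values[it->second]) {
            it->second = i;
        }
    }

    // Keep the best row of a group only if its VALUE is not below the threshold.
    std::vector<std::size_t> kept;
    for (const auto& entry : best_row) {
        if (values[entry.second] >= threshold) {
            kept.push_back(entry.second);
        }
    }
    std::sort(kept.begin(), kept.end());
    return kept;
}
```

Applied to the 15 rows displayed above with a threshold of 0, this keeps rows 0, 5 and 14. It relies on row indices, which is exactly what I would like to avoid in a multithreaded run.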
Up to now I have tried the following:
// Requires <format> (C++20), <string> and <unordered_set>
// Retrieve the event numbers
auto col_evtnum = df.Take<int>("EVENTNUMBER"); // RResultPtr<std::vector<int>>
// Convert the vector to an unordered set to get the list of unique event numbers
std::unordered_set<int> evt_numbers(col_evtnum->begin(), col_evtnum->end());
// Loop over the unique event numbers
for (auto EVTNUM : evt_numbers) {
    // Group the duplicates sharing the same EVENTNUMBER
    std::string filter_evt_num = std::format("EVENTNUMBER == {}", EVTNUM);
    auto df2 = df.Filter(filter_evt_num);
    // Highest VALUE among the candidates (GetValue() triggers an event loop)
    auto max_val = df2.Max("VALUE").GetValue();
    // Capture max_val by value: the filter is lazy and may run after max_val goes out of scope
    auto df3 = df2.Filter(
        [max_val](double val) {
            if (val < 0.9) { return false; }  // if below the threshold, discard everything
            else { return val == max_val; }   // else keep only the one with the highest value
        }, {"VALUE"}
    );
}
The point is that I can manage to do what I want for each subset, but I can’t bring everything together in the same original dataframe. Thanks for helping me find a solution!
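For reference, one shape a combined selection might take is two passes: first collect the highest VALUE per EVENTNUMBER into a map, then apply a row-wise predicate that only needs the row's own EVENTNUMBER and VALUE, so no row number is involved. The sketch below is plain C++ with hypothetical names (`max_value_per_event`, `make_keep_row`). The predicate has the `(int, double) -> bool` shape of a callable that Filter accepts with the columns {"EVENTNUMBER", "VALUE"}, but how the map would be filled from the dataframe is left open here and is untested with ROOT.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Pass 1 (hypothetical helper): highest VALUE found for each EVENTNUMBER.
std::unordered_map<int, double> max_value_per_event(const std::vector<int>& event_numbers,
                                                    const std::vector<double>& values)
{
    assert(event_numbers.size() == values.size());
    std::unordered_map<int, double> max_val;
    for (std::size_t i = 0; i < values.size(); ++i) {
        auto [it, inserted] = max_val.try_emplace(event_numbers[i], values[i]);
        if (!inserted && values[i] > it->second) {
            it->second = values[i];
        }
    }
    return max_val;
}

// Pass 2 (hypothetical helper): builds a predicate that decides row by row,
// using only that row's EVENTNUMBER and VALUE. The map is copied into the
// lambda, so the predicate stays valid after the caller's map is gone.
auto make_keep_row(std::unordered_map<int, double> max_val, double threshold)
{
    return [max_val = std::move(max_val), threshold](int evt, double val) {
        if (val < threshold) {
            return false; // below the threshold: discard
        }
        auto it = max_val.find(evt);
        return it != max_val.end() && val == it->second; // keep only the group maximum
    };
}
```

Note that if two rows of the same event share exactly the same maximal VALUE, this predicate keeps both, so an extra tie-break would be needed in that case.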