Dear Experts,
I have an RDataFrame that contains multiple candidates per event, i.e. entries that share the same event number (eventNumber). I would like to select among these candidates based on the value of another variable in the tree. I was trying to follow @eguiraud's example in "A thread-safe stateful Filter for RDataFrame · GitHub", extending his approach to compare the values of another variable and keep the entry with the highest value of that variable.

To create an example: the following builds an RDataFrame whose entries have non-unique IDs (“category”), while each entry has a unique value of the variable “x”:
import ROOT
ROOT.gInterpreter.Declare(
"""
#include <vector>
#include <algorithm>
#include <map>
#include <random>
std::random_device rd;
std::mt19937 e{rd()}; // or std::default_random_engine e{rd()};
std::normal_distribution<float> gaus(10., 1.);
float generateGaussNumber(){
    return gaus(e);
}
""")
df = ROOT.RDataFrame(20).Define("category", "int(rdfentry_ % 10)").Define("x", "generateGaussNumber()")
print(df.AsNumpy())
Which gives the following RDF (category, x):
array([[ 0. , 11.11326599],
[ 1. , 10.15282154],
[ 2. , 10.73499107],
[ 3. , 10.45162868],
[ 4. , 10.2524004 ],
[ 5. , 11.10525703],
[ 6. , 9.92098904],
[ 7. , 11.83233261],
[ 8. , 11.15157604],
[ 9. , 9.56202412],
[ 0. , 8.48303699],
[ 1. , 9.08411694],
[ 2. , 9.12842464],
[ 3. , 10.6169529 ],
[ 4. , 10.09912395],
[ 5. , 12.2478199 ],
[ 6. , 8.52565002],
[ 7. , 12.2470808 ],
[ 8. , 9.79731941],
[ 9. , 11.51276779]])
By defining the ElementTracker:
ROOT.gInterpreter.Declare("""
// A thread-safe stateful filter, adapted from the example linked above, that is meant
// to keep, for each "category" (id), the entry with the highest value.
// The map is protected with gCoreMutex, which is a read-write lock.
#include <unordered_map>
#include "TVirtualRWMutex.h"
class ElementTracker {
public:
    // Returns true if the given value is the highest seen so far among elements with the same id
    bool operator()(int id, float value) {
        // The map may be modified below, so take the exclusive (write) lock rather than a read lock
        R__WRITE_LOCKGUARD(ROOT::gCoreMutex);
        // Check if the current element has a higher value than the stored value
        if (highestValues.find(id) == highestValues.end() || value > highestValues[id]) {
            highestValues[id] = value;
            return true;
        }
        return value == highestValues[id];
    }
private:
    // Map to store the highest value seen so far for each id
    std::unordered_map<int, float> highestValues;
};
""")
And by calling:
cols = ROOT.std.vector['string'](["category","x"])
df_with_unique_categories = df.Filter(ROOT.ElementTracker(), cols)
I would like to reduce the above RDF to the following one:
array([[ 0. , 11.11326599],
[ 1. , 10.15282154],
[ 2. , 10.73499107],
[ 4. , 10.2524004 ],
[ 6. , 9.92098904],
[ 8. , 11.15157604],
[ 3. , 10.6169529 ],
[ 5. , 12.2478199 ],
[ 7. , 12.2470808 ],
[ 9. , 11.51276779]])
i.e. I’d like to keep, for each unique “category”, the entry with the highest “x”. Any idea how I could improve my function to do this? Or would you suggest a different method than Filter?
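To make the intended selection concrete, here is a minimal pure-Python sketch of the logic, outside of RDataFrame. The names keep_highest_per_category and rows are only illustrative and are not part of ROOT. Note that it needs two passes over the data (one to find the highest “x” per “category”, one to select), which is presumably why a single-pass Filter has to decide on the first occurrence before it has seen the later ones.

```python
def keep_highest_per_category(rows):
    """Return the rows whose x is the maximum within their category.

    rows: iterable of (category, x) pairs, in entry order.
    """
    rows = list(rows)
    # First pass: find the highest x for each category.
    highest = {}
    for category, x in rows:
        if category not in highest or x > highest[category]:
            highest[category] = x
    # Second pass: keep only the rows that carry that highest x.
    # Entry order is preserved; exact ties would all be kept.
    return [(category, x) for category, x in rows if x == highest[category]]


# A few (category, x) pairs taken from the table above.
rows = [
    (0, 11.11326599), (3, 10.45162868), (5, 11.10525703),
    (0, 8.48303699), (3, 10.6169529), (5, 12.2478199),
]
selected = keep_highest_per_category(rows)
print(selected)
```

For these six entries the sketch keeps the first entry of category 0 and the second entries of categories 3 and 5, which matches what I would expect from the reduced table.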
Thanks,
Davide
ROOT Version: 6.30/04