Filtering 2 dataframes at once

seley · March 14, 2025, 10:32am

Hi,

I’m currently trying to filter two dataframes at once and use Snapshot to write an ntuple from each dataframe that contains only the events that appear in both.

Note: These two dataframes come from the same sample, which was processed with two slightly different software versions, so they should contain the same events before filtering.

I believe my current version works but seems a bit inefficient. I was wondering what I could change about my method to make it more efficient.

I’m essentially taking in 2 files (both the same sample, created using different software versions), applying the same baseline cuts, and defining the same variables. I then filter dataframe a to find events where a track charge is misidentified, and dataframe b to find events where the track charge is correctly identified. My goal is to output ntuples for both dataframes that only contain the events that appear in both filtered dataframes.

I have attached my code below.
DF_Comparison.py (2.7 KB)

Thank you in advance
Sinead

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

StephanH · March 14, 2025, 1:32pm

Hello @seley,

RDataFrame is designed to run in a thread-safe context, so there cannot be any communication between two running dataframes.
I don’t see a better solution than running the dataframes twice, once to get the list of events that pass, and once to apply an event filter. The only “not so great” thing I see is that the numpy array gets converted into a string of the style of "t_p_gev == 1 || t_p_gev == 2 || ...", which can be quite a long expression to compile and test. If the whole thing runs fast enough, or if you won’t have to do this very often, I would however just keep it as it is.
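As a rough sketch of that two-pass idea in plain Python: the first pass collects the event numbers that survive each selection (for instance via `AsNumpy`), and the second pass filters on the events common to both. The helper names, the `eventNumber` column, and the event numbers below are made up for illustration; they are not taken from the attached script. The sketch also shows why the string expression grows with every selected event.

```python
def common_events(events_a, events_b):
    """Return the sorted event numbers present in both selections."""
    return sorted(set(events_a) & set(events_b))


def build_filter_expression(column, events):
    """Build the 'col == 1 || col == 2 || ...' style filter string."""
    return " || ".join(f"{column} == {e}" for e in events)


# Hypothetical event numbers that passed the cuts in each dataframe
passed_a = [101, 205, 333, 478]
passed_b = [205, 478, 512]

shared = common_events(passed_a, passed_b)
expression = build_filter_expression("eventNumber", shared)

print(shared)
print(expression)
```

The expression gains one comparison per selected event, so with many thousands of events it becomes a very long string to compile and evaluate.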

That being said, in case you want to make the event filter easier to compile, you could write it in C++ and attach it to the computation graph. For this I use AsRNode, which is described at the very bottom of the Python interface box in the docs:

import ROOT

ROOT.gInterpreter.Declare("""
#include <algorithm>
#include <vector>
#include "ROOT/RDataFrame.hxx"

ROOT::RDF::RNode AttachFilter(ROOT::RDF::RNode df, const std::vector<unsigned long long> & inputEvents) {
    // Work on a copy so the caller's vector is left untouched
    std::vector<unsigned long long> events = inputEvents;
    // Sort the vector first, so we can use the faster binary_search.
    // You can also sort in numpy and leave out this step.
    std::sort(events.begin(), events.end());
    // The lambda's argument type has to match the type of the column it reads
    auto filter = [events](unsigned long long e) {
      return std::binary_search(events.begin(), events.end(), e);
    };
    return df.Filter(filter, {"x"}); // Replace "x" with the name of the event number column
}
""")

Now you could use that in a new RDF as event filter:

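# Create the RDataFrame
df = ROOT.RDataFrame("myTree", "myFile.root")
# Convert from numpy to a vector; the element type must match the one AttachFilter expects
vec = ROOT.vector("unsigned long long")(numpy_array)
# Cast the RDataFrame head node to an RNode and attach the filter
df_filtered = ROOT.AttachFilter(ROOT.RDF.AsRNode(df), vec)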

You can probably write this event filter in a few different ways, but a binary search through a vector of eligible events seemed quick and simple.
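To illustrate what the binary search does, here is a small pure-Python analogue using the standard `bisect` module. It is only a sketch of the lookup logic and not a replacement for the C++ filter; `make_event_filter` and the event numbers are hypothetical.

```python
from bisect import bisect_left


def make_event_filter(events):
    """Return a predicate that tests membership by binary search."""
    # Sort once up front, like std::sort in the C++ version
    sorted_events = sorted(events)

    def passes(event):
        # Find the leftmost position where `event` could be inserted
        i = bisect_left(sorted_events, event)
        # The event is eligible only if that position already holds it
        return i < len(sorted_events) and sorted_events[i] == event

    return passes


# Hypothetical list of eligible events, deliberately unsorted
passes = make_event_filter([478, 205, 333])
kept = [e for e in [101, 205, 333, 400, 478] if passes(e)]
print(kept)
```

Each lookup takes a number of steps that grows with the logarithm of the list length, instead of one comparison per eligible event as in the long string expression.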

seley · March 14, 2025, 4:28pm

Hi,
Thanks - I’ve adapted this for the datatype I’m filtering for and it worked nicely
