Compare two RDataFrame branch/column values

Hi,

I’m trying to find out whether the value of a column/branch for an event in one dataset is also present in the same column/branch of another dataset.

Currently I have the following C++ declared in pyROOT (as an example):

void multiple_candidates(ROOT::RDataFrame ws_df, ROOT::RDataFrame rs_df){
    ROOT::RDF::RNode ws_node = ws_df;
    ROOT::RDF::RNode rs_node = rs_df;
    ws_node = ws_node.Define("unique_event_number",[](const unsigned int& rn, const unsigned long long& en) -> unsigned long long {return 10e+9*rn+en;},{"runNumber","eventNumber"});
    rs_node = rs_node.Define("unique_event_number",[](const unsigned int& rn, const unsigned long long& en) -> unsigned long long {return 10e+9*rn+en;},{"runNumber","eventNumber"});
    std::vector<unsigned long long> uevs_ws;
    std::vector<unsigned long long> uevs_rs;

    ws_node.Foreach([&uevs_ws](const unsigned long long& uev){
    uevs_ws.push_back(uev);
    }, {"unique_event_number"});
    std::cout << "Length of WS uev vector: " << uevs_ws.size() << std::endl;

    rs_node.Foreach([&uevs_rs](const unsigned long long& uev){
    uevs_rs.push_back(uev);
    }, {"unique_event_number"});
    std::cout << "Length of RS uev vector: " << uevs_rs.size() << std::endl;
    //sort the uev vectors
    std::sort(uevs_ws.begin(), uevs_ws.end());
    std::sort(uevs_rs.begin(), uevs_rs.end());
    //find unique event numbers that are in both the RS and WS.
    std::vector<unsigned long long> common_uevs;
    std::set_intersection(uevs_ws.begin(), uevs_ws.end(), uevs_rs.begin(), uevs_rs.end(), std::back_inserter(common_uevs));
    //check if eventnumber is in common_uevs
    //This is what I want to do. This doesn't work but perhaps it will with a lambda function???
    ws_node = ws_node.Define("in_both","std::binary_search(unique_event_number,common_uevs)");
    rs_node = rs_node.Define("in_both","std::binary_search(unique_event_number,common_uevs)");
}

It doesn’t seem to work using the method above but can it work with a lambda function? Is this the most efficient way to compare two dataframes?


ROOT Version: v6.26.06


Hi Jcob,

The first solution that comes to mind is the following:

import ROOT
import numpy as np

d1 = ROOT.RDataFrame(17).Define("x", "gRandom->Integer(30)")
d2 = ROOT.RDataFrame(42).Define("y", "gRandom->Integer(30)")

arr1 = d1.AsNumpy()['x']
print(arr1)
# [29  4  8 28  6 14 28 22 16 22 22 19  9 24 15  5 14]
arr2 = d2.AsNumpy()['y']
print(arr2)
# [11  6  6  0 10  5 28 17 26 19 14 16  5  8  3  1 19 21 19 21  2 20  3  3 15  6  8  2  3 16  2  8 26  6 12  4 12 20  5 13  1  9]

common_elements = np.intersect1d(arr1, arr2)
print(common_elements)
# [ 4  5  6  8  9 14 15 16 19 28]
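
For comparison, the same sorted, de-duplicated intersection can be sketched in C++ with the standard library alone. This is only an illustration: `sorted_unique_intersection` is a hypothetical helper name, not a ROOT or NumPy function, and it assumes the column values have already been collected into `std::vector`s.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical helper (not part of ROOT or NumPy): returns the sorted, unique
// values present in both inputs, similar in spirit to np.intersect1d.
std::vector<unsigned long long> sorted_unique_intersection(
    std::vector<unsigned long long> a, std::vector<unsigned long long> b)
{
    // The inputs are taken by value, so the caller's vectors are left untouched.
    auto sort_unique = [](std::vector<unsigned long long>& v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
    };
    sort_unique(a);
    sort_unique(b);

    // std::set_intersection requires both ranges to be sorted.
    std::vector<unsigned long long> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common;
}
```

Calling `sorted_unique_intersection(values_1, values_2)` on two vectors of values returns a new sorted vector, which can then be searched with `std::binary_search`.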

Hi @jcob ,

  // This is what I want to do. This doesn't work but perhaps it will with a
  // lambda function???
 ws_node = ws_node.Define("in_both","std::binary_search(unique_event_number,common_uevs)");
 rs_node = rs_node.Define("in_both","std::binary_search(unique_event_number,common_uevs)");

Yes it should work with a lambda function that captures common_uevs, e.g. (untested but it should give you the idea):

df.Define("in_both", [&common_uevs](unsigned long long unique_event_number) { return isInVector(unique_event_number, common_uevs); }, {"unique_event_number"});
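
`isInVector` above is a placeholder, not an existing ROOT or standard library function. A minimal sketch of what it could look like, assuming the vector has already been sorted (as it is after `std::set_intersection`), is shown below. Note that `std::binary_search` takes an iterator pair followed by the value, which is one reason the string expression in the original question cannot work as written; the other is that a string expression cannot see a local C++ variable such as `common_uevs`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper, named after the placeholder used in the reply.
// Assumes sorted_values is sorted in ascending order, which
// std::binary_search requires in order to give a meaningful answer.
bool isInVector(unsigned long long value,
                const std::vector<unsigned long long>& sorted_values)
{
    return std::binary_search(sorted_values.begin(), sorted_values.end(), value);
}
```

With a sorted vector holding 3, 7 and 42, `isInVector(7, v)` is true and `isInVector(8, v)` is false.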

Also, as @FoxWise suggests, you can just use Take instead of manually filling a vector in a Foreach.

Cheers,
Enrico

@eguiraud Thanks, I forgot about lambda captures. The column definition now works. Kind of unrelated but how would I return a dataframe or something that can be manipulated like a dataframe (RNode or something else)? If I return an RNode, the object that the function returns has no entries.

@FoxWise, thanks for the suggestion. I don’t think this method is too practical for my case since I’m looking at datasets with ~100M events so I think C++ would be more efficient. I will bear this in mind if I need to run some quick comparisons in python.

I’m not sure what you mean. Passing around RNodes is how you pass around RDF objects.

@eguiraud , sorry I figured out that there is another issue.

In my case I have the following C++ function declared:

ROOT::RDF::RNode multiple_candidates(ROOT::RDataFrame ws_df, ROOT::RDataFrame rs_df,std::string return_str="WS"){
    ROOT::RDF::RNode ws_node = ws_df;
    ROOT::RDF::RNode rs_node = rs_df;
    ws_node = ws_node.Define("unique_event_number",[](const unsigned int& rn, const unsigned long long& en) -> unsigned long long {return 10e+9*rn+en;},{"runNumber","eventNumber"});
    rs_node = rs_node.Define("unique_event_number",[](const unsigned int& rn, const unsigned long long& en) -> unsigned long long {return 10e+9*rn+en;},{"runNumber","eventNumber"});
    std::vector<unsigned long long> uevs_ws;
    std::vector<unsigned long long> uevs_rs;

    ws_node.Foreach([&uevs_ws](const unsigned long long& uev){
    uevs_ws.push_back(uev);
    }, {"unique_event_number"});
    std::cout << "Length of WS uev vector: " << uevs_ws.size() << std::endl;

    rs_node.Foreach([&uevs_rs](const unsigned long long& uev){
    uevs_rs.push_back(uev);
    }, {"unique_event_number"});
    std::cout << "Length of RS uev vector: " << uevs_rs.size() << std::endl;
    //sort the uev vectors
    std::sort(uevs_ws.begin(), uevs_ws.end());
    std::sort(uevs_rs.begin(), uevs_rs.end());
    //find unique event numbers that are in both the RS and WS.
    std::vector<unsigned long long> common_uevs;
    std::set_intersection(uevs_ws.begin(), uevs_ws.end(), uevs_rs.begin(), uevs_rs.end(), std::back_inserter(common_uevs));
    std::cout << "Number of common uevs between RS and WS: " << common_uevs.size() << std::endl;
    std::cout << "Testing binary_search" << std::endl;
    for(int i = 0; i < 5;i++){
       std::cout << std::binary_search(common_uevs.begin(),common_uevs.end(),uevs_rs[i]) << std::endl;
    }
    ws_node = ws_node.Define("in_both",[&common_uevs](const unsigned long long& uev) -> bool {return std::binary_search(common_uevs.begin(),common_uevs.end(),uev);},{"unique_event_number"});
    rs_node = rs_node.Define("in_both",[&common_uevs](const unsigned long long& uev) -> bool {return std::binary_search(common_uevs.begin(),common_uevs.end(),uev);},{"unique_event_number"});
    if(return_str == "WS"){
        return ws_node;
    }
    return rs_node;
}

When I filter events from one of the returned dataframes, e.g.

test_rs = ROOT.multiple_candidates(data_frame_ws,data_frame_rs,"RS")
print(test_rs.Filter("in_both == 0").Count().GetValue()) # Filters out all events and returns 0
print(test_rs.Filter("in_both == 1").Count().GetValue()) # All events pass cuts and number of entries is equal to original

The above implies that the lambda function returned true for all entries.
However, the small for loop I added to check that std::binary_search behaves correctly prints some false values, so the values returned by the lambda function are clearly wrong.

Am I doing something wrong in the lambda function or is this a known issue with RDataFrame?

No, Filter("in_both == 0/1").Count() is expected to return the number of events for which in_both is equal to 0/1. I’m afraid this needs debugging. For example, you can put a std::cout inside the lambda expression you pass to the Define (and/or the Filter) to see what it sees.

Cheers,
Enrico

After adding a print statement to the lambda expression, I see that std::binary_search always returns 1. I then added an if statement that fires when std::binary_search returns 1, and inside it I use std::find to get the index of the element it claims to have found. However, this if statement never fired!
Removing the if statement caused a crash when returning the value of the index from std::find.

I think I noticed a problem:

[&common_uevs](const unsigned long long &uev) -> bool {
    return std::binary_search(common_uevs.begin(), common_uevs.end(), uev);
}

Here you take a reference to common_uevs that will be used by the lambda expression, but the lambda expression will only run when the event loop runs, i.e. after multiple_candidates has returned and common_uevs has gone out of scope!

Quickfix: use [common_uevs] instead. Does that help?
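
The lifetime problem can be reproduced without ROOT. The sketch below is an illustration only: `make_membership_test` is a hypothetical stand-in for any function that builds a callable now and lets someone else call it later, which is what happens with the lambda passed to Define.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical stand-in for a function that builds a predicate now and lets
// the caller run it later, after this function has returned.
std::function<bool(unsigned long long)> make_membership_test()
{
    std::vector<unsigned long long> common_uevs{3, 7, 42}; // local, already sorted

    // Capturing [&common_uevs] here would leave the returned callable holding
    // a reference to a destroyed vector (undefined behaviour when it is called).
    // Capturing by value copies the vector into the lambda, so it stays valid.
    return [common_uevs](unsigned long long uev) {
        return std::binary_search(common_uevs.begin(), common_uevs.end(), uev);
    };
}
```

Capturing by value copies the vector once. If that copy is a concern for a very large vector, a C++14 init-capture such as `[common_uevs = std::move(common_uevs)]` moves the local vector into the lambda instead of copying it.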

Ah this makes sense!
Your fix works perfectly. Thank you so much!
