I’m currently trying to filter two dataframes at once and use Snapshot to write an ntuple from each dataframe that contains only the events that appear in both.
Note: These two dataframes come from the same sample, which was created using two slightly different software versions, so they should contain the same events before filtering.
I believe my current version works, but it seems a bit inefficient. What could I change about my method to make it more efficient?
I’m essentially taking in two files (both the same sample, created using different software versions), applying the same baseline cuts, and defining the same variables. I then filter dataframe a to find events where a track charge is misidentified, and dataframe b to find events where the track charge is correctly identified. My goal is to output ntuples for both dataframes that contain only the events that appear in both filtered dataframes.
ROOT Version: Not Provided Platform: Not Provided Compiler: Not Provided
RDataFrame is designed to run in a thread-safe context, so two running dataframes cannot communicate with each other.
I don’t see a better solution than running the dataframes twice: once to get the list of events that pass, and once to apply an event filter. The only “not so great” thing I see is that the numpy array gets converted into a string of the form "t_p_gev == 1 || t_p_gev == 2 || ...", which can be quite a long expression to compile and evaluate. If the whole thing runs fast enough, or if you won’t have to do this very often, I would just keep it as it is.
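To make the two-pass idea concrete, here is a minimal sketch of the first pass and of the string-style filter. The helper names (collect_events, common_events, or_expression) and the column name "event_number" are made up for illustration, and the sketch assumes the events are identified by an integer column; adapt it to whatever identifier you actually use.

```python
def collect_events(df, column):
    """First pass: return the values of `column` for events passing df's filters.

    `df` is assumed to be an already-filtered RDataFrame node.
    AsNumpy triggers the event loop and returns a dict of NumPy arrays.
    """
    return df.AsNumpy(columns=[column])[column]


def common_events(events_a, events_b):
    """Return the sorted identifiers that appear in both selections.

    Assumes the identifiers are integers (hypothetical event-number column).
    """
    return sorted({int(e) for e in events_a} & {int(e) for e in events_b})


def or_expression(column, events):
    """Build the 'col == 1 || col == 2 || ...' string that Filter would JIT-compile."""
    return " || ".join(f"{column} == {e}" for e in events)
```

The expression returned by or_expression grows by one comparison per selected event, which is why it gets slow to compile and evaluate for large selections.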
That being said, in case you wanted to make the event filter easier to compile, you could write it in C++ and attach it to the computation graph. For this I use the AsRNode helper described at the very bottom of the Python interface box in the docs:
ROOT.gInterpreter.Declare("""
#include <algorithm>
#include <vector>

ROOT::RDF::RNode AttachFilter(ROOT::RDF::RNode df, const std::vector<unsigned long long> &inputEvents) {
    // Copy the input so it can be sorted without touching the caller's vector
    std::vector<unsigned long long> events = inputEvents;
    // Sort the vector first, so we can use the faster binary_search
    // You can also sort in numpy and leave out this step
    std::sort(events.begin(), events.end());
    // The lambda captures the sorted vector by value; its argument type must match the column type
    auto filter = [events](unsigned long long e) {
        return std::binary_search(events.begin(), events.end(), e);
    };
    return df.Filter(filter, {"x"}); // Replace "x" by the event number
}
""")
Now you can use that as an event filter in a new RDataFrame:
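# Create the RDataFrame
df = ROOT.RDataFrame("myTree", "myFile.root")
# Convert from numpy to a std::vector; the element type must match the AttachFilter signature
vec = ROOT.vector("unsigned long long")(numpy_array)
# Cast the RDataFrame head node to an RNode and attach the filter
df_filtered = ROOT.AttachFilter(ROOT.RDF.AsRNode(df), vec)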
You can probably write this event filter in a few different ways, but a binary search through a vector of eligible events seemed quick and simple.
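If you want to sanity-check the membership logic outside ROOT before wiring it into the C++ filter, the same sorted-vector binary search can be mirrored in plain Python with the standard bisect module. This is only an illustrative sketch; make_event_filter is a hypothetical helper name, not part of ROOT.

```python
from bisect import bisect_left


def make_event_filter(events):
    """Return a predicate that tests membership via binary search.

    Mirrors the C++ lambda: sort once up front, then each lookup is O(log n).
    """
    sorted_events = sorted(events)

    def passes(event):
        # bisect_left gives the first index whose value is >= event
        i = bisect_left(sorted_events, event)
        return i < len(sorted_events) and sorted_events[i] == event

    return passes
```

For example, with passes = make_event_filter(common), calling passes(e) on a handful of known event numbers should agree with a plain "e in common" check.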