I have a text file with two columns (say A and B) and a TChain containing the same two variables and many others. The text file and the TChain have different numbers of rows (the TChain has more). I want to save in a final tree only the events matching the values in the text file. So
if (A_i == A_chain_i && B_i == B_chain_i) keep the event (with all variables)
In other words, my final tree should contain the events of the TChain that also appear in the text file.
Can you suggest the fastest procedure (via TTree or RDataFrame, Python or C++)? Since I’m handling millions of events, a “classical” loop approach is far too slow.
Have you perhaps tried saving the values of columns A and B from the text file in std::sets, filtering events with RDF by checking whether the value of column A and then of column B is in the respective set, and then taking a Snapshot of the dataset?
So, parse your text file and build 2 std::sets with the values of the A and B columns in the text file.
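The parsing step could be sketched as below. This is only a minimal sketch, assuming the text file holds two whitespace-separated integer columns per line; the helper name read_two_columns and the long long element type are made up for illustration and should be adapted to the real column types.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: reads whitespace-separated "A B" rows from `in`
// and fills one set per column. Lines that do not parse as two integers
// (blank lines, headers, malformed rows) are skipped.
inline void read_two_columns(std::istream& in,
                             std::set<long long>& set_a,
                             std::set<long long>& set_b)
{
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        long long a = 0, b = 0;
        if (row >> a >> b) {
            set_a.insert(a);
            set_b.insert(b);
        }
    }
}
```

With a real file this would be called on a std::ifstream opened on the list (the file name is up to you), and the two sets are then ready to be captured by the filter lambdas.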
Then, with RDF:
// assuming the sets are called set_a and set_b, and that columns A and B are of type int
auto filtered_rdf = rdf.Filter([&set_a](int a){return set_a.count(a) == 1;}, {"A"}).Filter([&set_b](int b){return set_b.count(b) == 1;}, {"B"});
filtered_rdf.Snapshot("myTree", "myFilteredFile.root");
Thanks again!
I have only one concern: a column can contain repeated values. In my task A is eventNumber and B is mcChannelNumber, so for a given mcChannelNumber there are several values of eventNumber. Do you think your suggestion could also work with other types?
You can certainly implement an “or”, by passing the two sets and reading the two columns, if that’s what you want:
auto filtered_rdf = rdf.Filter([&set_a, &set_b](int a, int b){return set_a.count(a) == 1 || set_b.count(b) == 1;}, {"A", "B"});
If by “other types” you mean C++ types like double, long unsigned int or T, sure. It might be necessary to provide a custom comparator (see e.g. here), but that’s doable, no?
I’m not very used to lambda functions: how does RDataFrame know which branch the function should be applied to?
Also, should the first part of the lambda be inverted? I mean, should it be
[&set_a](int a) rather than (int a)[&set_a]
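// assuming the sets are called set_a and set_b, and that columns A and B are of type int
// the capture list [&set_a] comes first, then the parameters (int a); the list of column names after the lambda tells RDataFrame which branches to pass as its arguments, in order
auto filtered_rdf = rdf.Filter([&set_a](int a){return set_a.count(a) == 1;}, {"A"}).Filter([&set_b](int b){return set_b.count(b) == 1;}, {"B"});
filtered_rdf.Snapshot("myTree", "myFilteredFile.root");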
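Since the question was about how the lambda is put together, here is a small standalone sketch of the same kind of predicate, without ROOT. The helper name make_membership_cut is hypothetical; it only illustrates that the square brackets hold the captured outside variables and the round brackets hold the arguments supplied on each call.

```cpp
#include <set>

// Hypothetical helper that builds the same kind of predicate used in Filter.
// [&allowed] is the capture list: outside variables the lambda may use,
//            here taken by reference (no copy of the set is made).
// (int value) is the parameter list: what the caller passes on each call.
inline auto make_membership_cut(const std::set<int>& allowed)
{
    return [&allowed](int value) { return allowed.count(value) == 1; };
}
```

Because the set is captured by reference, it has to stay alive for as long as the lambda is used, which in the RDataFrame case means until the event loop triggered by Snapshot has finished.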
I found this solution, which differs slightly in the return expression:
auto evCut = [&s_ev](unsigned long long int x){return s_ev.find(x) != s_ev.end();} ;
auto idCut = [&s_id](unsigned int x){return s_id.find(x) != s_id.end();} ;
auto filtered_df = df.Filter(evCut, {"eventNumber"}).Filter(idCut, {"mcChannelNumber"});
filtered_df.Snapshot("AnalysisMiniTree", "myFilteredFile.root");
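One caveat about two independent sets: they keep any event whose eventNumber appears somewhere in the text file and whose mcChannelNumber appears somewhere in the text file, not necessarily on the same row. If the pairing matters, as in the original condition on A and B together, a set of pairs might be a better fit. The following is only a sketch under that assumption; EventKey and make_pair_cut are hypothetical names, and the types are taken from the lambdas above.

```cpp
#include <set>
#include <utility>

// Types follow the lambdas above: eventNumber is unsigned long long,
// mcChannelNumber is unsigned int.
using EventKey = std::pair<unsigned long long, unsigned int>;

// Hypothetical helper: returns a predicate that is true only when the
// (eventNumber, mcChannelNumber) pair appears together in `keys`,
// i.e. on the same row of the text file.
inline auto make_pair_cut(const std::set<EventKey>& keys)
{
    return [&keys](unsigned long long ev, unsigned int id) {
        return keys.count(EventKey{ev, id}) == 1;
    };
}
```

The returned predicate takes two arguments, so it would be given to a single Filter call together with both column names, {"eventNumber", "mcChannelNumber"}, instead of chaining two filters.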