Selecting events in a TChain based on variables stored in a text file

Dear experts,

I have a text file with two columns (say A and B) and a TChain containing the same two variables and many others. The TChain has more rows than the text file. I want to save in a final tree only the events that match the values in the text file, i.e.:

if ((A_i == A_chain_i) && (B_i == B_chain_i)) keep the event (with all variables)

In other words, my final tree should contain the events of the TChain that also appear in the text file.

Can you suggest the fastest procedure (via TTree or RDataFrame, Python or C++)? Since I’m handling millions of events, a “classical” loop approach is far too slow.

Thanks for the attention.

Best regards,
Francesco

Hi Francesco,

have you tried saving the values of columns A and B from the text file in two std::sets, filtering events with RDF by checking whether the values of column A and then B are in the respective sets, and then calling Snapshot on the filtered dataset?

Cheers,
Danilo

Hello @Danilo,

Thanks for your reply!
Could you please give more details on this procedure?

Thanks,
Francesco

Ciao Francesco,

So, parse your text file and build 2 std::sets with the values of the A and B columns in the text file.
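
A minimal sketch of this parsing step, assuming the text file holds two whitespace-separated integer columns per line and no header. The helper name `readColumns` and the struct `ColumnSets` are hypothetical, and the integer type would need to be adjusted to match the branch types:

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical container for the values read from the two text-file columns.
struct ColumnSets {
    std::set<int> a;
    std::set<int> b;
};

// Reads "A B" rows, one per line, and skips lines that do not parse as two integers.
ColumnSets readColumns(std::istream& in) {
    ColumnSets sets;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int a = 0;
        int b = 0;
        if (fields >> a >> b) {
            sets.a.insert(a);
            sets.b.insert(b);
        }
    }
    return sets;
}
```

For a file on disk, one would pass a `std::ifstream` opened on it and then use `sets.a` and `sets.b` in place of set_a and set_b.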

Then, with RDF:

// supposing the sets are called set_a and set_b, and that the columns A and B are of type int

auto filtered_rdf = rdf.Filter([&set_a](int a){return set_a.count(a) == 1;}, {"A"}).Filter([&set_b](int b){return set_b.count(b) == 1;}, {"B"});
filtered_rdf.Snapshot("myTree", "myFilteredFile.root");

I hope this helps!

Cheers,
D

Ciao @Danilo

Thanks again!
I have only one concern: a column can contain repeated values. In my task A is eventNumber and B is mcChannelNumber, so for a given mcChannelNumber there are many values of eventNumber. Do you think your suggestion could also work with other types?

Thanks,
Francesco

Hi Francesco,

You can certainly implement an “or”, by passing the two sets and reading the two columns, if that’s what you want:

auto filtered_rdf = rdf.Filter([&set_a, &set_b](int a, int b){return set_a.count(a) == 1 || set_b.count(b) == 1;}, {"A", "B"});

If by “other types” you mean C++ types like double, long unsigned int or T, sure. It might be necessary to provide a custom comparator (see e.g. here), but that’s doable, no?
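
If what matters is that an (A, B) combination appears on the same row of the text file, rather than A and B each appearing somewhere in it, a single set of pairs may be closer to the requirement. A minimal sketch, assuming eventNumber is an unsigned long long and mcChannelNumber an unsigned int; the names `EventKey` and `makePairCut` are hypothetical:

```cpp
#include <set>
#include <utility>

// Hypothetical key type: (eventNumber, mcChannelNumber).
// std::pair already provides operator<, so std::set needs no custom comparator here.
using EventKey = std::pair<unsigned long long, unsigned int>;

// Returns a predicate that is true only when the (ev, id) pair itself is in the set.
// The set is captured by reference and must outlive the returned lambda.
auto makePairCut(const std::set<EventKey>& keys) {
    return [&keys](unsigned long long ev, unsigned int id) {
        return keys.count(EventKey{ev, id}) == 1;
    };
}
```

Such a predicate could presumably be passed to a single Filter call with both column names in the column list, provided the lambda's parameter types match the branch types.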

I hope this helps.

Cheers,
D

Hello @Danilo

I’m not very familiar with lambda functions: how does RDataFrame know which branch the function should be applied to?

Also, should the first part be inverted? I mean

(int a)[&set_a]
// supposing the sets are called set_a and set_b, and that the columns A and B are of type int

auto filtered_rdf = rdf.Filter([&set_a](int a){return 1 == set_a.find(a);}).Filter([&set_b](int b){return 1 == set_b.find(b);}, {"A", "B"});
rdf.Snapshot("myTree", "myFilteredFile.root");

Francesco

Hi Francesco,

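I edited the examples above: I had clearly forgotten to list the columns. The column names in the second argument of Filter are matched, in order, to the lambda's parameters, which is how RDataFrame knows which branch to pass. Now it’s all fine.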

Best,
D

Hello Danilo,

I found this solution, which differs slightly in the return expression of the lambdas:

    auto evCut = [&s_ev](unsigned long long int x){return s_ev.find(x) != s_ev.end();};
    auto idCut = [&s_id](unsigned int x){return s_id.find(x) != s_id.end();};

    auto filtered_df = df.Filter(evCut, {"eventNumber"}).Filter(idCut, {"mcChannelNumber"});
    filtered_df.Snapshot("AnalysisMiniTree", "myFilteredFile.root");
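
The return expression matters because `std::set::find` returns an iterator rather than a count, so it has to be compared with `end()`; `count` is an equivalent alternative that returns 0 or 1 for a `std::set`. A small stand-alone sketch of the two forms (the helper names `inSetFind` and `inSetCount` are hypothetical):

```cpp
#include <set>

// Membership test via find(): true when the result is not the past-the-end iterator.
template <typename T>
bool inSetFind(const std::set<T>& s, const T& x) {
    return s.find(x) != s.end();
}

// Equivalent test via count(): std::set holds unique keys, so count() is 0 or 1.
template <typename T>
bool inSetCount(const std::set<T>& s, const T& x) {
    return s.count(x) == 1;
}
```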

Thanks for the help!

Cheers,
Francesco

Great. Thanks for sharing.

Cheers,
D
