Selecting events in a TChain based on variables stored in a text file

Francesco_Cirotto · January 24, 2025, 11:02am

Dear experts,

I have a text file with two columns (say A and B) and a TChain containing the same two variables and many others. The text file and the TChain have different numbers of rows (the TChain has more). I want to save in a final tree only the events matching the values in the text file. So:

if(A_i == A_chain_i) && (B_i == B_chain_i) keep the event (with all variables)

In other words, my final tree should contain the events of the TChain that also appear in the text file.

Can you suggest the fastest procedure (via TTree or RDataFrame, Python or C++)? Since I’m handling millions of events, a “classical” loop approach is far too slow.

Thanks for the attention.

Best regards,
Francesco

Danilo · January 24, 2025, 2:12pm

Hi Francesco,

Have you perhaps tried saving the values of columns A and B from the text file in two std::sets, filtering the events with RDF by checking whether the value of column A and then of column B is in the respective set, and then taking a Snapshot of the dataset?

Cheers,
Danilo

Francesco_Cirotto · January 24, 2025, 2:18pm

Hello @Danilo,

Thanks for your reply!
Could you please give more information on this procedure?

Thanks,
Francesco

Danilo · January 24, 2025, 2:51pm

Ciao Francesco,

So, parse your text file and build 2 std::sets with the values of the A and B columns in the text file.
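
As a minimal sketch of that parsing step, assuming the text file holds one whitespace-separated pair of integers per line (the helper name readColumns is made up for illustration, and lines that do not parse as two integers are simply skipped):

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: reads "A B" integer pairs, one per line, and fills
// one set per column. Lines that do not parse as two integers are skipped.
inline void readColumns(std::istream& in, std::set<int>& set_a, std::set<int>& set_b)
{
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int a = 0;
        int b = 0;
        if (fields >> a >> b) {
            set_a.insert(a);
            set_b.insert(b);
        }
    }
}
```

It takes any std::istream, so it can be called with a std::ifstream opened on the text file. Repeated values in a column are stored only once, since a std::set keeps unique elements.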

Then, with RDF:

// supposing the sets are called set_a and set_b, and that the columns A and B are of type int

auto filtered_rdf = rdf.Filter([&set_a](int a){return set_a.count(a) == 1;}, {"A"}).Filter([&set_b](int b){return set_b.count(b) == 1;}, {"B"});
filtered_rdf.Snapshot("myTree", "myFilteredFile.root");

I hope this helps!

Cheers,
D

Francesco_Cirotto · January 24, 2025, 3:05pm

Ciao @Danilo

Thanks again!
I have only one concern: a column can contain repeated values. In my task A is eventNumber and B is mcChannelNumber, so for a given mcChannelNumber there are several values of eventNumber. Do you think your suggestion could also work with other types?

Thanks,
Francesco

Danilo · January 24, 2025, 3:28pm

Hi Francesco,

You can certainly implement an “or”, by passing the two sets and reading the two columns, if that’s what you want:

auto filtered_rdf = rdf.Filter([&set_a, &set_b](int a, int b){return set_a.count(a) == 1 || set_b.count(b) == 1;}, {"A", "B"});

If by “other types” you mean C++ types like double, long unsigned int or T, sure. It might be necessary to provide a custom comparator (see e.g. here), but that’s doable, no?
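
As one possible illustration of a custom comparator (not necessarily what the linked example shows), here is a sketch for double columns. It assumes the values are finite and were written to the text file with about six decimal places; both the precision and the names QuantizedLess, QuantizedSet and inSet are assumptions made for this example.

```cpp
#include <cmath>
#include <set>

// Hypothetical comparator: orders doubles by their value rounded to six
// decimal places, so values that round to the same integer compare equivalent.
// Assumes finite inputs small enough that x * 1e6 fits in a long long.
struct QuantizedLess {
    bool operator()(double x, double y) const
    {
        return std::llround(x * 1e6) < std::llround(y * 1e6);
    }
};

using QuantizedSet = std::set<double, QuantizedLess>;

// Returns true when a value equivalent to x (after rounding) is in the set.
inline bool inSet(const QuantizedSet& values, double x)
{
    return values.count(x) == 1;
}
```

Comparing the rounded integers keeps the ordering consistent, which a naive "equal within a tolerance" comparator would not. The price is that two values sitting just on either side of a rounding boundary are treated as different.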

I hope this helps.

Cheers,
D

Francesco_Cirotto · January 24, 2025, 5:17pm

Hello @Danilo

I’m not very familiar with lambda functions; how does RDataFrame know which branch the function should be applied to?

Also, should the first part be inverted? I mean

(int a)[&set_a]

// supposing the sets are called set_a and set_b, and that the columns A and B are of type int

auto filtered_rdf = rdf.Filter([&set_a](int a){return set_a.count(a) == 1;}, {"A"}).Filter([&set_b](int b){return set_b.count(b) == 1;}, {"B"});
filtered_rdf.Snapshot("myTree", "myFilteredFile.root");

Francesco

Danilo · January 24, 2025, 7:03pm

Hi Francesco,

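I edited the examples above; I had clearly forgotten to list the columns. The column names passed as the second argument of Filter tell RDataFrame which branches are handed to the lambda, in that order. Now it’s all fine.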

Best,
D

Francesco_Cirotto · January 24, 2025, 10:09pm

Hello Danilo,

I found this solution, which differs slightly in the return expression:

    auto evCut = [&s_ev](unsigned long long int x){return s_ev.find(x) != s_ev.end();};
    auto idCut = [&s_id](unsigned int x){return s_id.find(x) != s_id.end();};

    auto filtered_df = df.Filter(evCut, {"eventNumber"}).Filter(idCut, {"mcChannelNumber"});
    filtered_df.Snapshot("AnalysisMiniTree", "myFilteredFile.root");

Thanks for the help!

Cheers,
Francesco

Danilo · January 25, 2025, 6:34am

Great. Thanks for sharing.

Cheers,
D
