Filtering duplicate event numbers from an RDataFrame

Hi,

Is there an easy way to use Filter to identify events that share, for example, the same event number? I would like to use RDF to filter out duplicate events, but I can’t figure out a simple way to do this.

best wishes

Louie

Welcome to the ROOT forum!
I’m sure @eguiraud can give you some hints

Thanks! Yes, that would be great. I presume that the table-like format of RDF would be able to handle a “sort unique” type of operation, but I did not see anything like that in the documentation. So I thought I’d check with the experts before trying something complicated. Hints appreciated!

Hi @LouieC ,
and welcome to the ROOT forum!
RDataFrame does not provide such an operation because a straightforward implementation of a full sort+unique requires all data to be in memory, and we typically deal with larger-than-memory datasets.

Depending on your actual use case, there are a number of ways you can go about this. For example, you can write a stateful (thread-safe) Filter function that returns true the first time it sees a given event number and false for every later occurrence.

Cheers,
Enrico
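
The stateful filter suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions, not code taken from this thread: the class name `FirstOccurrenceFilter` is hypothetical, the event number is assumed to fit in a 64-bit unsigned integer, and the set of seen event numbers is assumed to fit in memory. The state sits behind a `std::shared_ptr` so that copies of the callable share one set and one mutex.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_set>

// Hypothetical helper (not part of ROOT): accepts an event number the first
// time it is seen and rejects every later occurrence.
class FirstOccurrenceFilter {
   // State shared by all copies of the callable.
   struct State {
      std::mutex mutex;                        // guards 'seen' across threads
      std::unordered_set<std::uint64_t> seen;  // event numbers accepted so far
   };
   std::shared_ptr<State> fState = std::make_shared<State>();

public:
   // Returns true for the first occurrence of eventNumber, false for repeats.
   bool operator()(std::uint64_t eventNumber) const
   {
      std::lock_guard<std::mutex> lock(fState->mutex);
      // insert().second is true only if the value was not already in the set.
      return fState->seen.insert(eventNumber).second;
   }
};
```

With ROOT available, an instance could presumably be passed as `df.Filter(FirstOccurrenceFilter{}, {"eventNumber"})`, where `eventNumber` is a placeholder column name and the argument type of `operator()` has to match the type of that column. Note that memory use grows with the number of distinct event numbers, that in a multi-threaded event loop which of several duplicates is kept depends on the processing order, and that the state persists across event loops, so a fresh instance is needed for each run.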

An example of a stateful thread-safe filter is now available on GitHub: “A thread-safe stateful Filter for RDataFrame”.