Filtering duplicate event numbers from an RDataFrame


Is there an easy way to use the Filter functionality to select events with, for example, the same event number? I would like to use RDF to filter out duplicate events, but I can’t figure out how to do this in an easy way.

Thanks! Yes, that would be great. I presumed that the table-like format of RDF would be able to handle a “sort unique” type of operation, but I did not see anything like that in the documentation, so I thought I’d check with the experts before trying something complicated. Hints appreciated!

Hi @LouieC,
RDataFrame does not provide such an operation because the trivial implementation of a full sort+unique requires all data to be in memory, and we typically deal with larger-than-memory datasets.

Depending on your actual use case, there are a number of ways you can go about this. For example, you can use a stateful (thread-safe) Filter function that returns true the first time it sees a given event number and false on every later occurrence.
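
To make the idea concrete, here is a minimal sketch of such a filter in plain C++. The class name `FirstOccurrenceFilter` and the column name `eventNumber` used further down are made up for illustration and are not part of ROOT. I am also assuming the event number is stored as `ULong64_t` (i.e. `unsigned long long`); the argument type has to match your column's actual type. The state lives behind a `shared_ptr` so that copies of the functor all see the same set of event numbers.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_set>

// Hypothetical helper, not part of ROOT: returns true the first time a given
// event number is seen and false on every later occurrence.
class FirstOccurrenceFilter {
public:
    FirstOccurrenceFilter() : fState(std::make_shared<State>()) {}

    // Assumption: the event number column holds unsigned long long (ULong64_t).
    bool operator()(unsigned long long eventNumber) const
    {
        // The lock serialises access to the set when several threads
        // call the filter at the same time.
        std::lock_guard<std::mutex> lock(fState->mutex);
        // insert().second is true only if the value was not in the set yet.
        return fState->seen.insert(eventNumber).second;
    }

    // Number of distinct event numbers seen so far.
    std::size_t NumUnique() const
    {
        std::lock_guard<std::mutex> lock(fState->mutex);
        return fState->seen.size();
    }

private:
    struct State {
        std::mutex mutex;
        std::unordered_set<unsigned long long> seen;
    };
    // Shared so that copies of the functor keep using the same state.
    std::shared_ptr<State> fState;
};
```

With a dataframe `df`, this would be used along the lines of `df.Filter(FirstOccurrenceFilter{}, {"eventNumber"})`. Two caveats: the set grows with the number of distinct event numbers, so it still has to fit in memory, and with implicit multi-threading enabled it is not deterministic which of the duplicated entries is the one that gets kept.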


An example of a stateful thread-safe filter is now available on GitHub: “A thread-safe stateful Filter for RDataFrame”.