How to skip repeated events on chained TTrees when using TProof?

amvargash · April 17, 2020, 7:29pm

Hello,

I have some root files: a,b,c… These files contain a TTree with the same name t1.

t1 contains two branches, run and event (both ULong64_t), which together uniquely identify an event.

I’m using TChain and TTreeReader to loop through the events. However, the files DO NOT contain unique events: events that are in a may also be found in the trees from b and c.

When running multiple workers with TProof how can I make sure that the same event is not added multiple times to the histograms?

Thanks a lot!

jblomer · April 19, 2020, 8:50pm

@ganis Can you help?

ganis · April 20, 2020, 6:44am

Hello,

I am afraid this goes against the parallel processing model of TTree or RNTuple, where events are assumed to be independent.
You should clean up your files beforehand, removing the duplicates. You can create a TEntryList with unique entries. Or you can save your output in a TTree and fill the histograms from there, filtering out duplicates.
But I think the more solid way would be to invest a bit of thinking to avoid producing files with duplications.

G Ganis

amvargash · April 20, 2020, 1:50pm

Thanks @ganis. I used TEntryList to create an initial smaller set (which still has duplicated entries, as the production of the files is beyond my control). I am thinking about storing an additional unique id, std::stol(std::to_string(run)+std::to_string(event)) (provided that this is still within long range), so that I can work on modifying the TEntryList instead of dealing with entire TTrees. I tried to store this “id” in place of the entry number in the TEntryList, but it crashes; looking at the code, it seems the entry needs to refer to the TTree. I might instead use TEntryListBlock to create a “custom” TEntryList that stores this “event id”, which I can use to filter out the repeated events. Do you think that’s a reasonable approach?
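
One caveat with an id built by concatenating the decimal digits of run and event: different pairs can produce the same digits (run 1, event 23 and run 12, event 3 both give "123"). A minimal sketch of an alternative, assuming the id is only needed as a lookup key, keeps the two numbers separate in a small struct with its own hash. The names ConcatId, EventId, EventIdHash and EventIdSet are hypothetical and not part of ROOT; only the standard library is used.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical helper mirroring the string-concatenation id from the post.
// Kept as a string here so the sketch does not depend on the range of long.
std::string ConcatId(std::uint64_t run, std::uint64_t event) {
    return std::to_string(run) + std::to_string(event);
}

// Hypothetical composite key: run and event stay separate, so two different
// (run, event) pairs never compare equal.
struct EventId {
    std::uint64_t run;
    std::uint64_t event;
    bool operator==(const EventId& other) const {
        return run == other.run && event == other.event;
    }
};

// Hash that mixes both fields (a common hash-combine recipe).
struct EventIdHash {
    std::size_t operator()(const EventId& id) const {
        std::size_t h = std::hash<std::uint64_t>{}(id.run);
        h ^= std::hash<std::uint64_t>{}(id.event) + 0x9e3779b97f4a7c15ULL
             + (h << 6) + (h >> 2);
        return h;
    }
};

// Set of already-seen events, usable for duplicate checks.
using EventIdSet = std::unordered_set<EventId, EventIdHash>;
```

With this key, seen.insert(EventId{run, event}).second is true only the first time a given pair is inserted. The cost is 16 bytes per event instead of 8, which may matter if memory is tight.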

Axel · April 24, 2020, 11:43am

That depends on a lot of details. I assume that duplicate entries are consecutive?

If duplication is rare, try this: create a new tree in a separate file, with just one branch "duplicate", and set that branch (a bool, say) to true when the entry is a duplicate. When you process your main tree, add that new tree with the duplication info as a friend. Now you can process the tree and skip duplicate entries.

If duplication happens often, fill a TEntryList:
Fill the TTree entry number; remember the event and run number.
Now move to the next TTree entry, check its event and run number. If different, remember them for the next-to-next entry, and fill the TEntryList. Else skip that entry. Repeat.
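
The selection logic of the steps above can be sketched without ROOT as follows. This is only an illustration: the (run, event) values are assumed to have been read into a vector already, the returned indices stand in for the entry numbers one would put into the TEntryList, and the names RunEvent and KeepFirstOfConsecutive are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (run, event) of one tree entry.
using RunEvent = std::pair<std::uint64_t, std::uint64_t>;

// Returns the indices of the entries to keep: the first entry of every run of
// consecutive identical (run, event) pairs. The indices play the role of the
// entry numbers that would be entered into a TEntryList.
std::vector<std::size_t> KeepFirstOfConsecutive(const std::vector<RunEvent>& entries) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        // Keep the entry if it is the first one or differs from its predecessor.
        if (i == 0 || entries[i] != entries[i - 1]) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

Note that this only removes duplicates that sit next to each other; an event that reappears later, after other events, is kept again.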

amvargash · April 29, 2020, 3:36pm

Hi @Axel, thanks. I ended up doing something similar, if not the same. From the first dataset I built a TTree holding the run and event numbers combined into a single id with std::stol (to save some memory by not keeping them individually; this has some range limitations, but it works in my particular case) and created a TEntryList for it. For each following dataset, I load that TTree’s content into a std::unordered_set to quickly check whether the event being read is a duplicate; if not, I add it to the TEntryList (and keep a TTree with the newly added events), then repeat the process for the next dataset. I can’t use many workers because of memory limitations (the unordered_set gets big), but I handled that by adding cuts to select only the events that are needed. Finally, I add up all the TEntryLists, and the result is used for the histogram filling.

Note: In this case duplicate entries are not consecutive; duplicated events are found in different files.
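
For readers who want to try the same idea, here is a rough standard-library sketch of the bookkeeping described above, not the actual code used. It assumes run and event each fit in 32 bits so they can be packed into one 64-bit id (a fixed-width stand-in for the std::stol id), and it returns, per dataset, the entry indices that would go into that dataset’s TEntryList. PackId, Dataset and SelectUnique are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_set>
#include <utility>
#include <vector>

using RunEvent = std::pair<std::uint64_t, std::uint64_t>;  // (run, event)
using Dataset = std::vector<RunEvent>;                     // one file's entries

// Assumption: run and event each fit in 32 bits. Throws otherwise, instead of
// silently producing colliding ids.
std::uint64_t PackId(std::uint64_t run, std::uint64_t event) {
    if (run > 0xFFFFFFFFULL || event > 0xFFFFFFFFULL) {
        throw std::out_of_range("run or event does not fit in 32 bits");
    }
    return (run << 32) | event;
}

// Processes the datasets in order, sharing one set of seen ids, and returns
// for each dataset the indices of entries not seen in any earlier entry.
std::vector<std::vector<std::size_t>> SelectUnique(const std::vector<Dataset>& datasets) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<std::vector<std::size_t>> kept(datasets.size());
    for (std::size_t d = 0; d < datasets.size(); ++d) {
        for (std::size_t i = 0; i < datasets[d].size(); ++i) {
            const RunEvent& re = datasets[d][i];
            // insert() reports whether the id was new.
            if (seen.insert(PackId(re.first, re.second)).second) {
                kept[d].push_back(i);
            }
        }
    }
    return kept;
}
```

The set grows with the number of distinct events, which is the memory limitation mentioned above; applying the analysis cuts before inserting into the set keeps it smaller.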
