Handle event duplication

Dear all,

I am processing some data written to different datasets. Each dataset contains events from a different (non-exclusive) trigger stream. The result is that I have some degree of event duplication/overlap, i.e. the same event is stored in multiple files.

Is there a way to handle this situation effectively in PROOF?

I know I could just create a digested ntuple with event numbers and the relevant info and then run some overlap removal on that one.
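To make that fallback concrete, here is a minimal sketch of the overlap-removal pass on the digested data. It uses plain standard C++, not ROOT. The `EventRecord` layout and the `removeOverlap` name are hypothetical, and I am assuming that (run number, event number) identifies an event uniquely across the trigger streams.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical digested record: one per event read from any trigger stream.
struct EventRecord {
    std::uint32_t run;    // run number
    std::uint64_t event;  // event number within the run
    double value;         // the quantity to be histogrammed later
};

// Remove duplicated events in place, keeping one record per (run, event).
// Returns the number of duplicates that were dropped.
std::size_t removeOverlap(std::vector<EventRecord>& records) {
    auto key = [](const EventRecord& r) { return std::make_tuple(r.run, r.event); };

    // Sort by key so that copies of the same event become adjacent;
    // stable_sort keeps the first-read copy in front of later ones.
    std::stable_sort(records.begin(), records.end(),
                     [&](const EventRecord& a, const EventRecord& b) {
                         return key(a) < key(b);
                     });

    // Collapse each run of equal keys to its first element.
    auto last = std::unique(records.begin(), records.end(),
                            [&](const EventRecord& a, const EventRecord& b) {
                                return key(a) == key(b);
                            });

    const std::size_t removed = static_cast<std::size_t>(records.end() - last);
    records.erase(last, records.end());
    return removed;
}
```

The cost is a second pass over the data plus a sort of the whole digested sample, which is what I was hoping to avoid.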

On the other hand, I am outputting mostly histograms, so I was wondering if there is a way for workers to signal the master asking “did anybody already process this?”.

Thanks for your help,

Andrea.

Dear Andrea,

I think I understand your problem, but what you need would introduce a (heavy) dependency between the workers and violate the basic PROOF paradigm.

Keeping a list of processed events and distributing it to all workers may be very heavy. The only way out is to split the process in two, somewhat as you have outlined, because the information needed to check for duplications has to be kept somewhere.

You can possibly do that in one go, by having the workers select the information and save it in, for example, a TTree, and having the master check for duplications during the merging phase and fill the final histograms.
You will have to write your own ad hoc output class for that, with a dedicated Merge method. I can help you with this, if you want to give it a try.
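
As a rough illustration of the idea, and not actual PROOF code, the sketch below uses plain standard C++. The `DedupOutput` class, the `SelectedEvent` struct and the simplified `Merge` signature are all hypothetical stand-ins; a real output object would derive from a ROOT class and receive the worker objects through the usual merging interface. The point is only that the workers record what they selected, and the histogram is filled at merge time, once per (run, event) pair.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical record saved by a worker for each selected event.
struct SelectedEvent {
    std::uint32_t run;
    std::uint64_t event;
    double value;  // quantity to histogram
};

// Hypothetical stand-in for the ad hoc output class.
class DedupOutput {
public:
    DedupOutput(std::size_t nbins, double lo, double hi)
        : lo_(lo), hi_(hi), bins_(nbins, 0) {}

    // Worker side: only store the selected event, do not fill anything yet.
    void Add(const SelectedEvent& e) { pending_.push_back(e); }

    // Master side: absorb one worker's output. An event fills the histogram
    // only the first time its (run, event) key is seen.
    // Returns the number of duplicates skipped.
    std::size_t Merge(const DedupOutput& worker) {
        std::size_t duplicates = 0;
        for (const auto& e : worker.pending_) {
            if (!seen_.emplace(e.run, e.event).second) {
                ++duplicates;
                continue;
            }
            Fill(e.value);
        }
        return duplicates;
    }

    const std::vector<unsigned long>& Bins() const { return bins_; }

private:
    // Fixed-width binning on [lo_, hi_); values outside the range are dropped.
    void Fill(double x) {
        if (bins_.empty() || x < lo_ || x >= hi_) return;
        auto i = static_cast<std::size_t>((x - lo_) / (hi_ - lo_) * bins_.size());
        ++bins_[std::min(i, bins_.size() - 1)];
    }

    double lo_, hi_;
    std::vector<unsigned long> bins_;
    std::vector<SelectedEvent> pending_;
    std::set<std::pair<std::uint32_t, std::uint64_t>> seen_;
};
```

Note that the set of seen keys lives only on the master, so the workers stay independent; the price is that the master has to hold one key per selected event in memory.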

G. Ganis

Dear Gerri,

Thanks for your answer.

Yes, that sounded rather improbable to me as well. Still, it was worth a try :)

What about removing the overlaps before processing instead of during it? I would still rather not modify the input files, nor make copies of them with the redundant events removed.

However, if the TDSet had some way of checking for overlaps and keeping a list of entries to be skipped, it could instruct the workers effectively without causing too much traffic.
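
Just to sketch what I have in mind, in plain standard C++ rather than anything that exists in TDSet today: given the (run, event) keys of every entry in every file, one could build a per-file list of entry indices to skip before the processing starts. The `buildSkipLists` function and the `keysPerFile` input are hypothetical, and I am assuming the keys can be obtained from a light pre-scan of the files.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// (run number, event number), assumed to identify an event uniquely.
using EventKey = std::pair<std::uint32_t, std::uint64_t>;

// keysPerFile[f][i] is the key of entry i in file f.
// Returns, for each file, the entry indices that duplicate an event
// already present in an earlier file (or earlier in the same file).
std::vector<std::vector<std::size_t>>
buildSkipLists(const std::vector<std::vector<EventKey>>& keysPerFile) {
    std::set<EventKey> seen;
    std::vector<std::vector<std::size_t>> skip(keysPerFile.size());

    for (std::size_t f = 0; f < keysPerFile.size(); ++f) {
        for (std::size_t i = 0; i < keysPerFile[f].size(); ++i) {
            // insert() reports whether the key was new; if not, skip the entry.
            if (!seen.insert(keysPerFile[f][i]).second) {
                skip[f].push_back(i);
            }
        }
    }
    return skip;
}
```

Each worker would then only need the skip list of the files it is assigned, which should be much lighter than sharing the full list of processed events at run time.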

jm2c,

Andrea.