RDataFrame and multiple-candidate taggers

Dear ROOT and RDataFrame experts,
I have a very specific implementation in mind.
In practice, I have an RVec<double> column holding the weights I use to fill 100 x N-observable x 3 histograms.
I can easily implement a Histo1D case, or some sort of functor that fills my 100 histograms; however, when doing this, all entries passing a given selection are used.
Now, within the selected events, I run a further check to strip out events that satisfy a matching condition which I call “multiple-candidate”: I apply a dedicated selection for that, and then I fill some histograms with those events only.
I was wondering how one can achieve this while avoiding Take-ing the columns.

For example, I was thinking of writing the code this way:

ROOT::RDataFrame df(GetTupleFromSomewhere());
auto dfSel = df.Define("selectionPass", "mycut");
auto booSelPass = dfSel.Take<bool>("selectionPass");
auto rdfEntry = dfSel.Take<ULong64_t>("rdfentry_");
// some other Take to tag unique candidates (columnIDEvents)
std::vector<ULong64_t> _entries_to_skip = GetEntriesIndexesToSkip(booSelPass, rdfEntry, columnIDEvents);

class MyHistoFiller {
public:
   MyHistoFiller(const std::vector<ULong64_t> &entries_to_skip);
   void operator()(double varFill, ULong64_t rdfEntry, const ROOT::RVec<double> &weightColumn)
   {
      // check if rdfEntry is in m_entries_to_skip; if it is, return without filling
      for (std::size_t i = 0; i < weightColumn.size(); ++i) {
         _histosBs[i].Fill(varFill, weightColumn[i]); // one histogram per weight
      }
   }

private:
   std::vector<ULong64_t> m_entries_to_skip;
   std::vector<TH1D> _histosBs = std::vector<TH1D>(100);
};

MyHistoFiller histFilling(_entries_to_skip);
// book the Histo[100] filling with whatever function can be used, but using a functor whose operator() decides what to do based on the rdfentry_ values tagged as multiple candidates
df.Fill(histFilling, {"variable", "rdfentry_", "WeightVectorColumn"});

Of course this is pseudocode, and most likely I need to write a functor that acts on slots and merges the results at the end. But basically I wonder whether, when triggering the event loop more than once, the rdfentry_ numbering is preserved, or whether I should expect the order to be scrambled.

Hi Renato,

The association between a given rdfentry_ value and a given TTree entry is not stable across multi-threaded runs.

Instead of Take-ing booSelPass, you should Define it as an RDF column and pass that column to MyHistoFiller's operator() for each event.
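A minimal sketch of that suggestion, written against the standard library only so that it stands alone. Everything named here is an assumption for illustration: SlotWeightedFiller is a hypothetical helper (not a ROOT class), std::vector<double> stands in for ROOT::RVec<double>, and plain weighted sums stand in for the 100 TH1D objects. The selection flag arrives as an ordinary per-event argument, so no entry number is ever compared, and each processing slot gets its own accumulators that are merged at the end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ROOT: accumulates one weighted sum per
// weight index, separately for each processing slot.
class SlotWeightedFiller {
public:
   SlotWeightedFiller(unsigned int nSlots, std::size_t nWeights)
      : fSums(nSlots, std::vector<double>(nWeights, 0.0))
   {
   }

   // Called once per event. 'slot' identifies the worker, 'selectionPass'
   // is the value of the Define-d selection column for this event.
   void operator()(unsigned int slot, double varFill, bool selectionPass,
                   const std::vector<double> &weights)
   {
      if (!selectionPass)
         return; // event rejected: nothing is accumulated
      auto &sums = fSums[slot];
      for (std::size_t i = 0; i < weights.size() && i < sums.size(); ++i)
         sums[i] += varFill * weights[i]; // stand-in for histo[i].Fill(varFill, weights[i])
   }

   // Combine the per-slot partial results once the event loop has finished.
   std::vector<double> Merge() const
   {
      std::vector<double> total(fSums.empty() ? 0 : fSums.front().size(), 0.0);
      for (const auto &sums : fSums)
         for (std::size_t i = 0; i < sums.size(); ++i)
            total[i] += sums[i];
      return total;
   }

private:
   std::vector<std::vector<double>> fSums; // indexed as [slot][weight index]
};
```

With RDataFrame, an object of this kind would presumably be driven through a lambda that captures it by reference and is passed to ForeachSlot with the columns {"variable", "selectionPass", "WeightVectorColumn"}, the number of slots coming from GetNSlots(); treat that wiring as an untested assumption.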


Hi @eguiraud, maybe I explained my issue badly. In practice, in order to identify double candidates and remove them, I need to trigger the event loop twice: the first time to get all the unique IDs of the events, then tag some for removal in between, and a second time to consider only events that pass both the selection and the uniqueness requirement. I was therefore looking for suggestions on how best to implement this within RDF, possibly triggering the event loop twice or, even better, only once, perhaps with a custom action that does the filling and cleans up the returned object at the finalization stage?
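The two-pass idea described here can be sketched without relying on entry numbers at all, by keying on the event ID itself. The names EventId, FindMultipleCandidateIds and IsSingleCandidate are hypothetical, and the policy (dropping every event whose ID occurs more than once) is an assumption; the actual tagging rule may differ.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <vector>

using EventId = unsigned long long; // assumed ID type

// Between the two passes: given all IDs collected in the first event loop,
// return the set of IDs that occur more than once.
std::unordered_set<EventId> FindMultipleCandidateIds(const std::vector<EventId> &ids)
{
   std::unordered_map<EventId, unsigned int> counts;
   for (EventId id : ids)
      ++counts[id];

   std::unordered_set<EventId> multiple;
   for (const auto &kv : counts)
      if (kv.second > 1)
         multiple.insert(kv.first);
   return multiple;
}

// Second pass: per-event predicate, true if the ID was not tagged.
bool IsSingleCandidate(const std::unordered_set<EventId> &multiple, EventId id)
{
   return multiple.count(id) == 0;
}
```

Because the lookup is by ID rather than by position in the event loop, the result does not depend on the order in which entries are processed in the second pass.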

I still don’t understand why you need to loop twice over the data in order not to select duplicate events: you can have a stateful (thread-safe) filter expression that remembers which IDs it has already seen. That is my reply to your question about how to avoid Take-ing the columns. Now, depending on the size of the dataset and the number of threads you expect to run the application on, one approach or the other might be faster.
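A minimal sketch of such a stateful, thread-safe filter, using only the standard library. FirstOccurrenceFilter is a hypothetical name, and the ID type is assumed to be unsigned long long. The state lives behind shared pointers so that copies of the functor (a framework may copy the callable it is given) still share one set and one mutex.

```cpp
#include <memory>
#include <mutex>
#include <unordered_set>

// Hypothetical helper, not part of ROOT: returns true only the first time a
// given ID is seen, false for every later occurrence.
class FirstOccurrenceFilter {
public:
   bool operator()(unsigned long long id)
   {
      std::lock_guard<std::mutex> lock(*fMutex); // serialise access from all threads
      // insert() returns {iterator, bool}; the bool is true if the ID was new
      return fSeen->insert(id).second;
   }

private:
   // Shared state: all copies of this functor see the same set and mutex.
   std::shared_ptr<std::mutex> fMutex = std::make_shared<std::mutex>();
   std::shared_ptr<std::unordered_set<unsigned long long>> fSeen =
      std::make_shared<std::unordered_set<unsigned long long>>();
};
```

In RDataFrame this could presumably be booked as a Filter on the ID column, for example df.Filter(FirstOccurrenceFilter{}, {"eventID"}) with "eventID" a hypothetical column name. Note that this keeps one candidate per ID rather than dropping all of them, and that in a multi-threaded run which of the duplicates is seen first is not expected to be reproducible.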
