Accessing entry information using RDataFrame

amvargash · November 15, 2022, 6:35pm

Dear ROOT Experts,

I’m using RDataFrame to populate a TEntryList but I’m not sure how to access the Long64_t entry in addition to the branches present in the TTree.

I’ve defined my lambda as:

   auto fillEntryList = [&](Long64_t entry, UInt_t run, ULong64_t event){
        // "run" and "event" are branches within a TChain
        // executes TEntryList::Enter
   };

While executing a Foreach:

   d.Foreach(fillEntryList,{"run","event"});

Which understandably throws the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  3 column names are required but 2 were provided: "run", "event".

How should I modify my lambda (or the Foreach call) so that it receives the entry currently being processed (as in TTreeReader::GetCurrentEntry())?

Thank you!

amvargash · November 16, 2022, 2:01am

   d.Foreach(fillEntryList,{"rdfentry_","run","event"});

According to https://root.cern.ch/doc/master/df001__introduction_8C_source.html

bellenot · November 16, 2022, 8:02am

So you found the solution?

eguiraud · November 16, 2022, 8:17am

Yep, that’s it. Note however, as per the docs, that in multi-thread runs over multiple trees in a TChain rdfentry_ will not always correspond to the global entry number in the chain, so the TEntryList filling is only safe for single-thread runs (without EnableImplicitMT). I hope to remove this limitation in the future.

amvargash · November 16, 2022, 2:05pm

If I still want to use multi-threading, is there any way around it?

eguiraud · November 17, 2022, 11:34am

Hi @amvargash ,

The auto-generated rdfentry_ column has the limitation mentioned above (at least for now), but nothing stops you from having an actual column (maybe in a friend tree) that contains the desired event number. As a workaround, you can run a single-thread program, just once, that produces a column with the global chain event number and stores it in a tree:

import ROOT

# Run single-threaded (no EnableImplicitMT); original_chain is the input TChain.
ROOT.RDataFrame(original_chain.GetEntries())\
  .Alias("GlobalEventNumber", "rdfentry_")\
  .Snapshot("event_numbers", "event_numbers.root", ["GlobalEventNumber"])

For any further processing (including multi-thread processing) you can then add event_numbers as a friend of the main chain, and you will have the column GlobalEventNumber with the right value for every event.

Cheers,
Enrico

Wile_E_Coyote · November 17, 2022, 12:02pm

@eguiraud You assume that exactly the same set of ROOT files will always be used, and they will be loaded / processed in exactly the same order. If the chain changes in any way, the stored GlobalEventNumber will be meaningless.

eguiraud · November 17, 2022, 12:06pm

No, I am saying that as a workaround for this limitation (that we want to lift in the future) you can add this extra step to the analysis pipeline, which as you correctly point out will have to be re-executed whenever what is in the input TChain changes (“once” above is “once per input dataset”).
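
To make the "once per input dataset" point concrete: a TChain numbers its entries globally, so an entry's global number is its local entry number plus the total number of entries in the trees that come before it in the chain. The sketch below is plain Python (no ROOT) and only illustrates that arithmetic. The file names, entry counts and helper functions are all made up for the example.

```python
from itertools import accumulate

def tree_offsets(entries_per_file):
    """Global entry number at which each file's tree starts in the chain."""
    return [0] + list(accumulate(entries_per_file))[:-1]

def global_entry(entries_per_file, file_index, local_entry):
    """Global chain entry number for a local entry of a given file."""
    if not 0 <= local_entry < entries_per_file[file_index]:
        raise IndexError("local entry out of range")
    return tree_offsets(entries_per_file)[file_index] + local_entry

# Hypothetical dataset: three files with 100, 250 and 50 entries.
entries = {"a.root": 100, "b.root": 250, "c.root": 50}
order_1 = ["a.root", "b.root", "c.root"]
order_2 = ["b.root", "a.root", "c.root"]  # same files, different order

def entry_in(order, filename, local_entry):
    """Global number of (filename, local_entry) for a given file order."""
    counts = [entries[name] for name in order]
    return global_entry(counts, order.index(filename), local_entry)

first = entry_in(order_1, "a.root", 10)   # a.root is first: offset 0
second = entry_in(order_2, "a.root", 10)  # a.root is second: offset 250
print(first, second)
```

The same event gets global number 10 in one ordering and 260 in the other, which is why the stored GlobalEventNumber has to be regenerated whenever the list or order of input files changes.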
