Creating TEntryList with multithreaded RDataFrame

Hello,

I was excited to read that RDataFrames constructed with TChains that have a TEntryList loaded should respect that entry list (although the old forum threads about this imply there may be some caveats here, which I will need to study in more detail). But what I wanted to check is if there’s a way to create the TEntryList in a multithreaded run?

The situation is that I have data spread over many files and am currently processing it with RDataFrame from a TChain. However, I frequently want to run over the data again just to plot something different, without changing the event selection. Ideally, during my first run over all the data I would build a TEntryList and save it to a file (it's a lot of data, so I imagine I'd need to use TEntryListFromFile, as I worry the full TEntryList may be too big for memory). Subsequent runs could then process a chain with this entry list.

But am I correct in thinking that I cannot build a TEntryList in a multithreaded run because the entry number isn't reliably available? I saw there is a DefineSlotEntry method, but assumed the entry number there is a thread-local entry number rather than the entry number in the input data (why isn't it possible to get that global entry number in the thread?). So am I right that I'd have to build the TEntryList in a single-threaded run with a custom action (is there an example of this available?), and only then process in parallel?

Just to add: since the data is spread over lots of files, I was thinking I would need to use TEntryListFromFile and hence have a separate TEntryList for each file of the chain. If there were a way to process multiple files in a multithreaded run where each thread takes care of a single file (no file shared between threads), I could build my entry lists that way. But I don't think that mode of event looping is supported, is it?

Happy to hear otherwise, and that what I want to do is possible after all.

Thanks!
Will

I think @eguiraud can help.

Hi @will_cern ,
sorry for the high latency, I was off last week.

That’s correct.

Yes, a single-threaded run is needed for that. You might also use a simple Foreach instead of a custom action, e.g. df.Foreach([&elist](ULong64_t e) { elist.Enter(e); }, {"rdfentry_"}).

I am afraid one-file-per-thread processing is not supported. Building a proper TEntryList for multiple files from an RDF event loop is tricky, because it requires reacting to switches in the input tree. I’ll see if I can cook something up in the next few days to get you started (probably not before Thursday, sorry).
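To illustrate what reacting to tree switches involves: a per-file entry list needs the entry number local to each file, while a loop over a chain usually provides a chain-wide number. Below is a minimal sketch of that bookkeeping in plain C++, assuming the number of entries in each file is known up front and that the loop supplies a chain-wide entry number. ChainOffsets and Locate are hypothetical names for illustration, not part of ROOT.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper (not a ROOT class): maps a chain-wide entry number
// to the file it belongs to and the entry number local to that file.
class ChainOffsets {
public:
    // entriesPerFile[i] is the number of entries in the i-th file of the chain.
    explicit ChainOffsets(const std::vector<std::uint64_t> &entriesPerFile) {
        std::uint64_t total = 0;
        for (std::uint64_t n : entriesPerFile) {
            total += n;
            ends_.push_back(total); // one past the last chain-wide entry of this file
        }
    }

    // Returns {file index, local entry}. Empty files are skipped automatically.
    std::pair<std::size_t, std::uint64_t> Locate(std::uint64_t globalEntry) const {
        // First file whose end lies strictly beyond the requested entry.
        auto it = std::upper_bound(ends_.begin(), ends_.end(), globalEntry);
        if (it == ends_.end())
            throw std::out_of_range("entry number beyond the end of the chain");
        const std::size_t fileIdx =
            static_cast<std::size_t>(std::distance(ends_.begin(), it));
        const std::uint64_t fileStart = (fileIdx == 0) ? 0 : ends_[fileIdx - 1];
        assert(globalEntry >= fileStart); // invariant of the cumulative sums
        return {fileIdx, globalEntry - fileStart};
    }

private:
    std::vector<std::uint64_t> ends_;
};
```

With entry counts {100, 0, 50}, chain-wide entry 100 would land in the third file at local entry 0. The lookup is a binary search over cumulative sums, so it stays cheap even for chains with many files.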

Cheers,
Enrico


Hi,
I have not forgotten about this. It turns out that we are missing a couple of ingredients to implement this properly (e.g. being able to access the local tree entry number from RDF, and being able to enter entries in a TEntryList specifying the tree by name/path rather than by pointer), so I have to implement those first :)
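As a rough picture of how those two ingredients could fit together in a multithreaded loop, here is a plain C++ sketch, not ROOT code: each processing slot records (file path, local entry) pairs in its own container, and the partial results are merged once the loop is over. PerSlotSelection, Enter and Merge are hypothetical names, and the sketch assumes the event loop can supply a slot index, the current file path and the local entry number.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical container (not a ROOT class): selected local entries per file path.
using EntriesByFile = std::map<std::string, std::vector<std::uint64_t>>;

// One accumulator per processing slot, so threads never write to shared state.
class PerSlotSelection {
public:
    explicit PerSlotSelection(unsigned int nSlots) : slots_(nSlots) {}

    // Called from the event loop; each slot only touches its own map.
    void Enter(unsigned int slot, const std::string &filePath, std::uint64_t localEntry) {
        assert(slot < slots_.size());
        slots_[slot][filePath].push_back(localEntry);
    }

    // Called once after the loop, from a single thread.
    EntriesByFile Merge() const {
        EntriesByFile merged;
        for (const auto &slot : slots_) {
            for (const auto &kv : slot) {
                auto &dst = merged[kv.first];
                dst.insert(dst.end(), kv.second.begin(), kv.second.end());
            }
        }
        // Slots may have seen entries of the same file in any order.
        for (auto &kv : merged)
            std::sort(kv.second.begin(), kv.second.end());
        return merged;
    }

private:
    std::vector<EntriesByFile> slots_;
};
```

Keying by file path rather than by tree pointer matters here because each thread would open its own copy of the input, so pointers differ between threads while paths do not.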

Thanks for the update. I think if you manage to make this work it could be a very nice feature: it would let me build event selections and save them as entry lists, so that subsequent runs of my processing loop much faster, without having to restrict myself to a single thread to build the initial entry list.

Sure! One piece of the puzzle is already in; the rest is work in progress :)