Filtering a TTree (via RDataFrame) with indexed branches

efernandes · December 21, 2022, 5:25pm

Dear experts,

I have a few related questions. Before asking them, a bit of context. I have a TTree with many leaves/objects (more than one billion) and just a few branches. My branches store categorical data (I'm using strings now, but I can convert them to integer codes). I want to query this large tree for a few objects based on the values in the branches. When I do this using RDataFrame.Filter for each object (something like b1=="a" && b2=="b" && b3=="c"), it takes too long, and I need to speed it up. Here is what I have considered so far and would like to ask about:

Is TTree::BuildIndex(…) adequate to speed up queries? More specifically, would RDataFrame::Filter() use it automatically? If yes, can I build an index based on only one branch? I suppose I cannot, because I have seen in some documentation that an index must give unique keys for each leaf. In my case, this won't happen.
In case the idea above does not work (as I suspect), is there any functionality in ROOT that can be used to index a tree in order to speed up the Filter method?
Another idea I had was to make my data "wide". Since ROOT is good at working with a subset of the branches, I could create a branch for each possible value of my first branch (b1, for instance). I would end up with a few million branches (!!!). Does ROOT support so many branches? Is there a practical limit on the number of branches?
If the idea above is feasible, would it actually help speed up the Filter method?

Best regards,
Eraldo.

couet · January 5, 2023, 10:22am

Welcome to the ROOT forum.

I guess @eguiraud can help.

eguiraud · January 5, 2023, 4:15pm

Hi Eraldo,

welcome to the ROOT forum and sorry for the high latency (Christmas break).

I guess you mean more than one billion entries (aka events).

Ok, let’s start from here: what’s “too long”? How many entries per second are we talking about? You can activate RDF execution logs to retrieve the setup time and the running time of RDataFrame (I recommend using v6.26.10 for this). Other important info: what ROOT version is this with? Are you using Python, compiled C++ or interpreted ROOT macros? How many cores?

And the most important thing with RDataFrame: are you sure you are running all your queries in a single event loop (by booking everything you want to do first, and accessing any of the booked results later) rather than running a loop over data for each Filter? You can check the number of event loops a dataframe has run with df.GetNRuns() – ideally it should be 1.

About your other questions:

BuildIndex is used to access entries of a second tree, indexing them based on some values in the main tree, so it does not apply here as far as I understand. It does not have anything to do with the SQL concept of indexing.
In general, for each query, you can run it once and store the entry numbers of the entries that pass the selection in a TEntryList. Subsequently you can use TTree+TEntryList and the tree will only loop over the entries listed in the TEntryList. If you run many queries at a time, though, each will have a different entry list, and this method becomes impractical.
The TTree format is really designed to be used the way you are already using it; I doubt the performance with millions of branches would be better.

I propose to clarify what slow means in this case and how many event loops you are performing, then move from there. See also "Executing multiple actions in the same event loop" in the ROOT::RDataFrame class reference.

Cheers,
Enrico

efernandes · January 11, 2023, 9:19am

Exactly!

Ok. That is a bunch of valuable information for me. I’ll investigate those and follow up here if necessary.

Thanks a lot, @eguiraud!