I found a performance issue when doing something like
for (int i = 0; i < nEvent; ++i) {
  df.Range(eventEntry[i], eventEntry[i+1]).Foreach(...);
}
The data of an event consists of a contiguous set of entries, so I process them as above. But I found that the larger the starting entry of the range, the longer the processing takes. For example,
df.Range(0, 10000).Foreach(...)
is much faster than
df.Range(9990000, 10000000).Foreach(...)
even though both ranges contain the same number of entries.
Is this performance degradation expected, or am I doing something wrong here?
The Range operation is useful for restricting the processing to a limited number of events, and it is only available when using a single thread. It is therefore mostly intended for debugging and exploration rather than for full-scale processing or performance benchmarking.

In particular, the effect you are noticing is part of how Range works: the dataset is still traversed entry by entry from the very beginning, irrespective of the begin value you pass to Range. The operation only ensures that the entries in [begin, end) are actually processed by your actions, but the event loop still starts from entry 0. Thus, the larger the begin value, the longer it takes RDataFrame to reach that entry number.