I found a performance issue when doing something like
for (int i = 0; i < nEvent; ++i) {
  df.Range(eventEntry[i], eventEntry[i+1]).Foreach(...);
}
The data of an event consists of a contiguous set of entries, so I process them as above. But I found that the larger the starting entry of the range, the longer the processing takes. For example,
df.Range(0, 10000).Foreach(...)
is much faster than
df.Range(9990000, 10000000).Foreach(...)
even though both ranges contain the same number of entries.
Is this performance degradation expected, or am I doing something wrong here?
The Range operation is useful for restricting the processing to a limited number of events, and it is only available when using a single thread. It is therefore mostly intended for debugging and exploration rather than for full-scale processing or performance benchmarking.

In particular, the effect you are noticing is part of how Range works: the dataset is still traversed entry by entry from the very beginning, irrespective of the begin value you pass to Range. The operation only ensures that the entries in [begin, end) are actually processed by your actions, but the event loop still starts from entry 0. Thus, the larger the begin value, the longer it takes RDataFrame to reach that entry number.