Accessing individual entries with RDataFrame

FoxWise · March 12, 2024, 6:23pm

Hi all,

Is there a way to check a single event/entry/raw with RDataFrame?
I want something like this:

df = RDataFrame("tree", "file.root")
my_lovely_entry = df.Filter("rdfentry_ == 42").AsNumpy()

But looking at the example above, it seems to loop over all entries.
I have a big RDataframe with ~ 1e7 entries.
Thus, I am wondering if there is any faster way to do the same.

Danilo · March 12, 2024, 7:43pm

Hi Bohdan,

Thanks for the interesting post.
I think a way in which you can efficiently “pre-filter” your large dataset is to use the Range transformation. Have you tried that?

I hope that helps.

Cheers,
Danilo

FoxWise · March 13, 2024, 1:45pm

Hi @Danilo,

Now, I have tried that.

# ROOT.EnableImplicitMT()
start_time = time.time()
shower = df.Range(42, 42+1).AsNumpy(myColumns)
print(f"Range method: {time.time() - start_time:.2f} sec")
start_time = time.time()
shower = df.Filter("rdfentry_ == 42").AsNumpy(myColumns)
print(f"Filter method: {time.time() - start_time:.2f} sec")

I consistently get the Filter method faster
Funny, if I test Filter alone with Multithreading on it gets slower

Range method: around 6.19 sec
Filter method: around 4.69 sec
Filter method with ROOT.EnableImplicitMT() on a machine with 20 cores: around 6.7 - 7 sec
Filter method with ROOT.EnableImplicitMT(4) gives the same: around 6.7 - 7 sec

I have run the test 3-5 times, and fluctuations seem to be ±0.1 sec

So, my conclusion so far is to use Filter and disable multithreading. Again, this puzzles me as the most optimal method.

Danilo · March 13, 2024, 6:51pm

Hi,

I think the reason is that with Range, you read much less data and jump directly to the region of the dataset you need, avoiding decompressing and other overheads.

Happy the solution worked for you!

Cheers,
Danilo

FoxWise · March 13, 2024, 7:02pm

To ensure we are on the same page:

Using Range turned out to be slower by 50% than Filter…

It is acceptable for me to wait 4 or 6 seconds anyway.
But I was wondering if there any faster way.

Using Range only slows things down

system · March 27, 2024, 7:02pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.