RDataFrame seems too conservative about spawning new threads

I’m using RDataFrame to build a data analysis chain, and each event requires a very time-consuming kinematic fit. In this situation, having more threads available would significantly improve overall performance.

However, during testing I found that when the total number of events is not large enough, RDataFrame appears to run in a single thread and does not spawn additional threads at all (even with ROOT::EnableImplicitMT() enabled).

I understand that this behaviour is generally reasonable when per-event processing is fast, since the overhead of creating threads and managing locks can outweigh the benefits. RDataFrame likely avoids multithreading in such cases to reduce unnecessary overhead.

But is there any way to bypass this and force RDataFrame to always use the maximum available concurrency, regardless of how many entries are present?

I guess @vpadulan can help.

Dear @karuboniru ,

Thank you for reaching out to the forum!

I may be missing some details of your situation, but I’ll still try to offer some considerations. From the following line

I found that when the total number of events is not large enough,

I infer that you are using RDataFrame to process a single file, with one dataset inside, that has a small number of entries. A ROOT dataset is internally partitioned into “groups of events”, so-called clusters, which are compressed independently on disk. For the TTree data format, the default size of a cluster is 30 MB. This is a physical characteristic of the file: nothing can (efficiently) read more than one cluster in parallel at a time, and that includes RDataFrame. At the extreme, if your dataset has only one cluster, you will never be able to process it with more than one thread; there is simply no parallel work to distribute.

Given the above, can you describe your situation in more details? How many files? Size of each file? Snippet of code you’re using?

Cheers,

Vincenzo

Exactly, I am now working on a small dataset: a single file of less than 60 MiB. So I think I am facing exactly the situation you mentioned. Is the only workaround then to split the file into smaller ones (e.g. one file per core) and pass them to RDataFrame?

Dear @karuboniru ,

You can use Snapshot to save a new dataset that has very small clusters, e.g.

import ROOT

opts = ROOT.RDF.RSnapshotOptions()
opts.fAutoFlush = N  # choose the number of events you want in each cluster
df.Snapshot("mydataset", "myfile.root", "", opts)

And then when you process that file you’ll be able to parallelise more (depending on what number of events you chose).

That being said, I find it hard to believe that a file of only 60 MiB can benefit at all from parallelism. I would be very curious to take a look at the code. In case it’s public, would you mind sharing a link to it?

Cheers,

Vincenzo


Yes, my code runs a kinematic fit on top of a toy MC as a benchmark ( pdk2025/record_pdk.cxx at kf_augmented · karuboniru/pdk2025 · GitHub ), and each fit ( pdk2025/src/kf.cxx at kf_augmented · karuboniru/pdk2025 · GitHub ) takes on the order of milliseconds to finish.

Dear @karuboniru ,

Thanks for the pointer! I understand, it’s quite an involved operation. Let me know if you see any improvement with the smaller clusters.

Cheers,

Vincenzo


Yes, now the code works exactly as I’d expect.
