Output overhead from Snapshot method in RDataframe with parallel processing?

alcaraz · November 15, 2022, 10:11am

While writing filtered events from a NanoAOD of the CMS experiment I have found that the Snapshot method writes an excessive overhead when using parallel processing. I am attaching a simple example below. The overhead actually increases with the number of cores used in parallel. I have tried many different ROOT versions, including recent ones. The total number of events is 1831637, and the number of events written is 1750, Is this a bug or a (not nice) known feaure? Thanks.

_ROOT Version: 6.24/06
_Platform: Linux Centos7 CERN
_Compiler: gcc 8.3.0, Python 3.9.6

Size of output file:
1 core: 3.9 MB
4 cores: 22 MB
8 cores: 37 MB

skim_MuMu_minimal.py (936 Bytes)

eguiraud · November 15, 2022, 11:30am

Hi @alcaraz ,

thank you for the report and the reproducer, this is not a known issue, I will take a look as soon as possible.

Cheers,
Enrico

eguiraud · November 15, 2022, 1:05pm

P.S.

any way you can share the input file with me, even privately? I don’t have access.

eguiraud · November 16, 2022, 1:15pm

Update: I can reproduce the issue also with the current master branch, I’ll let you know when I have more details. Thank you for reporting this!

eguiraud · November 18, 2022, 4:22pm

Hi @alcaraz ,

the situation is the following:

the output file has ~1500 branches and only 1750 events
for each batch (basket) of branch values written out, there is a cost of 70-100 bytes per branch occupied by I/O metadata (the corresponding TKey)
by default RDataFrame (or better TTreeProcessorMT, the engine which RDF uses for multi-thread event loops) creates 10 tasks per worker thread, meaning it splits the inputs ~10 times per worker thread, so more threads → finer-grained splitting
Snapshot must write out at least a batch of branch values per input split/task

So with more threads, 2 effects contribute to the larger output file size:

more threads → more tasks → more (and smaller) output batches per branch → more TKey overhead: e.g. with your input I get ~15k output batches/baskets (roughly 10 per branch) with one thread, and ~60k with 4 threads. Given a rough estimate of 80 bytes per TKey, that’s ~3.5 extra MBs in the output file
more threads → more tasks → smaller batches written out → in your case, with so few output entries, very small batches written out → horrible compression ratios. With your example file, the average compression ratio of a basket in the output file when using 1 thread is 2.2, when using 4 threads it’s 1.3

There are several mitigations:

the problem is only this severe with such a small number of output events, where the output batches/baskets are small w.r.t. their TKey size and don’t compress well. Larger outputs won’t see anything so dramatic
the situation should be better with ROOT v6.26 where we reduced the default number of splits per thread
you can manually reduce the number of tasks per thread: the problem goes away in your example by adding ROOT.TTreeProcessorMT.SetTasksPerWorkerHint(int((4 + ncores - 1) / ncores)) before the EnableImplicitMT call, which sets the total number of tasks to 4 independently of the number of threads (you can probably come up with a better magic number for your particular case – in general it’s better to have a few tasks per thread)

I hope this helps!
Cheers,
Enrico

alcaraz · November 21, 2022, 9:13am

Hello @eguiraud,

Thanks for the fast reaction and for the fix.

If I understand correctly the overhead is effectively caused by the huge number of branches present in the NanoAOD. I believe that it was done in CMS for “user friendly” reasons (accessing trivially the meaning of all trigger bits), although it is “technically” sub-optimal for a small number of events (<~10000). It seems to me from some additional tests that at the end of the day the optimal number of cores and tasks to reduce both processing time and output file size will depend on the details of the input file structure and the expected size of the output ROOT files. In this particular case 1 task/worker seems to be optimal for multicore parallel processing.

Best,
Juan

eguiraud · November 21, 2022, 9:46am

Yes, more precisely it’s the combination of many branches and few entries.

system · December 5, 2022, 9:46am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.