RDataFrame performance for branch reduction and event filtering (slimming and skimming)

apetukho · November 24, 2022, 2:50pm

Dear ROOT experts,

I’m trying to perform the following operations of the ntuples of my analysis:

Select subset of branches (i.e slimming)
Modify the weights to include the normalization coefficients
Perform event selection (i.e. skimming)

I’ve previously had problems processing multiple small files using RDataFrame, but now I’ve noticed some strange performance on the big files as well and want to ask some questions about it.

The scripts I’m using can be found in this cernbox folder.
CreateInput.py creates a 4.2 GB file with 359 branches similar to that I have in the real analysis. The file itself is also included in that folder.
RDFBenchmarkDummy.py checks the time to run the analysis code in 3 ways:

With just slimming
With slimming and weight modification.
With slimming, weight modification and event selection set up to select ~5% of the events.

The time is averaged over 3 tries.

I’m getting the following results on my machine

(1)
[42.19031023979187, 40.98794865608215, 38.83463215827942]
Snapshot:	 40.67+-1.70 s

(2)
[45.55086946487427, 48.206947326660156, 47.1492977142334]
Snapshot:	 46.97+-1.34 s

(3)
[9.084217071533203, 8.745355606079102, 8.877978801727295]
Snapshot:	 8.90+-0.17 s

I’ve compared the results to the output of the root-readspeed run with the following command that accesses the same branches used in the RDFBenchmarkDummy.py:

./root-readspeed --trees tree_3lCR_PFLOW --files Heavy_d.root --branches leading_pT_lepton subleading_pT_lepton n_bjets n_jets event_3CR event_type met_tst met_signif dMetZPhi MetOHT dLepR Z_pT M2Lep mT_ZZ Z_rapidity frac_pT sumpT_vector sumpT_scalar ZpTomT weight weight_EL_EFF_ID_CorrUncertaintyNP0__1down weight_EL_EFF_ID_CorrUncertaintyNP0__1up weight_EL_EFF_ID_CorrUncertaintyNP10__1down weight_EL_EFF_ID_CorrUncertaintyNP10__1up weight_EL_EFF_ID_CorrUncertaintyNP11__1down weight_EL_EFF_ID_CorrUncertaintyNP11__1up weight_EL_EFF_ID_CorrUncertaintyNP12__1down weight_EL_EFF_ID_CorrUncertaintyNP12__1up weight_EL_EFF_ID_CorrUncertaintyNP13__1down weight_EL_EFF_ID_CorrUncertaintyNP13__1up weight_EL_EFF_ID_CorrUncertaintyNP14__1down weight_EL_EFF_ID_CorrUncertaintyNP14__1up weight_EL_EFF_ID_CorrUncertaintyNP15__1down weight_EL_EFF_ID_CorrUncertaintyNP15__1up weight_EL_EFF_ID_CorrUncertaintyNP1__1down weight_EL_EFF_ID_CorrUncertaintyNP1__1up weight_EL_EFF_ID_CorrUncertaintyNP2__1down weight_EL_EFF_ID_CorrUncertaintyNP2__1up weight_EL_EFF_ID_CorrUncertaintyNP3__1down weight_EL_EFF_ID_CorrUncertaintyNP3__1up weight_EL_EFF_ID_CorrUncertaintyNP4__1down weight_EL_EFF_ID_CorrUncertaintyNP4__1up weight_EL_EFF_ID_CorrUncertaintyNP5__1down weight_EL_EFF_ID_CorrUncertaintyNP5__1up weight_EL_EFF_ID_CorrUncertaintyNP6__1down weight_EL_EFF_ID_CorrUncertaintyNP6__1up weight_EL_EFF_ID_CorrUncertaintyNP7__1down weight_EL_EFF_ID_CorrUncertaintyNP7__1up weight_EL_EFF_ID_CorrUncertaintyNP8__1down weight_EL_EFF_ID_CorrUncertaintyNP8__1up weight_EL_EFF_ID_CorrUncertaintyNP9__1down weight_EL_EFF_ID_CorrUncertaintyNP9__1up

And have got the following output, once again after 3 tries:

1 thread(s), real time: 6.46+-0.19

I’ve got the following questions:

Why does the simple slimming takes 6 times more real time than the root-readspeed estimate? I would expect it to be 2.5-3 times more (one time to read, one time to write plus some more time for the graph compilation) but not that much
While this reproducer is only showing difference of 1.36 with the root-readspeed I’m actually getting the difference of 1.7-2 times with the real datafiles. Since the data is read only once (i.e. no taking advantage of caching) and stored on the remote filesystem (eos) due to it size, the performance is bottlenecked with the read/write speed, so even the factor of 2 takes makes the conversion process very long.
(a) What could be the difference with the real files compared to this? The compression is actually better for the real files.
(b) If the slimming is sped up is it possible to speed up the final (3) version of the code?

Thanks in advance,
Aleksandr

ROOT Version: 6.26/10
Platform: Ubuntu 18.04 on local machine with i5-8250U CPU and SSD drive
Compiler: prebuilt binary

eguiraud · November 24, 2022, 3:27pm

Hi @apetukho ,

before diving into this (I’ll try to find some time tomorrow ), you mentioned the data is stored on EOS. Are you reading it as /eos/user/e/.../yourfile.root or as root://.../yourfile.root? The latter should be much faster.

Cheers,
Enrico

apetukho · November 24, 2022, 3:38pm

Hi @eguiraud ,

Thank you for looking into it. The code in this topic is uses the local files, since it was hard to get a stable estimate on with the code/root-readspeed on lxplus and the files on the eos (which is the real situation since there’s a from 65 GB to 2 TB files to slim/skim). But I do use the /eos/... way, so I’ll try switching to the root://... one. Why is it faster, though?

Best,
Aleksandr

eguiraud · November 24, 2022, 3:50pm

/eos goes through the network-mounted filesystem, root:// goes through the xrootd protocol. The latter is specialized for reading ROOT files and performs less calls and just for the necessary bytes. Spread the word! Future ROOT versions will actually switch to root:// under the hood by default, but not v6.26. For now you can check how much difference it makes using root-readspeed (which will be included in ROOT v6.28, by the way).

About your questions, it will be hard to answer to 2.a without doing a comparison study in similar conditions to those in which you actually run your code, but I’ll see whether I can figure out what’s going on for 1. and 2.b. It could just be that things are working as intended, RDF is known to have a certain overhead compared to root-readspeed (the price of nice features) – but it could also be that there are some easy performance gains possible.

Cheers,
Enrico

eguiraud · November 28, 2022, 1:21pm

Hi Aleksandr,

I took a look at the reproducer you shared and could not see anything out of the ordinary. As I mentioned RDF has some overhead on top of rootreadspeed, and then on top of that the RDF slimming simply does more than rootreadspeed: it doesn’t only read but it also processes data and writes it out again. Also at least in this case writing actually takes longer than reading, and most of the writing time is spent compressing the data, so you can speed up the slimming by picking a faster compression algorithm, e.g. adding opts.fCompressionAlgorithm = ROOT.ECompressionAlgorithm.kZSTD to the snapshot options I see runtimes go down from ~30s to ~20s.

Cheers,
Enrico

P.S.
I would be curious to know what kind of speed-up using root:// instead of /eos gives you

apetukho · November 28, 2022, 2:16pm

Hi Enrico,

Thank you for looking into it and for the detailed explanation! I guess this is the best I can do with a large file processing, so I better switch to other problems like processing multiple small files. Even with collecting multiple Snapshot handles and using RunGraphs I get a run time not proportional to the size of the files, so expect another thread soon
Still, where can I learn more about the available compression algorithms and their pros and cons?

Unfortunately, the XRootD tests have only shown that I have no idea how the lxplus + eos combination works or how to test it properly. I’ve ran the same root-readspeed commands from a clean login 3 times for the same setup with 1 thread and got the following results for the real time:

/eos/
[83.4, 43.8, 54. 8] s
60.6 +- 20.5 s

root://
[65.1, 26.2, 69.5] s
53.6 +- 23.8 s

I hope the improvement will be more clear in the full data processing.

eguiraud · November 28, 2022, 2:33pm

Mmmh you are dominated by noise and/or other latencies. root:// should consistently demonstrate the smallest minimum runtimes though.

@pcanal might be able to point you to some docs. The available algorithms are kZLIB, kLZMA, kLZ4, kZSTD. My understanding is the following: LZ4 is faster but compresses less well, LZMA is slower but compresses more, ZLIB and ZSTD are a bit of a middle ground but ZSTD is newer and typically closer to LZ4 speed while still having good enough compression.

Just a wild guess, but if the files are very small, the overhead of opening and closing the file might become important w.r.t. the actual event loop time. I’m happy to take a look though, “runtime does not scale linearly with number of events” is something that we should have a good explanation for Certainly the expectation is that it is!

Cheers,
Enrico

system · December 12, 2022, 2:34pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.