ROOT Version: 6.26/10
Platform: Ubuntu 18.04 on local machine with i5-8250U CPU
Compiler: prebuilt binary
Dear ROOT experts!
I want to ask a question about RDataFrame performance when dealing with a lot of small files.
I’ve attached the files that mimic the logic of my actual analysis code. I’ve got processes that are split into multiple sample files, each containing 3 trees and a histogram with normalization information. For every process I want to:
1. Select events, with slight variations for every tree
2. Select branches and compute new variables, with slight variations between trees
3. Modify the event weight with a distinct normalisation coefficient for every sample
Requirement (3) prevents me from joining all of the samples of one process into a single TChain. Instead, I need to process every sample individually. This greatly reduces the performance of my code when dealing with processes that consist of multiple small files, even when I’m using the lazy Snapshot() and RunGraphs() functionality (a minimal sketch of the booking pattern is shown below).
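For context, the per-sample booking follows the usual lazy-Snapshot pattern. Here is a minimal sketch with made-up file names, selections, normalisation coefficients and a `weight` column (not the actual ConvertTree.py code):

```python
import ROOT

opts = ROOT.RDF.RSnapshotOptions()
opts.fLazy = True  # book the Snapshot without triggering the event loop

handles = []
# hypothetical samples: (file name, per-sample normalisation coefficient)
for fname, norm in [("sample1.root", 0.8), ("sample2.root", 1.2)]:
    for tree in ["tree_a", "tree_b", "tree_c"]:  # the 3 trees per file
        df = (ROOT.RDataFrame(tree, fname)
                .Filter("pt > 30")                           # per-tree variation
                .Define("weight_norm", f"weight * {norm}"))  # requirement (3)
        out = fname.replace(".root", f"_{tree}_skim.root")
        handles.append(df.Snapshot(tree, out, ["pt", "weight_norm"], opts))

# all booked graphs are run concurrently in this one call
ROOT.RDF.RunGraphs(handles)
```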
To be noted: this code is to be run on lxplus, accessing the files on eos (via root:// as advised here), but the benchmarks are done on a local machine since it seems to provide much more stable results.
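(For completeness, on lxplus the inputs are opened remotely along these lines; the path is a made-up example:)

```python
import ROOT

# hypothetical EOS path, read via the XRootD protocol
url = "root://eosuser.cern.ch//eos/user/a/auser/samples/sample1.root"
df = ROOT.RDataFrame("tree_a", url)
```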
Here’s the code that mimics my overall analysis structure.
ConvertDatasets.py (8.7 KB)
ConvertTree.py (4.3 KB)
RdfHelpers.py (1.7 KB)
The handling of the processes is done with ConvertDatasets.py, which uses ConvertTree.py as the code for generating the Snapshot() handles. RdfHelpers.py contains some C++ functions to speed up some of the calculations.
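For readers unfamiliar with the pattern: such helpers are typically injected once through the interpreter and then called from jitted Define() strings. A self-contained sketch (the function and the toy columns are made up, not the actual contents of RdfHelpers.py):

```python
import ROOT

# declare a C++ helper once; it becomes callable from jitted RDF expressions
ROOT.gInterpreter.Declare("""
#include "Math/Vector4D.h"
float InvariantMass(float pt1, float eta1, float phi1, float m1,
                    float pt2, float eta2, float phi2, float m2) {
    ROOT::Math::PtEtaPhiMVector p1(pt1, eta1, phi1, m1);
    ROOT::Math::PtEtaPhiMVector p2(pt2, eta2, phi2, m2);
    return (p1 + p2).M();
}
""")

# toy dataframe just to demonstrate the call from a jitted Define()
df = (ROOT.RDataFrame(100)
        .Define("pt1", "30.f").Define("eta1", "0.5f")
        .Define("phi1", "0.1f").Define("m1", "0.000511f")
        .Define("pt2", "25.f").Define("eta2", "-0.3f")
        .Define("phi2", "2.0f").Define("m2", "0.000511f")
        .Define("m_ll", "InvariantMass(pt1, eta1, phi1, m1, pt2, eta2, phi2, m2)"))
print(df.Mean("m_ll").GetValue())
```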
I’ve tested this code on 3 types of processes (that I’m unfortunately not able to share publicly at the moment, but can share via PM):
| Process | Files | Input events | Passed events | Selection efficiency |
|---|---|---|---|---|
| 1 | 9 | 5 917 598 | 192 944 | 3.26% |
| 2 | 84 | 15 011 | 328 | 2.19% |
| 3 | 6 | 2 167 265 | 1 117 982 | 51.58% |
Process 2 even has some of its trees empty.
Running ConvertDatasets.py with 1 thread and averaging over 3 runs yields the following results:
======>Mean performance<==========
Process 1
Number of files: 9
Input events: 5 917 598.0
Passed events: 192 944.0
Selection efficiency: 3.26%
Setting up handles: 1.7929 +- 1.9828 s
RDF RunGraphs(): 31.7754 +- 2.6518 s
Get Sum() and Count(): 0.0191 +- 0.0261 s
Merge trees: 0.3354 +- 0.0166 s
Total: 33.9228 +- 4.6773 s
Process 2
Number of files: 84
Input events: 15 011.0
Passed events: 328.0
Selection efficiency: 2.19%
Setting up handles: 3.0023 +- 0.3269 s
RDF RunGraphs(): 29.7169 +- 3.1373 s
Get Sum() and Count(): 0.0166 +- 0.0017 s
Merge trees: 0.1429 +- 0.0026 s
Total: 32.8787 +- 3.4599 s
Process 3
Number of files: 6
Input events: 2 167 265.0
Passed events: 1 117 982.0
Selection efficiency: 51.58%
Setting up handles: 0.4955 +- 0.0469 s
RDF RunGraphs(): 59.6083 +- 2.3774 s
Get Sum() and Count(): 0.0004 +- 0.0000 s
Merge trees: 2.0597 +- 1.0171 s
Total: 62.1638 +- 1.3134 s
Total time for ConvertDatasets(): 128.9653 +- 6.8237 s
There are a few things to note:
- The run time per process depends on the final file size, not on the initial one (this was a surprise for me). Process 1 initially has more events over its 3 trees than Process 3, but more events pass the filters in Process 3, hence the higher processing time. But I guess it mirrors the solution to my previous question.
- Process 2, while significantly smaller than either of the other two, takes a comparable time to process.
When increasing the thread count to 4 (my machine only has 4 physical cores), the overall time does go down, but the proportion between processes stays the same.
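(The thread count in these benchmarks is set with implicit multi-threading before any dataframe is constructed:)

```python
import ROOT

ROOT.EnableImplicitMT(4)  # calling it with no argument would use all available cores
```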
2 threads
======>Mean performance<==========
Process 1
Number of files: 9
Input events: 5 917 598.0
Passed events: 192 944.0
Selection efficiency: 3.26%
Setting up handles: 1.8021 +- 1.9737 s
RDF RunGraphs(): 25.6703 +- 5.9387 s
Get Sum() and Count(): 0.0174 +- 0.0235 s
Merge trees: 0.4043 +- 0.0392 s
Total: 27.8941 +- 7.9750 s
Process 2
Number of files: 84
Input events: 15 011.0
Passed events: 328.0
Selection efficiency: 2.19%
Setting up handles: 3.0808 +- 0.2372 s
RDF RunGraphs(): 31.0479 +- 2.9642 s
Get Sum() and Count(): 0.0176 +- 0.0026 s
Merge trees: 0.1823 +- 0.0414 s
Total: 34.3286 +- 3.2402 s
Process 3
Number of files: 6
Input events: 2 167 265.0
Passed events: 1 117 982.0
Selection efficiency: 51.58%
Setting up handles: 0.8561 +- 0.0197 s
RDF RunGraphs(): 35.4611 +- 2.1408 s
Get Sum() and Count(): 0.0005 +- 0.0000 s
Merge trees: 1.5322 +- 0.0245 s
Total: 37.8498 +- 2.1360 s
Total time for ConvertDatasets(): 100.0725 +- 9.0792 s
4 threads
======>Mean performance<==========
Process 1
Number of files: 9
Input events: 5 917 598.0
Passed events: 192 944.0
Selection efficiency: 3.26%
Setting up handles: 1.7606 +- 1.8464 s
RDF RunGraphs(): 20.1267 +- 3.8619 s
Get Sum() and Count(): 0.0189 +- 0.0252 s
Merge trees: 0.4359 +- 0.0669 s
Total: 22.3421 +- 5.8004 s
Process 2
Number of files: 84
Input events: 15 011.0
Passed events: 328.0
Selection efficiency: 2.19%
Setting up handles: 3.0255 +- 0.3126 s
RDF RunGraphs(): 29.6785 +- 2.1564 s
Get Sum() and Count(): 0.0302 +- 0.0151 s
Merge trees: 0.1906 +- 0.0298 s
Total: 32.9248 +- 2.5140 s
Process 3
Number of files: 6
Input events: 2 167 265.0
Passed events: 1 117 982.0
Selection efficiency: 51.58%
Setting up handles: 0.8742 +- 0.0785 s
RDF RunGraphs(): 26.1213 +- 0.9291 s
Get Sum() and Count(): 0.0004 +- 0.0001 s
Merge trees: 2.1728 +- 0.9996 s
Total: 29.1687 +- 2.0071 s
Total time for ConvertDatasets(): 84.4356 +- 6.3073 s
I have also tried to run all of the Snapshot() handles in a single RunGraphs() call using this code:
ConvertDatasetsSingleLoop.py (7.1 KB)
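The only structural change is that the lazy Snapshot() handles are accumulated across all processes and triggered together; schematically (book_process() is a hypothetical stand-in for the per-process booking shown earlier):

```python
import ROOT

all_handles = []
for process in all_processes:             # hypothetical per-process configs
    all_handles += book_process(process)  # returns the lazy Snapshot() handles

ROOT.RDF.RunGraphs(all_handles)  # one call runs all event loops concurrently
```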
1 thread
Separate RunGraphs() per process: 128.9653 +- 6.8237 s
One RunGraphs() for all processes: 110.3038 +- 5.2638 s
2 threads
Separate RunGraphs() per process: 100.0725 +- 9.0792 s
One RunGraphs() for all processes: 95.7382 +- 4.9013 s
4 threads
Separate RunGraphs() per process: 84.4356 +- 6.3073 s
One RunGraphs() for all processes: 67.3011 +- 5.9421 s
Though there is an improvement compared to the previous version, Process 2 still seems to take up way more time than expected. This might not seem like much right now, but this code is to be run for every systematic variation, resulting in hundreds of calls to ConvertDatasets(), so it will quickly add up.
So my question is: is there anything that I’ve missed that would help process this kind of process in my setup (i.e. being unable to create one Snapshot() for all of the sample files in a process)? Or will any further improvement be incremental, while requiring me to abandon the simplicity of the Python code for the speed and complexity of C++?
Thanks in advance,
Aleksandr