ROOT Version: 6.16/00
Platform: Ubuntu 16.04
Dear ROOT experts,
I’m using PyROOT with RDataFrame for experimental data processing. I need to do fairly basic actions, like in the latest CERN Open Data RDataFrame examples (e.g. [1], [2]):
- Make some event selections.
- Apply weight corrections based on the sum of MC weights, cross-section and luminosity.
- Instead of making histograms, transform variables for further work (a minimal sketch follows this list).
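For reference, one conversion in my mock-up looks roughly like this; the tree name, the cut and the column names below are placeholders rather than my real ones:

```python
import ROOT

RDataFrame = ROOT.ROOT.RDataFrame

def convert_file(in_path, out_path, scale):
    """Select events, attach the normalization weight, transform variables
    and write a new tree (no histograms). All names are illustrative."""
    df = RDataFrame("mini", in_path)
    df = df.Filter("lep_n == 2", "two leptons")            # event selection
    df = df.Define("weight", "mcWeight * %.10g" % scale)   # normalization weight
    df = df.Define("lep_pt0", "lep_pt[0]")                 # variable transformation
    # write only the transformed columns to a new tree
    cols = ROOT.std.vector('string')()
    for c in ("weight", "lep_pt0"):
        cols.push_back(c)
    df.Snapshot("mini", out_path, cols)
```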
Each process is split into three periods with different luminosities, and each period can be further split into subprocesses with different cross-sections and MC weight sums.
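In other words, every (period, subprocess) pair gets its own per-event scale factor, computed along these lines (the numbers here are made up):

```python
def scale_factor(lumi, xsec, sum_weights):
    # per-event multiplier: cross-section * luminosity / sum of MC weights
    return xsec * lumi / sum_weights

scale = scale_factor(lumi=10.064e3, xsec=0.027, sum_weights=1.4e6)  # made-up values
```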
However, I have found this to be a fairly slow way to work with ROOT files. It is especially slow when a process is made up of a few subprocesses, each containing only 1-100 events.
It also seems to gradually consume all available RAM, forcing the OS to kill the conversion process.
This gets very unwieldy when I need to work with systematic errors, because it means I need to do every conversion up to 20 times.
I was advised to move the tree conversion function into a C++ file and pull it into the Python script with ROOT.gInterpreter.Declare('#include "foo.cpp"'), but that affected neither the speed nor the memory usage.
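Concretely, what I tried looks roughly like this (the function name and signature below are just for illustration; the real code is in the attached ConvertTree_cpp.cpp):

```python
import ROOT

# JIT-compile the C++ conversion code once at startup...
ROOT.gInterpreter.Declare('#include "ConvertTree_cpp.cpp"')

# ...then call the C++ function from Python, e.g.:
# ROOT.ConvertTree("input.root", "output.root", scale)
```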
I’ve created a mock-up example of my conversion code and attached it to this post.
ConvertDataset.py (5.8 KB)
ConvertTree_cpp.cpp (1.6 KB)
The code can also be found here, together with some example input data. There are two processes, “heavy” and “light”:
- “Heavy” has only 1 file per period, with 64613, 78998 and 104645 entries.
- “Light” has 3 files per period, with (1, 79, 196), (0, 82, 196) and (1, 94, 496) entries.
I’ve run some tests, and the runtime for each combination of code and process looks like this (the numbers are runtimes in seconds):
python, "heavy" python, "light" cpp, 'heavy' cpp, "light"
15,2174670696 30,719119072 14,1762371063 29,1049787998
12,5415570736 29,5174951553 12,7764348984 29,7464039326
12,3580510616 30,4264831543 12,8252680302 36,1380038261
12,3621308804 31,402148962 12,9889249802 28,4454379082
12,4532442093 30,9754590988 13,0270299911 33,2478058338
12,5667109489 32,0482950211 12,978730917 32,8336589336
12,5887079239 35,2845959663 13,110419035 33,0987081528
12,7872169018 34,1884410381 13,2643208504 31,2558979988
12,6519219875 32,9833519459 13,6302509308 33,3314220905
12,7544009686 32,929363966 13,9931509495 32,5388319492
12,6516251564 32,9172339439 14,0106520653 28,6607189178
12,8050370216 33,8565819263 14,4325330257 29,5839531422
12,7571280003 34,3305761814 13,8887300491 31,912913084
12,9673330784 33,6310958862 13,8388259411 34,5327329636
12,8155889511 33,6902740002 14,2913339138 34,5269930363
13,0519108772 33,7768409252 14,0997800827 34,1296551228
13,0272419453 34,1902039051 13,950273037 35,7440810204
12,9802839756 33,9651200771 13,9921731949 35,3556680679
13,1217548847 33,9447009563 13,8357200623 35,6759448051
12,8862700462 36,3453910351 14,1057138443 42,7853910923
Unfortunately, I don’t know how to properly present the memory usage issue.
I’ve got the following questions:
- Is the way I’m using RDataFrame correct? Maybe there’s a better way to merge the input data before handing it to RDataFrame? Right now, having a different normalization coefficient for every process file stops me from doing so (see the sketch after these questions).
- Is there a way to speed up the conversion times and decrease the memory usage? The way it is now really hinders my ability to do the analysis.
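For context, the overall driver is schematically the following per-file loop, using convert_file from the sketch above (all file names and numbers here are made up):

```python
# One RDataFrame per input file, because each file carries its own coefficient.
lumis = {"periodA": 3.2e3, "periodB": 10.1e3, "periodC": 5.9e3}  # made-up luminosities
subprocesses = {  # (file, xsec, sum of MC weights), all made up
    "periodA": [("light_1_A.root", 0.75, 1.2e4)],
    "periodB": [("light_1_B.root", 0.75, 1.5e4)],
    "periodC": [("light_1_C.root", 0.75, 1.1e4)],
}

for period, lumi in lumis.items():
    for fname, xsec, sumw in subprocesses[period]:
        convert_file(fname, fname.replace(".root", "_out.root"), xsec * lumi / sumw)
```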
Thanks in advance,
Aleksandr