ROOT Version: 6.16/00
Platform: Ubuntu 16.04
Dear ROOT experts,
I’m using PyROOT with RDataFrame for experiment data processing. I need to do fairly basic operations, like in the latest CERN Open Data RDataFrame examples (e.g. [1], [2]):
- Make some event selections.
- Apply weight corrections based on the sum of MC weights, the cross-section, and the luminosity.
- Instead of making histograms, transform variables for further work.
 
Each process is split into three periods with different luminosities, and each period can be further split into subprocesses with different cross-sections and MC weight sums.
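For context, here is a minimal sketch of the normalization I mean for a single subprocess of a single period; the tree and branch names ("mini", "mcWeight", "lep_n", "mll") and all numbers are placeholders, not my actual setup:

```python
import ROOT

xsec = 0.75      # cross-section in pb (placeholder)
lumi = 10064.0   # integrated luminosity in 1/pb (placeholder)

df = ROOT.RDataFrame("mini", "subprocess_period1.root")

# Sum of MC weights over the whole subprocess file (triggers one event loop)
sumw = df.Sum("mcWeight").GetValue()

# Per-event normalization weight for this subprocess/period
df = df.Define("weight_norm", f"mcWeight * {xsec} * {lumi} / {sumw}")

# Selections and transformed variables instead of histograms
df = df.Filter("lep_n == 2", "exactly two leptons") \
       .Define("mll_gev", "mll / 1000.0")

# Write out the converted tree
df.Snapshot("mini", "subprocess_period1_conv.root")
```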
However, I found this to be a fairly slow way to work with ROOT files. It is especially slow when a process is made up of a few subprocesses, each containing only 1–100 events.
It also seems to gradually consume all available RAM, forcing the OS to kill the conversion process.
This gets very unwieldy once I need to work with systematic errors, because it means I need to repeat every conversion up to 20 times.
I was advised to move the tree-conversion function into a C++ file and pull it into the Python script with ROOT.gInterpreter.Declare('#include "foo.cpp"'), but that affected neither the speed nor the memory usage.
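For reference, this is roughly the setup I tried (a sketch; the tree/file names and the convert_to_gev function are placeholders standing in for the attached code):

```python
import ROOT

# Make the C++ helpers visible to cling; note the inner quotes,
# which the #include directive needs in order to parse
ROOT.gInterpreter.Declare('#include "ConvertTree_cpp.cpp"')

# Any function declared there (convert_to_gev is a placeholder name)
# can then be called inside jitted Define/Filter expressions
df = ROOT.RDataFrame("mini", "input.root")
df = df.Define("lep_pt_gev", "convert_to_gev(lep_pt)")
```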
I’ve created a mock-up example of my conversion code and attached it to this post.
ConvertDataset.py (5.8 KB)
ConvertTree_cpp.cpp (1.6 KB)
The code can also be found here, together with the C++ version and some example input data. There are two processes, “heavy” and “light”:
- “Heavy” has only 1 file per period, with 64613, 78998, and 104645 entries.
- “Light” has 3 files per period, with (1, 79, 196), (0, 82, 196), and (1, 94, 496) entries.
I’ve run some tests, and the runtimes for each combination of code and process look like this (the numbers are runtimes in seconds):
| python, "heavy" | python, "light" | cpp, "heavy" | cpp, "light" |
| --- | --- | --- | --- |
| 15.2174670696 | 30.719119072 | 14.1762371063 | 29.1049787998 |
| 12.5415570736 | 29.5174951553 | 12.7764348984 | 29.7464039326 |
| 12.3580510616 | 30.4264831543 | 12.8252680302 | 36.1380038261 |
| 12.3621308804 | 31.402148962 | 12.9889249802 | 28.4454379082 |
| 12.4532442093 | 30.9754590988 | 13.0270299911 | 33.2478058338 |
| 12.5667109489 | 32.0482950211 | 12.978730917 | 32.8336589336 |
| 12.5887079239 | 35.2845959663 | 13.110419035 | 33.0987081528 |
| 12.7872169018 | 34.1884410381 | 13.2643208504 | 31.2558979988 |
| 12.6519219875 | 32.9833519459 | 13.6302509308 | 33.3314220905 |
| 12.7544009686 | 32.929363966 | 13.9931509495 | 32.5388319492 |
| 12.6516251564 | 32.9172339439 | 14.0106520653 | 28.6607189178 |
| 12.8050370216 | 33.8565819263 | 14.4325330257 | 29.5839531422 |
| 12.7571280003 | 34.3305761814 | 13.8887300491 | 31.912913084 |
| 12.9673330784 | 33.6310958862 | 13.8388259411 | 34.5327329636 |
| 12.8155889511 | 33.6902740002 | 14.2913339138 | 34.5269930363 |
| 13.0519108772 | 33.7768409252 | 14.0997800827 | 34.1296551228 |
| 13.0272419453 | 34.1902039051 | 13.950273037 | 35.7440810204 |
| 12.9802839756 | 33.9651200771 | 13.9921731949 | 35.3556680679 |
| 13.1217548847 | 33.9447009563 | 13.8357200623 | 35.6759448051 |
| 12.8862700462 | 36.3453910351 | 14.1057138443 | 42.7853910923 |
Unfortunately, I don’t know how to properly present the memory usage issue.
I’ve got the following questions:
- Is the way I’m using RDataFrame correct? Maybe there’s a better way to merge the input data before handing it to RDataFrame? Right now, having a different normalization coefficient for every process file stops me from doing so (see the sketch after this list).
- Is there a way to speed up the conversion and reduce the memory usage? As it stands, it really hinders my ability to do the analysis.
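To illustrate the first question: because every subprocess file carries its own cross-section and weight sum, I currently end up with the pattern below, one small RDataFrame and one Snapshot per file. This is a sketch; the file list and all coefficients are made up:

```python
import ROOT

# Hypothetical bookkeeping for one period of the "light" process:
# (file name, cross-section, sum of MC weights), all placeholders
subprocesses = [
    ("light_sub1.root", 0.12, 3.4e5),
    ("light_sub2.root", 0.03, 1.1e5),
    ("light_sub3.root", 0.51, 8.9e5),
]
lumi = 10064.0  # placeholder luminosity for this period

# One event loop per tiny file, because each file needs its own
# coefficient; this per-file overhead is what I would like to merge away
for fname, xsec, sumw in subprocesses:
    df = ROOT.RDataFrame("mini", fname)
    df = df.Define("weight_norm", f"mcWeight * {xsec} * {lumi} / {sumw}")
    df.Snapshot("mini", fname.replace(".root", "_conv.root"))
```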
 
Thanks in advance,
Aleksandr