Dear experts,
We are setting up our framework using RDataFrame to study di-Higgs production in the decay channel with two b-jets and two taus. The code runs on “flat” ntuples and creates a large number of histograms in many different regions. It all runs fine, but we have some concerns about memory usage. We recently added a few more regions, which made the RDataFrame computation graphs more complicated and increased our memory consumption: the job now takes 10-20 GB of virtual memory and needs ~20 minutes to JIT. This is still without any systematics, which we have yet to include. Our worry is that including them will increase the memory to the point where the framework is no longer usable. I will briefly summarize our current workflow below. Could you then give us some advice on how to reduce the memory consumption?
We use RDataFrame (RDF) in ROOT version 6.30.02 from Python, but with functions compiled in C++, which we use in the Define calls of the RDF. The first thing we do is create an RDatasetSpec for each kind of process, e.g. data/signal/background1/background2/… This gives a total of ~30 different types of processes. For each of these processes we build an RDF analysis in which we apply filters to create new regions, calculate new variables, and book histograms in all these regions. Once we have an RDF for every process, we call RunGraphs on multiple cores on all the RDFs we created. Our understanding is that RunGraphs recognizes that different regions of the same process do not require “new” code and handles this behind the scenes. This is also the stage at which the code is JIT-compiled and the up to 20 GB of virtual memory is allocated.
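To make the workflow concrete, here is a minimal sketch of its structure. It is not our actual code: the region names, cuts, column names, histogram binnings, the tree name "events" and the `samples` mapping are all placeholders, and the Define calls with our compiled C++ functions are left out (the columns are assumed to exist already). Only `book_histograms` is exercised without ROOT; `run_all` needs a ROOT installation with RDatasetSpec support.

```python
# Sketch only: sample names, file globs, the tree name, columns and cuts are
# placeholders, not our real configuration.

# Hypothetical region definitions: region name -> filter expression.
REGIONS = {
    "signal_region": "n_bjets == 2 && tau_pair_charge == 0",
    "control_region": "n_bjets == 2 && tau_pair_charge != 0",
}

# Hypothetical histogram definitions: column -> (nbins, low, high).
HISTOGRAMS = {
    "m_bb": (50, 0.0, 500.0),
    "m_tautau": (50, 0.0, 500.0),
}


def book_histograms(df, process, regions=REGIONS, histograms=HISTOGRAMS):
    """Book one lazy histogram per (region, variable) pair for one process."""
    booked = []
    for region, cut in regions.items():
        # Each region is a named Filter node hanging off the same dataframe.
        region_df = df.Filter(cut, region)
        for column, (nbins, low, high) in histograms.items():
            name = f"{process}_{region}_{column}"
            model = (name, name, nbins, low, high)
            booked.append(region_df.Histo1D(model, column))
    return booked


def run_all(samples, tree_name="events"):
    """Build one RDF per process and run them together.

    samples: hypothetical mapping of process name -> file glob.
    """
    import ROOT  # imported here so the booking helper is usable without ROOT

    ROOT.EnableImplicitMT()
    booked = []
    for process, file_glob in samples.items():
        spec = ROOT.RDF.Experimental.RDatasetSpec()
        spec.AddSample(ROOT.RDF.Experimental.RSample(process, tree_name, file_glob))
        df = ROOT.RDataFrame(spec)
        booked += book_histograms(df, process)
    # Nothing has been computed yet; RunGraphs triggers all event loops.
    ROOT.RDF.RunGraphs(booked)
    return booked
```

With ~30 processes, each with many regions and variables, the number of booked nodes grows as processes × regions × variables, which is the part we expect to grow further once systematics are added.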
We understand that this is quite a complicated analysis; however, it is very much a realistic scenario. We are still adding things, which means the RDF computation graph will become even more complex and consume even more memory. If you are interested, we can provide more information on what we actually do. For now, do you have any tips for reducing the memory usage?
Thank you very much,
Jordy Degens
ps: we are aware of this PR that should reduce memory consumption: