I’m using RDataFrame to create histograms from a tree. Since I have many variables to plot, I loop over them to make the histograms and store them in a dict.
I found that if I compute the errors by calling Sumw2 right after getting EACH histogram, my program slows down significantly. The only way to avoid this is to fill the dict of histograms completely first, then loop over the dict and call Sumw2.
I have attached a test code [1] along with a text input [2], which creates a test ROOT file, reads the histograms, and computes the errors in the different ways described above, so you can reproduce the result. For a small ROOT file (less than 1 MB), the difference between the two methods is about 1 second (plot in [3]). Is this what we should expect?
Another question I have related to RDataFrame: if I add a new column using Define (as I did in my test code), it does not work with ROOT version 6.18 and above.
Hi,
RDataFrame is lazy: it only runs the event loop and produces the results when you access them for the first time. If you call a method on each histogram right after RDF returns it, you trigger one event loop per histogram. If you instead call the method on all histograms at the end, a single event loop fills all of them; that is the reason for the performance difference.
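To illustrate (a minimal sketch, not your actual code: the tree name, file name, and branch names below are hypothetical):

```cpp
#include <map>
#include <string>
#include <ROOT/RDataFrame.hxx>
#include <TH1D.h>

void fill_histograms()
{
   ROOT::RDataFrame df("tree", "test.root"); // hypothetical tree/file names

   std::map<std::string, ROOT::RDF::RResultPtr<TH1D>> histos;
   for (const std::string &var : {"x", "y", "z"}) { // hypothetical branches
      histos[var] = df.Histo1D(var); // lazy: only *books* the histogram
      // histos[var]->Sumw2();       // SLOW: dereferencing the RResultPtr here
                                     // triggers one full event loop per histogram
   }

   // FAST: dereference only after all histograms are booked; the first
   // access runs a single event loop that fills every histogram at once.
   for (auto &kv : histos)
      kv.second->Sumw2();
}
```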
In your case, however, I think what you really want is to call the static method TH1::SetDefaultSumw2(true), which turns on the storage of the sums of squared weights automatically for all histograms; see the docs.
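Something like this (again a sketch with hypothetical tree, file, and branch names):

```cpp
#include <TH1.h>
#include <ROOT/RDataFrame.hxx>

void book_with_sumw2()
{
   // Applies to every TH1-derived object created afterwards, including
   // the histograms RDataFrame books for you, so no per-histogram Sumw2
   // calls (and no extra event loops) are needed.
   TH1::SetDefaultSumw2(true);

   ROOT::RDataFrame df("tree", "test.root"); // hypothetical names
   auto h = df.Histo1D("x");                 // hypothetical branch
   h->Draw(); // first access: a single event loop, errors already tracked
}
```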
Cheers,
Enrico
P.S.
We added several useful features and important performance improvements for large-scale analyses with RDataFrame in recent ROOT versions; consider upgrading from 6.16 to e.g. 6.24.