RDataFrame: Sumw2 affects reading speed, and another issue

Hello,

I’m using RDataFrame to create histograms from a tree. As I have many variables to plot, I use a loop to make the histograms and store them in a dict.

I found that if I compute the errors with Sumw2 right after getting EACH histogram, my program slows down significantly. The only way to avoid this is to finish filling the dict of histograms first, then loop over the dict and call Sumw2.

I have attached a test script [1] along with a text input [2]. The script creates a test ROOT file, reads the histograms, and computes the errors in the two ways described above, so you can reproduce the result. For a small ROOT file (less than 1 MB), the difference between the two methods is 1 second (plot in [3]). Is this what we should expect?

Another question related to RDataFrame: if I add a new column using Define (as I did in my test code), it does not work with ROOT version 6.18 and above.

Thank you,
Hoa.

[1]
CodePython_RDataFrame.py (9.3 KB)
[2]
variable.txt (114 Bytes)

[3]

ROOT Version: 6.16
Platform: Ubuntu 18.04, lxplus
Compiler: Not Provided


I’m sure @eguiraud will be able to help

Somewhere before you create any histogram, execute the following (there is then no need to call Sumw2 on each one): myroot.TH1.SetDefaultSumw2(True)

Hi,
RDataFrame is lazy: it only runs the event loop and produces the results when you access one of them for the first time. If you call a method on each histogram right after RDF returns it, you run one event loop per histogram. If you call the method at the end, you run a single event loop that fills all histograms. That is the reason for the performance difference.
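
To make the counting concrete, here is a toy model of lazy booking in plain Python. It is only a sketch of the idea and not how RDataFrame is implemented: `ToyFrame`, `ToyResult`, `sum_of` and `count_loops` are hypothetical names invented for this illustration, and a simple column sum stands in for a histogram.

```python
# Toy model of lazy result booking. NOT ROOT code: all names are hypothetical.
class ToyResult:
    def __init__(self, frame, column):
        self.frame = frame
        self.column = column
        self.total = 0.0
        self.filled = False

    def value(self):
        # First access triggers an event loop if this result is not filled yet.
        if not self.filled:
            self.frame.run()
        return self.total


class ToyFrame:
    def __init__(self, events):
        self.events = list(events)
        self.booked = []   # every result booked so far
        self.n_loops = 0   # number of passes over the events

    def sum_of(self, column):
        # Booking is cheap: nothing is computed here.
        result = ToyResult(self, column)
        self.booked.append(result)
        return result

    def run(self):
        # One pass over the events fills every result that is still pending.
        self.n_loops += 1
        pending = [r for r in self.booked if not r.filled]
        for event in self.events:
            for r in pending:
                r.total += event[r.column]
        for r in pending:
            r.filled = True


def count_loops(access_each_time):
    events = [{"x": 1.0, "y": 2.0, "z": 3.0} for _ in range(100)]
    frame = ToyFrame(events)
    results = {}
    for column in ("x", "y", "z"):
        results[column] = frame.sum_of(column)
        if access_each_time:
            results[column].value()  # like calling Sumw2 inside the loop
    totals = {c: r.value() for c, r in results.items()}
    return frame.n_loops, totals


loops_each, totals_each = count_loops(access_each_time=True)
loops_end, totals_end = count_loops(access_each_time=False)
print(loops_each, loops_end)
```

Both variants produce the same totals. Only the number of passes over the data differs: one per result when each result is touched right after booking, and a single pass when all results are booked first.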

In your case, however, I think what you really want is to call the static method TH1::SetDefaultSumw2(true), which automatically turns on the storage of the sum of squared weights for all histograms created afterwards; see the docs.
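
As a reminder of what the weight sums are used for, the short sketch below computes the error of a single bin by hand, in plain Python rather than ROOT. The weights are made-up numbers. It assumes the behaviour described in the TH1 documentation: with the sum of squared weights stored, the bin error is the square root of that sum, otherwise it is the square root of the bin content.

```python
import math

# Hypothetical weights of the entries that landed in one bin.
weights = [0.5, 0.5, 2.0, 1.0]

content = sum(weights)               # bin content: sum of weights
sumw2 = sum(w * w for w in weights)  # what the Sumw2 array stores for this bin

error_without_sumw2 = math.sqrt(abs(content))  # fallback: sqrt(content)
error_with_sumw2 = math.sqrt(sumw2)            # sqrt(sum of squared weights)

print(content, error_without_sumw2, error_with_sumw2)
```

For unit weights the two numbers coincide, so the distinction only matters for weighted fills.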

Cheers,
Enrico

P.S.
We added several useful features and important performance improvements for large-scale analyses in recent RDF versions. Consider switching from 6.16 to e.g. 6.24.

Thank you very much! It works.

Cool! Keep an eye on the ROOT release notes, we have some improvements planned for RDF+Python :smiley:
