Increase accuracy of leaves when using RDataFrame.Mean

Dear ROOT-ers,

I’m reading a ROOT tree with Float_t leaves. I use RDataFrame for that:

dr_th_no_cut = df.Mean["double"]("displacement.dr_proj_th_cm")

My idea is that I don’t need high precision for each leaf, which saves space. To calculate the mean, however, I would like the result to be computed in higher precision to avoid numerical errors. Unfortunately, I get an error with that code (see the first line):

Error in TTreeReaderValueBase::CreateProxy(): Leaf of type Float_t cannot be read by TTreeReaderValue.
Traceback (most recent call last):
File "…/read_phys_file.py", line 46, in
print("no cut, cuts:", dr_th_no_cut.GetValue(), dr_th_wcuts.GetValue())
~~~~~~~~~~~~~~~~~~~~~^^
cppyy.gbl.std.runtime_error: const double& ROOT::RDF::RResultPtr::GetValue() =>
runtime_error: An error was encountered while processing the data. TTreeReader status code is: 6

Can RDataFrame support increasing accuracy for such aggregation operations? Is there a good workaround to solve that?

I’m using ROOT 6.30 at the moment. Thank you.

Maybe with Redefine (or Define, for new extra columns)? The result seems to be the same either way, though, so perhaps it doesn’t matter that the column is originally float and ROOT accumulates in double precision anyway. Or perhaps the cast is not really converting to double; an RDataFrame expert may clarify.

With this:

import ROOT

d = ROOT.RDataFrame("ntuple","hsimple.root")
print(d.Describe())
print('Mean px:',d.Mean("px").GetValue())
print('Mean py:',d.Mean("py").GetValue())

d2 = d.Redefine("px","(double)px").Redefine("py","(double)py")
print(d2.Describe())
print('Mean px:',d2.Mean("px").GetValue())
print('Mean py:',d2.Mean("py").GetValue())

I get

...
Column  Type    Origin
------  ----    ------
i       Float_t Dataset
px      Float_t Dataset
py      Float_t Dataset
pz      Float_t Dataset
random  Float_t Dataset

Mean px: -0.0038264499006807457
Mean py: -0.0032243128226821954

...
Column  Type    Origin
------  ----    ------
i       Float_t Dataset
px      double  Define
py      double  Define
pz      Float_t Dataset
random  Float_t Dataset

Mean px: -0.0038264499006807457
Mean py: -0.0032243128226821954

Perhaps @vpadulan can help here

@ynikitenko I think I have good news for you:

Although the column being read might be float, the internal accumulators used to calculate the mean are always double. Does this solve this topic?
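
To illustrate why a double accumulator matters even when the inputs are float, here is a minimal pure-Python sketch. It does not use ROOT and is not RDataFrame's actual code; the helper names (`to_float32`, `mean_float32_accumulator`, `mean_double_accumulator`) are made up for this example, and single precision is emulated by rounding through `struct`.

```python
import struct

_F32 = struct.Struct("f")

def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return _F32.unpack(_F32.pack(x))[0]

def mean_float32_accumulator(values):
    """Mean with the running sum rounded to float32 after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + v)
    return to_float32(total / len(values))

def mean_double_accumulator(values):
    """Mean of the same float32 inputs, but with the running sum kept in double."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

# One million copies of 0.1, stored as float32 (like a Float_t leaf).
x = to_float32(0.1)
values = [x] * 1_000_000

m32 = mean_float32_accumulator(values)
m64 = mean_double_accumulator(values)

print("stored value:         ", repr(x))
print("float32 accumulator:  ", repr(m32))
print("double accumulator:   ", repr(m64))
```

In this sketch the float32 running sum drifts visibly from the stored value, while the double running sum reproduces it. This is the situation described above: storing the leaves as float costs input precision only, not accumulation precision.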


Even better, in 2022, RDF was updated to use Kahan sums to compute the running sums.
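
For readers unfamiliar with Kahan summation, the following is a textbook sketch in plain Python, not the code used inside RDF. The function name `kahan_sum` is chosen for this example, and `math.fsum` serves only as an accurately rounded reference.

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation: track the rounding error of each addition."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # apply the correction to the next input
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (with sign)
        total = t
    return total

values = [0.1] * 1_000_000

naive = 0.0
for v in values:
    naive += v

compensated = kahan_sum(values)
reference = math.fsum(values)  # accurately rounded sum from the standard library

print("naive error:      ", abs(naive - reference))
print("compensated error:", abs(compensated - reference))
```

The compensated sum stays much closer to the reference than the naive loop, at the cost of a few extra floating-point operations per element.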


I also added a small comment to the documentation, so the knowledge can be found by users:

Many thanks! Yes, it is even better than I imagined :slight_smile:

Looking forward to the information appearing in the docs for Mean.

I suppose this also applies to Sum, Stats, … and probably many other methods. I peeked into the source code for Mean, and the Kahan summation is not evident at all (though the use of double is more visible).

Could you also document that, please? Maybe there could be a common section on precision?
