Simpler way to get full statistics of an RVec branch using RDataFrame?

Hi all,

I am currently doing something like the following:

import ROOT
import numpy as np
import scipy.stats  # the stats submodule is imported explicitly so scipy.stats.skew etc. resolve

df = ROOT.RDataFrame(10000000)
coordDefineCode = '''
    ROOT::RVecD {0}(len);
    std::transform({0}.begin(), {0}.end(), {0}.begin(), [](double){{return gRandom->Uniform(-1.0, 1.0);}});
    return {0};
'''
d = (df.Define("len", "gRandom->Uniform(0, 31337)")
       .Define("x", coordDefineCode.format("x")))

# Now I want to get a full and detailed overview of this variable:
# (note: use the node with the defined columns, d, not the empty df)
arrays = d.AsNumpy(["x"])
data = np.concatenate([np.array(rvec) for rvec in arrays["x"]])

mean = np.mean(data)
median = np.median(data)
std = np.std(data)
skewness = scipy.stats.skew(data)
kurtosis = scipy.stats.kurtosis(data)
min_val = np.min(data)
quantiles = np.quantile(data, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
max_val = np.max(data)

My question is: what would the alternative code using RDataFrame look like?
Converting to numpy and calculating everything in Python does not seem like the optimal approach here. However, I am not sure whether it is possible to achieve everything I want using ROOT…

Thanks!

Hello @FoxWise,

  • For plain values, you can use the Stats() action to create a TStatistic object. It provides the minimum, maximum, mean and variance.
  • If your column consists of RVecs, however, you do have to write some extra code to “unpack” those vectors, because there is not necessarily a unique way to unpack them. (For example, how do you want to treat event weights? Apply them to all entries, or only to the first, …?) The way you did the unpacking is certainly a valid solution to the problem, at the cost of an extra copy of the data in memory.
  • Another way to unpack and process further is the Take() action, but this will also make a copy.
  • You also have the option of writing your own action and booking it with Fill().
  • If you don’t want to keep all the data in memory, you can simply fill a histogram. Histograms are meant for information reduction, so not all data has to be kept in memory. Once filled, you can query the statistics using e.g. GetSkewness() and similar methods. Histograms support skewness, kurtosis, RMS, mean and variance, as well as their errors. You can activate the display of these using SetOptStat() if you want to see them in the plots.
    • An added benefit is that you can use these with multithreading enabled, and per-thread results are merged automatically.
    • Furthermore, you can supply a weight column that either applies one weight to all entries, or if the weight column itself is an RVec, it applies the entries of the latter one-by-one to the entries of the data column.

So you see, there are multiple solutions to the problem, all valid, and in the end I think it depends on your use case and your resource constraints. 🙂
