Apply StdDev to each bin with RDataFrame (groupby feature)

wiso · June 10, 2020, 10:32am

I would like to divide my data into non-regular bins, depending on the value of a column (var1). For each bin then I want to compute the std of another variable (var2).

If my problem would be “compute the mean” instead of “compute the std” TProfile would work. I guess that Profile method of RDataframe is using TProfile, so as for as I know TProfile can only do means.

What a TProfile is doing is actually two operations: grouping (binning) and applying a reduction function (the mean). It is true that in TProfile the two operarations are not done one after the other (first grouping all events, then computing the mean). Is that possible to generalize that in RDataFrame (or in TProfile)? Presently I could define all the bins manually with Filter, but that is very tedious (my problem is 2D…).

In other frameworks this is solved with a groupby followed by an aggregation function able to run on top of multiple dataframes (actually Series). In pandas it would be (in 1D)

df.groupby(pd.groupby(“var1”, binning))[‘var2’].std()

pd.cut is a function that does the quantization of var1 to integers according to the binning. Then the method std is applied on all the grouped series. So at the end I would expect a vector of numbers.

Do you plan to extend RDataframe for these use cases? Is there a quick workaround? Maybe you define a sort of Filter function bases on a binning that return multiple dataframes, then I can loop on them manually. The problem is that I have to manually link the index of the list of the dataframes with the index of the bins (not trivial in 2d)

eguiraud · June 10, 2020, 10:56am

Hi @wiso,
first let me check that I understand correctly: given columns ‘var1’ and ‘var2’, for a set of ‘var1’ bins (i.e. disjoint, contiguous value ranges), you need to evaluate the stddev of the corresponding ‘var2’ values. Analogous to filling a TH2D with the desired binning in ‘var1’ and a sufficiently fine-grained binning in ‘var2’, and then take stddevs along the bins in ‘var1’ (hence the TProfile analogy).

This can certainly be done (worst case scenario, by implementing the appropriate RDF helper, but that’s the bazooka solution – or as you say by filling a TH2D and then manually looping over the ‘var1’ bins. Defining one Filter per bin and then using StdDev is also possible but very inefficient.

So the question is what’s the simplest/most elegant way to do this efficiently with RDataFrame.

Are we talking C++ or python?
Can you share a few events that I could play with (even just quickly generated with df.Define(...).Snapshot(...) ?

Cheers,
Enrico

StephanH · June 10, 2020, 11:00am

Given that TProfile can compute the standard deviation as bin error, would it be enough to set it to stdDev mode, and retrieve the bin error?
https://root.cern.ch/doc/master/classTProfile.html

Set option = "s" to do this.

wiso · June 10, 2020, 11:05am

@StephanH, yes that could work, but this is really a workaround. With that trick you can also compute the sum and the number of entries in each bin, but not much more.

StephanH · June 10, 2020, 11:08am

Yes, I really replied only to your question if there’s a quick workaround.

wiso · June 10, 2020, 3:01pm

Hello @eguiraud, thank you for your reply. Here a code to reproduce the TTree and what I want to achieve (take into account that my problem is actually 2D: I want to bin my events based on the value of two variables)

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
print(df)
binning = [0, 50, 100]
all_std = df.groupby(pd.cut(df['A'], bins=binning))['B'].std()
print(all_std)

import uproot

with uproot.recreate("example.root") as f:
    f["mytree"] = uproot.newtree(df.dtypes.to_dict())
    f["mytree"].extend(df.to_dict('list'))

what is printed is the small dataframe:

the last output is the std for each bin (only 2 bins):

(0, 50]      19.882153
(50, 100]    31.846507

I attach the TTreeexample.root (13.1 KB)

eguiraud · June 15, 2020, 11:24am

Hi,
thanks for the extra explanation. So, there is the quick workaround for this specific task, and there is a bazooka solution for the generic case: you can always hand-code any data processing logic in a customized RDF helper class, and RDF will execute the helper’s logic during the possibly multi-thread event loop.

The question is then whether there is or there can be an elegant built-in API that lets RDF cover usecases such as this. Currently, I don’t think there is*.
Ideally, I guess you would want to be able to generate a map (group_id -> RDF node) from a single RDF node with group_dfs = df.GroupBy(grouping_fn). Then you could define stddevs = {k: df.StdDev(...) for k, df in group_dfs}. With this API, GroupBy could not be lazily evaluated though, because we need to know how many RDF nodes to return from it at the point we call it (C++ does not have generators). I wonder if we could do better. work in progress

Cheers,
Enrico

*best I can think of: RDF::Aggregate can be used to create the groups, and might return a std::map<int, std::vector> (easily convertible to a python dictionary with { k: np.array(v) for k, v in map }) that would correspond to the all_std variable in your snippet above. You can then create another RDF from the dictionary with ROOT.RDF.MakeNumpyDataFrame. I realize this is not pythonic (it even involves writing the aggregator function in C++, and it’s an annoying thing to have to code)

swunsch · June 15, 2020, 2:21pm

Hi!

@wiso Since pandas does the job excellently with the groupby transformation, you can also use the infrastructure around RDataFrame to push the data directly to pandas. Have a look at the following (runnable!) example:

import ROOT
import pandas

# Create a ROOT dataframe (note that this code is runnable!)
filename = 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root'
df = ROOT.RDataFrame('Events', filename)

# Read exactly the data you need to numpy arrays
data = df.Filter('nMuon == 2')\
         .Range(1000)\
         .Define('pt_1', 'Muon_pt[0]')\
         .Define('pt_2', 'Muon_pt[1]')\
         .AsNumpy(['pt_1', 'pt_2'])

# Push the data to pandas and use the groupby transformation
pdf = pandas.DataFrame(data)
std = pdf.groupby(pandas.cut(pdf['pt_1'], bins=[0, 50, 100]))['pt_2'].std()
print(std)

pt_1
(0, 50]      134.726929
(50, 100]     15.431849
Name: pt_2, dtype: float32

We provide interoperability to the python data science ecosystem exactly for such use cases that cannot be done efficiently with the highlevel interfaces in ROOT such as RDataFrame. However, since for your application you just need two columns on the “python side”, doing the preprocessing with RDataFrame in C++ and push just the required data to memory seems to me a feasible strategy.

Best
Stefan