What is the best way to create a flattened RDataFrame?

Hello,
Let’s take as an example the following use case, which is rather common in HEP. In the original TTree, each entry corresponds to one event. Each event contains information about multiple objects of different types, stored as std::vector<float> (or arrays). I would like to study the behavior of objects of a given type, for example taus, and be able to fully exploit the benefits provided by RDataFrame (e.g. Define, Filter, Reduce, etc.). As I see it, the best way to do this would be something like:

auto df_taus = df_events.Flatten({"tau_pt", "tau_eta", ...});
// then use df_taus as a normal RDataFrame, e.g.
df_taus = df_taus.Filter("tau_pt > 20 && abs(tau_eta) < 2.3");
df_taus = df_taus.Define("tau_E_over_pt", "tau_E / tau_pt");
// and so on

But I have not found an elegant solution that provides functionality similar to what is illustrated above. Does anyone have any suggestions?

Thank you!

Cheers,
Konstantin.

Hi,
The corresponding RDF feature would be Explode, but it’s still only a prototype, see https://sft.its.cern.ch/jira/browse/ROOT-9225.

In the meantime, you can use tau_pt and tau_eta as RVecs (https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html) and write e.g.

Define("good_idx", "tau_pt > 20").Define("tau_E_over_pt", "tau_E[good_idx] / tau_pt[good_idx]")

Hope this helps!
Enrico
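
The masking idiom in the Define calls above can be illustrated with plain NumPy arrays, whose boolean indexing behaves analogously for this purpose. This is only a sketch: the arrays stand in for the tau columns of a single event, and the numeric values are made up for illustration.

```python
import numpy as np

# Hypothetical per-event tau columns (values invented for illustration).
tau_pt = np.array([35.0, 12.0, 48.0])
tau_E = np.array([70.0, 30.0, 60.0])

# Element-wise comparison yields a boolean mask, one entry per tau.
good_idx = tau_pt > 20

# Indexing with the mask keeps only the selected taus, so the division
# runs over matching elements of both arrays.
tau_E_over_pt = tau_E[good_idx] / tau_pt[good_idx]
print(tau_E_over_pt)
```

The second tau fails the pt cut, so the result has two entries rather than three. The event is kept, only the per-object arrays shrink.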

Thank you Enrico!
Explode would be exactly what I was looking for. Do you know if there are plans to include it in ROOT 6.20?

Yes, it can help, to some extent. Thanks! One question: how can I create a numpy array in this case? Will df.AsNumpy(columns=["tau_E_over_pt"]) work?

There are other changes with higher priority, so not necessarily…buuuut getting pinged by users is a good way to increase the urgency of a feature request :smile:.

Handing off to @swunsch for the numpy arrays question.

Cheers,
Enrico

Hi Konstantin!

You can try the following snippet to check how arrays are read out using AsNumpy:

import ROOT
df = ROOT.RDataFrame(
        "Events",
        "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root")
print(df.Range(3).AsNumpy(columns=["Muon_pt"]))
{'Muon_pt': numpy.array([<ROOT.ROOT::VecOps::RVec<float> object at 0x630d310>,
             <ROOT.ROOT::VecOps::RVec<float> object at 0x630d338>,
             <ROOT.ROOT::VecOps::RVec<float> object at 0x630d360>],
            dtype=object)}

Because the arrays can have variable size, we read them out as the corresponding C++ object (ROOT::RVec<float> in this case). Note that this readout is much slower than reading fundamental types, since we have to create a Python proxy for each event.

Best
Stefan
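
To make the consequence of dtype=object concrete, here is a small NumPy-only sketch. It does not use ROOT: ordinary NumPy arrays of different lengths stand in for the RVec objects, and the values are rounded from the Muon_pt output shown later in the thread. It shows that such an array is one-dimensional and cannot be stacked into a 2D numeric array while the rows have different lengths.

```python
import numpy as np

# Stand-ins for per-event collections of different sizes (not real RVecs).
rows = [np.array([52.0, 42.9]), np.array([5.0]), np.array([16.0, 12.5])]

# An object array stores one reference per event, not a 2D block of floats.
ragged = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    ragged[i] = row

print(ragged.shape, ragged.dtype)  # (3,) object

# Stacking requires equal-length rows, so it fails for ragged input.
try:
    np.stack(list(ragged))
    stacked_ok = True
except ValueError:
    stacked_ok = False
print(stacked_ok)
```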

Hi Stefan,
thank you!
Is there a way to convert such an array into a “normal” numeric numpy array with padding, without an explicit loop in Python? I tried the code below (and some variations of it), but could not make it work.

n_evt = 3
data = df.Range(n_evt).AsNumpy(columns=["nMuon", "Muon_pt"])
max_n_muon = int(np.amax(data['nMuon']))
x = np.zeros((n_evt, max_n_muon))
x[:, :] = data['Muon_pt'][:][:]
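
The slice assignment above fails because data['Muon_pt'] is a 1D object array whose elements have different lengths, so it cannot be broadcast into a rectangular float array. For comparison, here is one NumPy-side way to pad ragged rows using a boolean mask built from the row lengths. It is a sketch under assumptions: pad_ragged is a hypothetical helper, plain Python lists stand in for the RVec objects, and it still touches each row once in Python to get its length and convert it, so it is not fully loop-free.

```python
import numpy as np

def pad_ragged(rows, pad_value=0.0):
    """Pad variable-length rows into a 2D float array (hypothetical helper)."""
    if len(rows) == 0:
        return np.empty((0, 0))
    lengths = np.array([len(r) for r in rows])
    max_len = int(lengths.max())
    # mask[i, j] is True where row i has a real element in column j.
    mask = np.arange(max_len) < lengths[:, None]
    out = np.full((len(rows), max_len), pad_value, dtype=np.float64)
    # Boolean assignment fills True positions in row-major order,
    # which matches the order of the concatenated rows.
    out[mask] = np.concatenate([np.asarray(r, dtype=np.float64) for r in rows])
    return out

# Stand-in for three events with 2, 1 and 2 muons.
muon_pt = [[52.008335, 42.85704], [5.0199485], [15.967432, 12.48129]]
x = pad_ragged(muon_pt)
print(x)
```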

Sorry, I’m on my phone, so very briefly: you can simply do the padding in C++ with a Define node!

Hm… indeed. Silly me :slight_smile: Thanks a lot! Following your suggestion, I finally managed to extract a 2D numpy array!

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
template<typename T>
ROOT::RVec<T> ApplyPadding(const ROOT::RVec<T>& x, size_t max_size, const T& pad)
{
    ROOT::RVec<T> padded = x;
    padded.resize(max_size, pad);
    return padded;
}
''')
df_full = ROOT.RDataFrame("Events",
        "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root")
column = 'Muon_pt'
padded_column = column + '_padded'
df = df_full.Range(3)
max_n_muon = int(df.Max('nMuon').GetValue())
df = df.Define(padded_column, 'ApplyPadding({}, {}, 0.f)'.format(column, max_n_muon))
data = df.AsNumpy(columns=[padded_column])[padded_column]
print(np.stack(data))
[[52.008335  42.85704  ]
 [ 5.0199485  0.       ]
 [15.967432  12.48129  ]]

Best,
Konstantin.

P.S. Although the solution I posted above works, it is far too slow even for a moderately sized dataset, and would not be suitable for any real application… Any ideas on how the performance could be improved? :slight_smile: