What is the best way to create a flattened RDataFrame?

kandrosov · September 12, 2019, 8:22am

Hello,
Let’s take as an example the following use case, which is rather common in HEP. In the original TTree each entry corresponds to one event. Each event contains information about multiple objects of different types, stored as std::vector<float> (or arrays). I would like to study the behavior of objects that belong to a given type, for example taus, and be able to fully exploit the benefits provided by RDataFrame (e.g. Define, Filter, Reduce etc.). As I see it, the best way to do it would be something like:

auto df_taus = df_events.Flatten({"tau_pt", "tau_eta", ...});
// then use df_taus as a normal RDataFrame, e.g.
df_taus = df_taus.Filter("tau_pt > 20 && abs(tau_eta) < 2.3");
df_taus = df_taus.Define("tau_E_over_pt", "tau_E / tau_pt");
// and so on

But I have not managed to find an elegant solution that provides functionality similar to the one illustrated above. Does anyone have any suggestions?
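
To make the requested operation concrete, here is a minimal sketch of the flattening in plain NumPy. It is not RDataFrame code: the helper name flatten_columns, the extra "event" index column and the toy values are all made up for illustration.

```python
import numpy as np

def flatten_columns(columns):
    """Turn per-event, variable-length columns into flat per-object columns.

    `columns` maps a column name to a list with one sequence per event.
    Every column must hold the same number of objects in a given event.
    """
    names = list(columns)
    counts = np.array([len(v) for v in columns[names[0]]])
    # Remember which event each flattened object came from.
    flat = {"event": np.repeat(np.arange(len(counts)), counts)}
    for name in names:
        per_event = columns[name]
        if [len(v) for v in per_event] != counts.tolist():
            raise ValueError("column {} has inconsistent object counts".format(name))
        flat[name] = np.concatenate([np.asarray(v, dtype=float) for v in per_event])
    return flat

# Three toy events with 2, 0 and 1 taus (hypothetical values).
events = {
    "tau_pt":  [[25.0, 18.0], [], [40.0]],
    "tau_eta": [[0.5, -2.5], [], [1.2]],
}
taus = flatten_columns(events)

# After flattening, a selection is an ordinary element-wise cut.
selected = (taus["tau_pt"] > 20) & (np.abs(taus["tau_eta"]) < 2.3)
print(taus["event"][selected], taus["tau_pt"][selected])
```

Each row of the flattened columns is one tau, and the "event" column keeps the link back to the original entry.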

Thank you!

Cheers,
Konstantin.

eguiraud · September 12, 2019, 8:42am

Hi,
The corresponding RDF feature would be Explode, but it’s still only a prototype, see https://sft.its.cern.ch/jira/browse/ROOT-9225.

In the meantime, you can use tau_pt and tau_eta as RVec (https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html) and write e.g.

Define("good_idx", "tau_pt > 20").Define("tau_E_over_pt", "tau_E[good_idx] / tau_pt[good_idx]")
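
The comparison yields a per-element mask, and indexing with it keeps only the selected elements, much like boolean masks in NumPy. As a rough analogy for a single event (plain NumPy with toy values, not ROOT code):

```python
import numpy as np

# Toy values for the taus of one event (made up for illustration).
tau_pt = np.array([25.0, 18.0, 40.0])
tau_E = np.array([50.0, 20.0, 100.0])

# Element-wise comparison gives one boolean per tau.
good_idx = tau_pt > 20

# Indexing with the mask keeps only the selected taus;
# the division is then element-wise on the survivors.
tau_E_over_pt = tau_E[good_idx] / tau_pt[good_idx]
print(tau_E_over_pt)
```

The result still has one variable-length array per event, so the column stays an RVec rather than becoming a flat per-tau column.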

Hope this helps!
Enrico

kandrosov · September 12, 2019, 8:56am

Thank you Enrico!
Explode would be exactly what I was searching for. Do you know if there are plans to include it into ROOT 6.20?

Yes, it can help, to some extent. Thanks! One question: how do I create a numpy array in such a case? Will df.AsNumpy(columns=["tau_E_over_pt"]) work?

eguiraud · September 12, 2019, 9:09am

There are other changes with higher priority, so not necessarily… buuuut getting pinged by users is a good way to increase the urgency of a feature request.

Handing off to @swunsch for the numpy arrays question.

Cheers,
Enrico

swunsch · September 12, 2019, 12:31pm

Hi Konstantin!

You can try the following snippet to check how arrays are read out using AsNumpy:

import ROOT
df = ROOT.RDataFrame(
        "Events",
        "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root")
print(df.Range(3).AsNumpy(columns=["Muon_pt"]))

{'Muon_pt': numpy.array([<ROOT.ROOT::VecOps::RVec<float> object at 0x630d310>,
             <ROOT.ROOT::VecOps::RVec<float> object at 0x630d338>,
             <ROOT.ROOT::VecOps::RVec<float> object at 0x630d360>],
            dtype=object)}

Because the arrays can have variable size, we read them out as the corresponding C++ object (ROOT::RVec<float> in this case). Note that the read out is much slower than reading fundamental types since we have to create the Python proxies for each event.
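
To illustrate the shape of such a result without ROOT, here is a small NumPy sketch. Plain float32 arrays stand in for the RVec proxies, and the values are made up; the point is only that the column is a 1D array of objects, not a 2D numeric array.

```python
import numpy as np

# One variable-length sequence per event (toy values).
rows = [[52.0, 42.9], [5.0], [16.0, 12.5]]

# Build a 1D object array with one entry per event,
# where each entry is itself an array of its own length.
col = np.empty(len(rows), dtype=object)
for i, r in enumerate(rows):
    col[i] = np.asarray(r, dtype=np.float32)

lengths = [len(v) for v in col]
print(col.dtype, col.shape, lengths)
```

Since the entries have different lengths, NumPy cannot treat the column as a rectangular block, so padding or flattening is needed before numeric 2D operations.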

Best
Stefan

kandrosov · September 12, 2019, 3:02pm

Hi Stefan,
Thank you!
Is there a way to convert such an array to a “normal” numeric numpy array with padding, without an explicit loop in Python? I tried to do it as in the code below (and some variations of it), but did not manage to make it work.

n_evt = 3
data = df.Range(n_evt).AsNumpy(columns=["nMuon", "Muon_pt"])
max_n_muon = int(np.amax(data['nMuon']))
x = np.zeros((n_evt, max_n_muon))
x[:, :] = data['Muon_pt'][:][:]

swunsch · September 12, 2019, 4:01pm

Sorry, I’m on the phone, so very brief: you can simply do the padding in C++ with a Define node!

kandrosov · September 12, 2019, 5:15pm

Hm… indeed. Silly me. Thanks a lot! Following your suggestions I finally managed to extract a 2D numpy array!

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
template<typename T>
ROOT::RVec<T> ApplyPadding(const ROOT::RVec<T>& x, size_t max_size, const T& pad)
{
    ROOT::RVec<T> padded = x;
    padded.resize(max_size, pad);
    return padded;
}
''')
df_full = ROOT.RDataFrame("Events",
        "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root")
column = 'Muon_pt'
padded_column = column + '_padded'
df = df_full.Range(3)
max_n_muon = int(df.Max('nMuon').GetValue())
df = df.Define(padded_column, 'ApplyPadding({}, {}, 0.f)'.format(column, max_n_muon))
data = df.AsNumpy(columns=[padded_column])[padded_column]
print(np.stack(data))

[[52.008335  42.85704  ]
 [ 5.0199485  0.       ]
 [15.967432  12.48129  ]]

Best,
Konstantin.

kandrosov · September 12, 2019, 6:02pm

P.S. Although the solution I posted above works, it is far too slow even for a moderately sized dataset and would not be suitable for any real application… Any ideas on how the performance could be improved?
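
One direction that might be worth trying, offered only as an untested sketch: do the padding on the NumPy side from a flat array of values plus the per-event counts (e.g. nMuon), so that no per-event Python objects are needed for that step. The helper name pad_from_flat is hypothetical, the code assumes the flat values are already available in event order, and it has not been run against ROOT or timed.

```python
import numpy as np

def pad_from_flat(flat_values, counts, pad=0.0):
    """Build a padded (n_events, max_count) array from flat values.

    `flat_values` holds all objects of all events back to back;
    `counts[i]` is the number of objects in event i.
    """
    flat_values = np.asarray(flat_values)
    counts = np.asarray(counts)
    n_evt = len(counts)
    max_n = int(counts.max()) if n_evt else 0
    # mask[i, j] is True when event i has a j-th object.
    mask = np.arange(max_n)[None, :] < counts[:, None]
    padded = np.full((n_evt, max_n), pad, dtype=flat_values.dtype)
    # Boolean assignment fills the True slots in row-major order,
    # which matches the event order of the flat values.
    padded[mask] = flat_values
    return padded

# Toy input: 3 events with 2, 1 and 2 objects (made-up values).
flat = np.array([52.0, 42.9, 5.0, 16.0, 12.5], dtype=np.float32)
counts = np.array([2, 1, 2])
print(pad_from_flat(flat, counts))
```

Whether this helps in practice depends on how cheaply the flat values can be obtained from the tree, which this sketch does not address.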

system · September 26, 2019, 6:02pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.