Hello,
Let’s take as an example the following use case, which is rather common in HEP. In the original TTree, each entry corresponds to one event. Each event contains information about multiple objects of different types, stored as std::vector<float> (or arrays). I would like to study the behavior of objects of a given type (for example, taus) while still fully exploiting the benefits provided by RDataFrame (e.g. Define, Filter, Reduce, etc.). As I see it, the best way to do it would be something like:
auto df_taus = df_events.Flatten({"tau_pt", "tau_eta", ...});
// then use df_taus as a normal RDataFrame, e.g.
df_taus = df_taus.Filter("tau_pt > 20 && abs(tau_eta) < 2.3");
df_taus = df_taus.Define("tau_E_over_pt", "tau_E / tau_pt");
// and so on
But I haven’t managed to find an elegant solution that provides functionality similar to what is illustrated above. Does anyone have any suggestions?
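To make the desired "flatten" semantics concrete, here is a minimal sketch in plain numpy (outside ROOT, with invented toy values): each event holds a variable number of taus, and flattening turns the per-event vectors into one row per tau, after which per-object cuts become ordinary element-wise operations.

```python
import numpy as np

# Toy data: three events with 2, 1, and 3 taus respectively
# (these values are invented for illustration only).
events_tau_pt  = [np.array([25.0, 18.0]), np.array([40.0]), np.array([15.0, 30.0, 22.0])]
events_tau_eta = [np.array([0.5, -1.1]),  np.array([2.5]),  np.array([-0.3, 1.9, 2.2])]

# "Flatten": one entry per tau instead of one entry per event.
tau_pt  = np.concatenate(events_tau_pt)
tau_eta = np.concatenate(events_tau_eta)

# Per-object selection, analogous to Filter("tau_pt > 20 && abs(tau_eta) < 2.3").
sel = (tau_pt > 20) & (np.abs(tau_eta) < 2.3)
print(tau_pt[sel])  # → [25. 30. 22.]
```

This is only a conceptual illustration of the requested feature, not an RDataFrame API.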
There are other changes with higher priority, so not necessarily… buuuut getting pinged by users is a good way to increase the urgency of a feature request.
Handing off to @swunsch for the numpy arrays question.
{'Muon_pt': numpy.array([<ROOT.ROOT::VecOps::RVec<float> object at 0x630d310>,
<ROOT.ROOT::VecOps::RVec<float> object at 0x630d338>,
<ROOT.ROOT::VecOps::RVec<float> object at 0x630d360>],
dtype=object)}
Because the arrays can have variable size, we read them out as the corresponding C++ objects (ROOT::RVec<float> in this case). Note that reading them out is much slower than reading fundamental types, since we have to create a Python proxy for each event.
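The dtype=object shown above is numpy's fallback whenever rows have different lengths and cannot be packed into one rectangular array. A small sketch (with plain Python lists standing in for the ROOT::VecOps::RVec<float> proxies, and invented values):

```python
import numpy as np

# Variable-length rows cannot form a rectangular float array,
# so numpy stores one Python object per event instead.
ragged = np.empty(3, dtype=object)
ragged[:] = [[31.2], [10.1, 44.7], [5.3, 28.9, 60.0]]  # stand-ins for RVec<float>

print(ragged.dtype)    # → object
print(len(ragged[1]))  # per-event multiplicity → 2
```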
Hi Stefan,
thank you!
Is there a way to convert such an array into a “normal” numeric numpy array with padding, without an explicit Python loop? I tried the code below (and some variations of it), but couldn’t make it work.
n_evt = 3
data = df.Range(n_evt).AsNumpy(columns=["nMuon", "Muon_pt"])
max_n_muon = int(np.amax(data['nMuon']))
x = np.zeros((n_evt, max_n_muon))
x[:, :] = data['Muon_pt'][:][:]  # fails: the 1-D object array of RVecs cannot be broadcast into the 2-D float array
P.S. Although the solution I mentioned above works, it is way too slow even for a moderate dataset size, and would not be suitable for any real application… Any ideas on how the performance could be improved?
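One way to avoid a per-event Python loop over the target array is a vectorized scatter: build a boolean mask of valid slots from the per-event multiplicities, then assign the concatenated values in one step. A sketch in plain numpy (toy values stand in for data['Muon_pt'] and data['nMuon']):

```python
import numpy as np

# Invented stand-in for data['Muon_pt']: one variable-length array per event.
muon_pt = [np.array([30.0, 12.0]), np.array([55.0]), np.array([20.0, 9.0, 41.0])]
lengths = np.array([len(v) for v in muon_pt])  # stand-in for data['nMuon']

n_evt, max_n = len(muon_pt), int(lengths.max())
x = np.zeros((n_evt, max_n))

# mask[i, j] is True exactly where event i has a j-th muon; since the mask
# is traversed row by row, its True slots line up with the concatenation order.
mask = np.arange(max_n) < lengths[:, None]
x[mask] = np.concatenate(muon_pt)  # single vectorized scatter, zero-padded elsewhere
print(x)  # → [[30. 12.  0.] [55.  0.  0.] [20.  9. 41.]]
```

The remaining per-event cost is the concatenation itself; in the real case the per-event conversion from RVec proxies would still dominate, so this only removes the explicit padding loop.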