RDataFrame - vector - access each element

Just to comply with the forum rules:

  • ROOT Version: 6.24/08
  • Platform: CentOS Linux release 7.9.2009
  • Compiler: GCC 4.8.5

Hello everyone,
I would like to ask about a problem I encountered while working on my analysis using RDataFrame. For simplicity I will skip the Whys and Whens and jump straight to the point.
I have a std::vector<float> particleEnergies stored in a TTree for every event that passes my selection. The vector holds the energies of a particle: it is the same particle with the same true energy, but the values differ because each one is calculated by a different reconstruction algorithm. My goal is to compare the algorithms, so I need to access each element of the vector separately. As a first step I would like to do something like this:

auto df = ROOT::RDataFrame("myTree", "mytree.root");
auto algo0 = df.Histo1D("particleEnergies[0]");
auto algo1 = df.Histo1D("particleEnergies[1]");
auto algo2 = df.Histo1D("particleEnergies[2]");
.......

Basically I would like to create a histogram that is filled not with every value in the vector (as is the case in the documentation) but with only one element. I tried this and it does not work, because a branch called particleEnergies[] does not exist.

There is of course the possibility of storing every element in a separate branch (float E1, E2, E3, E4, …) already at the selection stage and then accessing each branch individually. However, I do not like this approach because everything is hardcoded: if an algorithm is added or removed in the future, I would need to redo it almost from scratch.

I already discussed this with one of your ROOT colleagues and she came up with a very nice solution, where I first Define a new column for each element of the vector and then access it. Something like this:

auto df = ROOT::RDataFrame("myTree", "mytree.root");
auto loopDf = df.Define("NOOP", "0");
for (int i = 0; i < 9; i++) {
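  // The index has to be written into the expression string itself: a literal "particleEnergies[i]" would not know about the loop variable i
  loopDf = loopDf.Define("particleEnergies" + std::to_string(i), "particleEnergies[" + std::to_string(i) + "]");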
}
// Now every element has a new separate branch and I can access it and for example fill histos
for (int i = 0; i < 9; i++) {
 loopDf.Histo1D(Form("particleEnergies%i", i));
 ........
}

My question is this: is this the correct approach, or is there a better or different one that fully uses the functionality of RDataFrame to access individual elements of a vector and fill histograms “easily”, without creating/duplicating branches? And would it be at all possible (or is it maybe a completely stupid idea) to support my first approach in the future? (example below)

auto df = ROOT::RDataFrame("myTree", "mytree.root");
auto algo3 = df.Histo1D("particleEnergies[3]");

Maybe @vpadulan can give his thoughts

Hello @Zdenko_Hives ,

and welcome to the ROOT forum!

That’s indeed how you would do it currently.
One feature that would simplify your code is at this Jira ticket, but unfortunately there hasn’t been any progress there in a long time.
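
For reference, here is a minimal sketch of how the per-element Define arguments could be generated so that the element index ends up inside the expression string. MakeElementDefines is a hypothetical helper, not part of ROOT, and the element count (9) and the RNode variable named node in the usage comment are assumptions taken from the example in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ROOT): for a vector column `vectorColumn`
// with `nElements` entries, build the (new column name, expression) pairs
// that one would pass to RDataFrame's Define.
std::vector<std::pair<std::string, std::string>>
MakeElementDefines(const std::string &vectorColumn, std::size_t nElements)
{
   assert(!vectorColumn.empty() && "need a column name to index into");
   std::vector<std::pair<std::string, std::string>> defines;
   defines.reserve(nElements);
   for (std::size_t i = 0; i < nElements; ++i) {
      const std::string idx = std::to_string(i);
      // The index is spelled out in the expression string, because the
      // expression is compiled on its own and cannot see a loop variable.
      defines.emplace_back(vectorColumn + idx, vectorColumn + "[" + idx + "]");
   }
   return defines;
}

// Intended use with ROOT (assumes a ROOT::RDF::RNode named `node`):
//   for (const auto &d : MakeElementDefines("particleEnergies", 9))
//      node = node.Define(d.first, d.second);
```

Keeping the names and expressions in one place means that adding or removing an algorithm only changes the element count, not the rest of the analysis code.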

The solution you propose (allowing expressions in place of column names) would work too, but it has some implications for the general RDF API that make it a bit cumbersome to support:

  • we would need to detect whether the column name passed is a single name or an expression. this is tricky because it involves parsing code, but especially so because TTree branch names can be arbitrary strings
  • it cannot work for actions such as Snapshot that need a column name (in the case of Snapshot, to know what to call the branch in the output tree)
  • this is subtle, but it might push users to copy-paste those expressions around rather than Define-ing them once, which has performance implications
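
To illustrate the first point, here is a small sketch of a naive detection rule. It is purely hypothetical and is not how RDF works internally: Classify and ArgKind are made-up names, and the rule "an exact match with a known column wins" is just one possible assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class ArgKind { ColumnName, Expression };

// Hypothetical heuristic (not ROOT code): treat the argument as a column
// name if it exactly matches a known column, otherwise as an expression.
ArgKind Classify(const std::string &arg, const std::vector<std::string> &knownColumns)
{
   assert(!arg.empty() && "an empty string is neither a column nor an expression");
   const bool isKnown =
      std::find(knownColumns.begin(), knownColumns.end(), arg) != knownColumns.end();
   return isKnown ? ArgKind::ColumnName : ArgKind::Expression;
}
```

With only a column called particleEnergies, the string "particleEnergies[0]" is classified as an expression. If the input tree also contains a branch whose name is literally "particleEnergies[0]", the same string is classified as a column name, so the meaning of the user's code would depend on the dataset.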

Let me take the chance to suggest upgrading to the latest ROOT version, in which many new RDF features have been introduced and several bugs have been squashed!

Cheers,
Enrico

Alright, thank you very much. I will check the Jira ticket.

Cheers,
Zdenko Hives
