ROOT Version: 6.24/02
Platform: Linux/Fedora 34
Compiler: gcc version 11.3.1 20220421
I have a tree which contains columns of variable-size arrays:
d of type Int_t
e of type ULong64_t
t of type ULong64_t
Let’s assume the following arrays for this discussion: d{1, 2, 3, 4}, e{100, 200, 300, 400}, t{1111, 2222, 3333, 4444}
Now I want to filter these columns so that only the elements of d equal to 2 or 3 are kept, together with the corresponding elements of e and t. That is, after filtering I want the following:
d{2, 3} e{200, 300} t{2222, 3333}
How can this be achieved with RDataFrame? The resulting RDataFrame will be used for further analysis.
I think the strategy is to Define a column with the indices of the selected values of the d array and use them to filter the columns e and t. We will use VecOps for the sake of this example:
ROOT::RDataFrame empty_df(1);
auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                  .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                  .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});
// Extract indices and apply them to the rvecs
auto df2 = df.Define("indices", "(d==2) || ( d==3 )")
             .Define("new_e", "e[indices]")
             .Define("new_t", "t[indices]");
// Show the result
auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
auto indices = df2.Take<ROOT::RVec<int>>("indices");
auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");
std::cout << e.GetValue()[0] << std::endl
          << t.GetValue()[0] << std::endl
          << indices.GetValue()[0] << std::endl
          << new_e.GetValue()[0] << std::endl
          << new_t.GetValue()[0] << std::endl;
If more performance is needed, the strings in the Defines can be replaced by the corresponding C++ (lambda) functions.
If the type of the e, t and d columns is not RVec, the code will have to be adapted accordingly (see the VecOps documentation, which also covers the implementation of operator []).
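To illustrate what the two string expressions above do, here is a minimal sketch in plain C++ with std::vector standing in for RVec. The helper names make_mask and select_by_mask are hypothetical and not part of ROOT. Note that the column called "indices" actually holds a 0/1 mask with one entry per element of d, not a list of positions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise mask, 1 where d equals v1 or v2, 0 elsewhere.
// Plain-C++ stand-in for the RVec expression "(d == 2) || (d == 3)".
std::vector<int> make_mask(const std::vector<int>& d, int v1, int v2) {
    std::vector<int> mask;
    mask.reserve(d.size());
    for (int x : d) {
        mask.push_back((x == v1 || x == v2) ? 1 : 0);
    }
    return mask;
}

// Hypothetical helper: keep values[i] wherever mask[i] is non-zero.
// Plain-C++ stand-in for the RVec expression "e[indices]".
template <typename T>
std::vector<T> select_by_mask(const std::vector<T>& values, const std::vector<int>& mask) {
    assert(values.size() == mask.size());  // mask and values must be aligned
    std::vector<T> out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (mask[i]) {
            out.push_back(values[i]);
        }
    }
    return out;
}
```

With d{1, 2, 3, 4} the mask is {0, 1, 1, 0}, so applying it to e{100, 200, 300, 400} keeps {200, 300}.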
Yes, this is exactly what I was looking for. Thank you very much for the kind help.
As you mentioned, it is a bit slower than the selection/filter which I used in the meantime (although I know that the filter doesn’t do exactly what I was looking for):
Filter("d[0] == 2 && d[1] == 3")
Could this be also due to the extra columns which are defined in the code above?
Exactly, nothing more. For small workflows you should not see any difference: the code JITted by ROOT is quite performant. Clearly, those calls cannot be inlined and do not benefit from a full-fledged compiler pass that sees the rest of the code. If at some point you feel you need more performance, you know where to look first.
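For reference, the difference between the event-level Filter quoted above and the element-wise selection can be sketched in plain C++. The Event struct and both function names are hypothetical stand-ins, not ROOT API, and the size guard in the event cut is an assumption added so the sketch is safe on short arrays.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tree entry with two array columns.
struct Event {
    std::vector<int> d;
    std::vector<unsigned long long> e;
};

// Event-level cut, mirroring Filter("d[0] == 2 && d[1] == 3"):
// the whole event is either kept or dropped, its arrays are not modified.
bool passes_event_cut(const Event& ev) {
    return ev.d.size() >= 2 && ev.d[0] == 2 && ev.d[1] == 3;
}

// Element-level selection: every event is kept, but only the elements
// of e whose partner in d equals 2 or 3 survive.
std::vector<unsigned long long> select_elements(const Event& ev) {
    std::vector<unsigned long long> out;
    for (std::size_t i = 0; i < ev.d.size() && i < ev.e.size(); ++i) {
        if (ev.d[i] == 2 || ev.d[i] == 3) {
            out.push_back(ev.e[i]);
        }
    }
    return out;
}
```

For the example arrays d{1, 2, 3, 4} and e{100, 200, 300, 400}, the event-level cut rejects the event because d[0] is 1, while the element-level selection returns {200, 300}.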
To learn further, I modified your code as follows with a concise lambda.
This is also because I want to pass the values of d to select on as parameters.
// Headers needed when compiling the macro with ACLiC (.L d_filter.C++)
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <iostream>

void d_filter(int d1, int d2)
{
    // Capture the two values by copy; take the columns by const reference to avoid copying them for every entry
    auto dSelect = [d1, d2](const ROOT::RVec<int> &a, const ROOT::RVec<ULong64_t> &b) {return b[a == d1 || a == d2];};
    ROOT::RDataFrame empty_df(1);
    auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                      .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                      .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});
    // Apply the same selection on d to both e and t
    auto df2 = df
                 .Define("new_e", dSelect, {"d", "e"})
                 .Define("new_t", dSelect, {"d", "t"});
    auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
    auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
    auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
    auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");
    std::cout << e.GetValue()[0] << std::endl
              << t.GetValue()[0] << std::endl
              << new_e.GetValue()[0] << std::endl
              << new_t.GetValue()[0] << std::endl;
}
Now I can run this code as:
.L d_filter.C++
d_filter(2,3)
d_filter(1,2)
etc. Please point out any mistake or bug in this code. If there is none, I consider this the final version of what I wanted to achieve, as described in my first post in this thread.
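As a side note on the parameterised lambda, here is a minimal sketch in plain C++ (std::vector instead of RVec) of a selector that captures the two values by copy and takes both arrays by const reference, so nothing is copied on each call. The name make_selector is hypothetical and only used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical factory: returns a callable that keeps the elements of
// `values` whose partner in `d` equals v1 or v2.
// v1 and v2 are captured by copy, so the callable stays valid after
// make_selector returns; the arrays are passed by const reference.
inline auto make_selector(int v1, int v2) {
    return [v1, v2](const std::vector<int>& d,
                    const std::vector<unsigned long long>& values) {
        assert(d.size() == values.size());  // arrays must be aligned
        std::vector<unsigned long long> out;
        for (std::size_t i = 0; i < d.size(); ++i) {
            if (d[i] == v1 || d[i] == v2) {
                out.push_back(values[i]);
            }
        }
        return out;
    };
}
```

The same callable can then be reused for several columns, in the same way dSelect is used for both e and t.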