Filtering columns using RDataFrame


ROOT Version: 6.24/02
Platform: Linux/Fedora 34
Compiler: gcc version 11.3.1 20220421


I have a tree that contains columns of variable-size arrays:

  1. d of type Int_t
  2. e of type ULong64_t
  3. t of type ULong64_t

Let’s assume the following arrays for this discussion:
d{1, 2, 3, 4}
e{100, 200, 300, 400}
t{1111, 2222, 3333, 4444}

Now, I want to filter the above columns so that only the elements of d equal to 2 or 3 are kept, together with the corresponding elements of e and t, i.e. after filtering I want the following:

d{2, 3}
e{200, 300}
t{2222, 3333}

How can this be achieved with RDataFrame? The resulting RDataFrame will be used for further analysis.

Any help is highly appreciated.

Regards,

Ajay

Hello,

I think the strategy is to Define a column holding a mask that flags the selected values of the d array (called indices below) and to use it to select the matching elements of e and t. We will use VecOps for the sake of this example:

ROOT::RDataFrame empty_df(1);
auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                  .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                  .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});

// Build the selection mask from d and apply it to the RVecs
auto df2 = df.Define("indices", "(d==2) || ( d==3 )")
             .Define("new_e", "e[indices]")
             .Define("new_t", "t[indices]");

// Show the result
auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
auto indices = df2.Take<ROOT::RVec<int>>("indices");
auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");

std::cout << e.GetValue()[0] << std::endl
          << t.GetValue()[0] << std::endl
          << indices.GetValue()[0] << std::endl
          << new_e.GetValue()[0] << std::endl
          << new_t.GetValue()[0] << std::endl;

If more performance is needed, the strings in the Defines can be replaced by the corresponding C++ (lambda) functions.
If the type of the e, t and d columns is not RVec, the code will have to be adapted accordingly (the VecOps documentation covers RVec, including the implementation of operator[]).
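
To make the masking step more concrete, here is a minimal standalone sketch of what an expression like e[indices] does conceptually. It deliberately does not use ROOT: std::vector stands in for RVec, and the helper names make_mask and apply_mask are hypothetical, introduced only for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a ROOT function): element-wise "d == a || d == b".
// Returns one int per element: 1 where the condition holds, 0 otherwise.
std::vector<int> make_mask(const std::vector<int>& d, int a, int b)
{
    std::vector<int> mask;
    mask.reserve(d.size());
    for (int x : d)
        mask.push_back((x == a || x == b) ? 1 : 0);
    return mask;
}

// Hypothetical helper (not a ROOT function): keep values[i] wherever mask[i] is non-zero.
template <typename T>
std::vector<T> apply_mask(const std::vector<T>& values, const std::vector<int>& mask)
{
    if (values.size() != mask.size())
        throw std::runtime_error("values and mask must have the same size");
    std::vector<T> out;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (mask[i])
            out.push_back(values[i]);
    return out;
}
```

With d{1, 2, 3, 4}, make_mask(d, 2, 3) gives {0, 1, 1, 0}, and applying that mask to e{100, 200, 300, 400} keeps {200, 300}. The same mask can be reused for t and for d itself, so it only has to be computed once per entry.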

I hope this helps.

Cheers,
D


Dear @Danilo,

Yes, this is exactly what I was looking for. Thank you very much for the kind help.

As you mentioned, it is a bit slower than the selection/filter I used in the meantime (although I know that filter doesn’t do exactly what I was looking for):

Filter("d[0] == 2 && d[1] == 3")

Could this also be due to the extra columns defined in the code above?

Regards,

Ajay

@Danilo

When you say "corresponding C++ (lambda) functions", do you mean using:

auto dSelect =  [](ROOT::RVec<int> x) {return x == 2 || x == 3;};
.
.
df.Define("indices", dSelect, {"d"})
  .Define("new_e", "e[indices]")
  .Define("new_t", "t[indices]");
.
.
instead of

df.Define("indices", "(d == 2 || d == 3)")
  .Define("new_e", "e[indices]")
  .Define("new_t", "t[indices]");
.
.

or something more?

Ajay

Hi Ajay,

Exactly, nothing more. For small workflows, you should not see any difference: the code JITted by ROOT is quite performant. However, those calls cannot be inlined and do not benefit from a full-fledged compiler pass that sees the rest of the code. If at some point you feel you need more performance, you know where to look first :)

Cheers,
D

Hi @Danilo,

Thank you for the message!

To learn more, I modified your code as follows, using a concise lambda.
I also did this because I want to pass the values of the elements of d to select as parameters.

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <iostream>

void d_filter(int d1, int d2)
{
   // Capture the two values by copy; the lambda needs nothing else from this scope.
   // Returns the elements of b at the positions where a equals d1 or d2.
   auto dSelect = [d1, d2](const ROOT::RVec<int> &a, const ROOT::RVec<ULong64_t> &b) {
      return b[a == d1 || a == d2];
   };

   // Toy dataset with a single entry
   ROOT::RDataFrame empty_df(1);
   auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                     .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                     .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});

   // The same lambda is reused for both columns
   auto df2 = df.Define("new_e", dSelect, {"d", "e"})
                .Define("new_t", dSelect, {"d", "t"});

   auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
   auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
   auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
   auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");

   // Print the original and the selected arrays of the first (only) entry
   std::cout << e.GetValue()[0] << std::endl
             << t.GetValue()[0] << std::endl
             << new_e.GetValue()[0] << std::endl
             << new_t.GetValue()[0] << std::endl;
}

Now I can run this code as:

.L d_filter.C++
d_filter(2,3)
d_filter(1,2)

etc. Please point out any mistake/bug in this code. If there is none, I will consider this the final version of what I wanted to achieve, as described in my first post in this thread.
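
For readers without a ROOT installation at hand, the parameterized selection can be sketched in plain C++ as well. This is only an illustration under stated assumptions: std::vector stands in for RVec, and make_selector is a hypothetical name, not something provided by ROOT. The two values are captured by copy, so the returned callable remains valid after the factory function has returned.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical factory (not part of ROOT): returns a callable that keeps
// values[i] wherever d[i] equals va or vb. va and vb are captured by copy.
inline auto make_selector(int va, int vb)
{
    return [va, vb](const std::vector<int>& d,
                    const std::vector<unsigned long long>& values) {
        std::vector<unsigned long long> out;
        // Guard against arrays of different length by using the shorter one.
        const std::size_t n = d.size() < values.size() ? d.size() : values.size();
        for (std::size_t i = 0; i < n; ++i)
            if (d[i] == va || d[i] == vb)
                out.push_back(values[i]);
        return out;
    };
}
```

Calling make_selector(2, 3) and applying the result to d{1, 2, 3, 4} and e{100, 200, 300, 400} yields {200, 300}; make_selector(1, 2) yields {100, 200} for the same inputs.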

Thank you once again for your help.

Pinging just to keep this thread alive.

Hi. No need to ping, it looks fine if you validated and benchmarked it.

What is the equivalent of std::any_of in ROOT::RVec or ROOT::VecOps?
I am unable to find it here.

Hi,

This should be a different post, but here I go.
You do not need a corresponding utility but can just use std::any_of:

root [0] ROOT::RVec<int> v {1,2,3,4}
(ROOT::RVec<int> &) { 1, 2, 3, 4 }
root [1] std::any_of(v.begin(), v.end(), [](int i){return i==3;})
(bool) true
root [2] std::any_of(v.begin(), v.end(), [](int i){return i==64;})
(bool) false
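
The same idea can be wrapped in a small helper if it is needed in several places. The following is a sketch that assumes only that the container exposes begin() and end(), which is why the standard algorithm works on it; contains_value is a hypothetical name, and std::vector can be used to try it outside of ROOT.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: true if at least one element of v compares equal to target.
// Works for any container providing begin()/end().
template <typename Container, typename T>
bool contains_value(const Container& v, const T& target)
{
    return std::any_of(v.begin(), v.end(),
                       [&target](const auto& x) { return x == target; });
}
```

For a container holding {1, 2, 3, 4}, contains_value(v, 3) is true and contains_value(v, 64) is false, matching the interactive session above.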

Thanks dear @Danilo!

May I also draw your attention to this post?

I have tried my best, but I am still unable to achieve what I describe in that post.

As you might have noticed from our discussion, I am slowly trying to port my data analysis to RDataFrame.