Filtering columns using RDataFrame

ajaydeo · January 19, 2024, 3:19pm

ROOT Version: 6.24/02
Platform: Linux/Fedora 34
Compiler: gcc version 11.3.1 20220421

I have a tree which contains columns of variable size arrays.

d of type Int_t
e of type ULong64_t
t of type ULong64_t

Let’s assume the following arrays for this discussion:
d{1, 2, 3, 4}
e{100, 200, 300, 400}
t{1111, 2222, 3333, 4444}

Now I want to filter these columns, keeping only the elements of d that are equal to 2 or 3, together with the corresponding elements of e and t. After filtering, I want the following:

d{2, 3}
e{200, 300}
t{2222, 3333}

How can this be achieved with RDataFrame? The resulting RDataFrame will be used for further analysis.

Any help is highly appreciated.

Regards,

Ajay

Danilo · January 21, 2024, 6:27am

Hello,

I think the strategy is to Define a column with the indices of the selected values of the d array and use them to filter the columns e and t. We will use VecOps for the sake of this example:

ROOT::RDataFrame empty_df(1);
auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                  .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                  .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});

// Extract indices and apply them to the rvecs
auto df2 = df.Define("indices", "(d==2) || ( d==3 )")
             .Define("new_e", "e[indices]")
             .Define("new_t", "t[indices]");

// Show the result
auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
auto indices = df2.Take<ROOT::RVec<int>>("indices");
auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");

std::cout << e.GetValue()[0] << std::endl
          << t.GetValue()[0] << std::endl
          << indices.GetValue()[0] << std::endl
          << new_e.GetValue()[0] << std::endl
          << new_t.GetValue()[0] << std::endl;

If more performance is needed, the strings in the Defines can be replaced with the corresponding C++ (lambda) functions.
If the type of the e, t and d columns is not RVec, the code will have to be adapted accordingly (see the VecOps documentation, which also describes the implementation of operator[]).
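
A note on naming: the column called "indices" above holds a 0/1 mask with one entry per element of d, rather than a list of positions, and e[indices] keeps the elements whose mask entry is non-zero. Below is a minimal sketch of that behaviour in plain standard C++, with std::vector standing in for ROOT::RVec. The helpers make_mask and apply_mask are hypothetical names used only for illustration; they are not part of ROOT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build a 0/1 mask, 1 where the element equals a or b.
std::vector<int> make_mask(const std::vector<int> &d, int a, int b)
{
   std::vector<int> mask;
   mask.reserve(d.size());
   for (int x : d)
      mask.push_back((x == a || x == b) ? 1 : 0);
   return mask;
}

// Hypothetical helper: keep the elements of `values` whose mask entry is non-zero.
template <typename T>
std::vector<T> apply_mask(const std::vector<T> &values, const std::vector<int> &mask)
{
   assert(values.size() == mask.size()); // mask and values must line up
   std::vector<T> out;
   for (std::size_t i = 0; i < values.size(); ++i)
      if (mask[i])
         out.push_back(values[i]);
   return out;
}
```

With d{1, 2, 3, 4}, make_mask(d, 2, 3) gives {0, 1, 1, 0}, and applying that mask to e{100, 200, 300, 400} gives {200, 300}.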

I hope this helps.

Cheers,
D

ajaydeo · January 21, 2024, 12:12pm

Dear @Danilo,

Yes, this is exactly what I was looking for. Thank you very much for the kind help.

As you mentioned, it is a bit slower than the selection/filter I used in the meantime (although I know that the filter doesn’t do exactly what I was looking for):

Filter("d[0] == 2 && d[1] == 3")

Could this also be due to the extra columns which are defined in the code above?
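
To make the difference between the two approaches concrete: the Filter above accepts or rejects a whole event, while the Define-based selection keeps individual array elements. Here is a minimal sketch of the event-level test in plain standard C++, with std::vector standing in for the array column. The function names are hypothetical, and the size check is an added assumption (the Filter string itself does not check bounds).

```cpp
#include <cassert>
#include <vector>

// Event-level test mirroring Filter("d[0] == 2 && d[1] == 3").
// The size check is an addition for safety, not part of the original string.
bool event_passes(const std::vector<int> &d)
{
   return d.size() >= 2 && d[0] == 2 && d[1] == 3;
}

// Self-check using the arrays discussed in this thread.
void check_example()
{
   assert(!event_passes({1, 2, 3, 4})); // rejected as a whole: d[0] is 1
   assert(event_passes({2, 3, 7}));     // kept as a whole, including the 7
}
```

So for d{1, 2, 3, 4} this test drops the entire event, whereas the mask-based selection keeps the elements 2 and 3.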

Regards,

Ajay

ajaydeo · January 21, 2024, 3:01pm

@Danilo

When you say - corresponding C++ (lambda) function, do you mean using:

auto dSelect =  [](ROOT::RVec<int> x) {return x == 2 || x == 3;};
.
.
df.Define("indices", dSelect, {"d"})
  .Define("new_e", "e[indices]")
  .Define("new_t", "t[indices]");
.
.
instead of

df.Define("indices", "(d == 2 || d == 3)")
  .Define("new_e", "e[indices]")
  .Define("new_t", "t[indices]");
.
.

or something more?

Ajay

Danilo · January 21, 2024, 5:17pm

Hi Ajay,

Exactly, nothing more. For small workflows, you should not see any difference: the code JITted by ROOT is quite performant. Of course, those calls cannot be inlined and do not benefit from a full-fledged compiler pass that sees the rest of the code. If at some point you feel you need more performance, you know where to look first.

Cheers,
D

ajaydeo · January 21, 2024, 6:15pm

Hi @Danilo,

Thank you for the message!

To learn more, I modified your code as follows, using a more concise lambda.
I also did this because I want to pass the values of the elements of d to select as parameters.

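#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <iostream>

void d_filter(int d1, int d2)
{
// Capture the two wanted values by copy; take the columns by const reference to avoid copying them
auto dSelect = [d1, d2](const ROOT::RVec<int> &a, const ROOT::RVec<ULong64_t> &b) {return b[a == d1 || a == d2];};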

ROOT::RDataFrame empty_df(1);
auto df = empty_df.Define("d", [](){return ROOT::RVec<int>({1, 2, 3, 4});})
                  .Define("e", [](){return ROOT::RVec<ULong64_t>({100, 200, 300, 400});})
                  .Define("t", [](){return ROOT::RVec<ULong64_t>({1111, 2222, 3333, 4444});});

auto df2 = df
        .Define("new_e", dSelect, {"d", "e"})
        .Define("new_t", dSelect, {"d", "t"});

auto e = df2.Take<ROOT::RVec<ULong64_t>>("e");
auto t = df2.Take<ROOT::RVec<ULong64_t>>("t");
auto new_e = df2.Take<ROOT::RVec<ULong64_t>>("new_e");
auto new_t = df2.Take<ROOT::RVec<ULong64_t>>("new_t");

std::cout << e.GetValue()[0] << std::endl
          << t.GetValue()[0] << std::endl
          << new_e.GetValue()[0] << std::endl
          << new_t.GetValue()[0] << std::endl;
}

Now I can run this code as:

.L d_filter.C++
d_filter(2,3)
d_filter(1,2)

etc. Please point out any mistake or bug in this code. If there are none, I consider this the final version of what I set out to achieve in my first post in this thread.
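
If more than two values ever need to be selected, the same idea might be generalised to a list of wanted values. The following is only a sketch in plain standard C++ (std::vector instead of ROOT::RVec); select_where_in is a hypothetical helper name, not a ROOT function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: keep values[i] wherever d[i] is one of `wanted`.
template <typename T>
std::vector<T> select_where_in(const std::vector<int> &d, const std::vector<T> &values,
                               const std::vector<int> &wanted)
{
   assert(d.size() == values.size()); // the two arrays must line up
   std::vector<T> out;
   for (std::size_t i = 0; i < d.size(); ++i)
      if (std::find(wanted.begin(), wanted.end(), d[i]) != wanted.end())
         out.push_back(values[i]);
   return out;
}
```

For d{1, 2, 3, 4} and e{100, 200, 300, 400}, the wanted list {1, 2} gives {100, 200}, matching what d_filter(1,2) is meant to select for new_e.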

Thank you once again for your help.

ajaydeo · January 31, 2024, 11:54am

Pinging just to keep this thread alive.

Danilo · January 31, 2024, 2:23pm

Hi. No need to ping, it looks fine if you validated and benchmarked it.

ajaydeo · February 14, 2024, 1:12pm

What is the equivalent of std::any_of in ROOT::RVec or ROOT::VecOps?
I am unable to find it in the documentation.

Danilo · February 14, 2024, 9:09pm

Hi,

This should be a different post, but here I go.
You do not need a corresponding utility but can just use std::any_of:

root [0] ROOT::RVec<int> v {1,2,3,4}
(ROOT::RVec<int> &) { 1, 2, 3, 4 }
root [1] std::any_of(v.begin(), v.end(), [](int i){return i==3;})
(bool) true
root [2] std::any_of(v.begin(), v.end(), [](int i){return i==64;})
(bool) false
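
In the context of this thread, std::any_of could for instance be used to ask whether an array contains any of the wanted d values at all. A minimal sketch in plain standard C++ follows; contains_any is a hypothetical helper name, and std::vector stands in for ROOT::RVec (which offers begin() and end() in the same way, as the session above shows).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: true if at least one element of d equals one of the wanted values.
bool contains_any(const std::vector<int> &d, const std::vector<int> &wanted)
{
   assert(!wanted.empty()); // an empty list of wanted values is treated as a caller error
   return std::any_of(d.begin(), d.end(), [&wanted](int x) {
      return std::find(wanted.begin(), wanted.end(), x) != wanted.end();
   });
}
```

For d{1, 2, 3, 4}, contains_any(d, {3, 64}) is true and contains_any(d, {64}) is false.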

ajaydeo · February 15, 2024, 6:28am

Thanks dear @Danilo!

May I please also bring your attention to this post.

I have tried my best, but I am still unable to achieve what I describe in that post.

As you might have noticed from our discussion, I am slowly trying to port my data analysis to RDataFrame.
