Filtering and saving new root files using RDataFrame

psadangi · April 28, 2021, 8:48am

Hello all,
I want to filter my ROOT files based on cuts applied to variables that exist in the files. I have found some discussions saying this can be done in RDataFrame using Filter(). I tried, but I couldn’t get it to work. For example, I have a variable ‘mass’ in the ROOT file, and I want a new ROOT file in which the cut ‘mass>5’ has been applied to all the variables. Is this possible? Please let me know how to use the branch variables in Filter().

There are other methods too, such as writing code that loops over the entries, but that is lengthy. I was curious whether I could do it in fewer lines of code using RDataFrame.

Thanks
Priyanka

eguiraud · April 28, 2021, 9:09am

Hi @psadangi ,
if mass is a scalar, and you want to select entries for which mass > 5, all you need is:

df.Filter("mass > 5").Snapshot("newtree", "newfile.root")

If mass is an array, and you want to select array elements of other variables that correspond to elements for which mass[i] > 5, then you have to write something like this:

df.Define("good_idx", "mass > 5")
  .Redefine("other_var", "other_var[good_idx]")
  .Redefine("other_var2", "other_var2[good_idx]")
  .Snapshot("newtree", "newfile.root")

(Redefine is currently only available in nightly builds; in v6.24 or earlier you would have to give the new column a different name.)

Please check the RDF user guide and our RDF tutorials for more info, or ask here if you have a specific question.

Cheers,
Enrico

psadangi · April 28, 2021, 9:26am

Hi @eguiraud ,

Thanks for your reply. My variables are vector<Float_t>, and I am using ROOT version 6.22. What do I need to do in this case?

Thanks
Priyanka

eguiraud · April 28, 2021, 9:33am

Then you are in case number 2 above: you define a “mask” of good indexes and then index each vector variable with it to select the elements that correspond to the good indexes.
There are also several tutorials that show how to work with collections in RDF.

The user guide has a section about working with collections that should help, and it points to the documentation of RVec which is the special vector-like type that defines those “fancy indexing” operations that we are using (RDF reads all collections as RVecs by default).
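
To make the masking idea concrete without ROOT, here is a plain C++ sketch of what "build a mask, then index with it" amounts to for a single entry. It uses std::vector as a stand-in for RVec, and the helper names MakeMask and ApplyMask are invented for this illustration; with RVec the same two steps are written as mass > 5 and other_var[good_idx].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1: one flag per element, 1 where the value passes the cut, 0 otherwise.
std::vector<int> MakeMask(const std::vector<float> &values, float threshold)
{
   std::vector<int> mask;
   mask.reserve(values.size());
   for (float v : values)
      mask.push_back(v > threshold ? 1 : 0);
   return mask;
}

// Step 2: keep only the elements whose flag is set. The mask must have the
// same length as the collection it is applied to.
std::vector<float> ApplyMask(const std::vector<float> &values, const std::vector<int> &mask)
{
   assert(values.size() == mask.size());
   std::vector<float> selected;
   for (std::size_t i = 0; i < values.size(); ++i) {
      if (mask[i])
         selected.push_back(values[i]);
   }
   return selected;
}
```

For one entry with mass = {2, 7, 5, 9} and another variable {10, 20, 30, 40}, the mask for mass > 5 is {0, 1, 0, 1} and the masked variable is {20, 40}. The same mask is reused for every collection, which is what keeps the selected elements aligned across variables.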

If the limitation of having to Define the filtered collections with a different name is too strong, you can get a ROOT build with the Redefine feature from our nightly releases.

Cheers,
Enrico
