Filter data frame using external flags



One use of adding a container to a dataframe is to provide a variable that can be used to remove entries:

name = 'var'
df = add_column(df, arr_val, name)  # attach arr_val as a column called 'var'
df = df.Filter('var > 0.3')         # keep only entries with var > 0.3

where the values of var are taken from arr_val. However, this approach has many caveats, as seen in the post above. Would it be more feasible to have a function that filters directly on an external container?

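import numexpr
arr_flg = numexpr.evaluate('var > 0.3', local_dict={'var': arr_val})  # boolean array, one flag per entry
df = df.Filter(arr_flg)  # proposed usage, not an existing Filter overload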

That is, Filter would take an array of bools instead of an expression string.



ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

Can you give some more details of which limitation you’re running into? Maybe by writing down the if statement in a pseudo for loop?

My guess is that you expect one entry of arr_val to filter one entry of the input (“event”); is that correct? In that case the most stable and performant solution is to create a friend tree that contains the flag for each entry of the input tree. You have to make sure that the order is the same, of course: the friend tree must be written once, without multithreading. After that you can add the auxiliary friend tree to the input tree and filter on its flag branch, and that step can run with multithreading.

I think the conceptual issues described in the post you link are still present with a Filter(arr_flg) method in the scenarios in which upstream filters or Range calls are present.
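To make the issue concrete, here is a minimal toy sketch in plain Python (no ROOT involved). The lists and the upstream cut are made up for illustration; the point is only that an array of flags is indexed by position in the full dataset, so once an upstream filter has dropped entries, consuming the flags in order of arrival no longer matches the intended entries.

```python
# Toy model: each "entry" is (entry_number, value). All values are illustrative.
entries = [(i, v) for i, v in enumerate([0.1, 0.9, 0.5, 0.2, 0.7])]

# External flags, one per entry of the *whole* dataset.
arr_flg = [False, True, True, False, True]

# A hypothetical upstream filter removes entry 1 before the flag filter runs.
upstream = [(i, v) for i, v in entries if v < 0.8]

# Naive reading: the n-th surviving entry is matched with the n-th flag.
by_position = [i for pos, (i, v) in enumerate(upstream) if arr_flg[pos]]

# Intended reading: each entry is matched with the flag at its original entry number.
by_entry = [i for i, v in upstream if arr_flg[i]]

print(by_position)  # [2, 3]
print(by_entry)     # [2, 4]
```

The two selections differ, so a Filter(arr_flg) would need access to the original entry number to be well defined after upstream filters or ranges.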

If the arr_flg array describes a selection of entries over the whole dataset (no upstream filters/ranges) then transforming that array into a TEntryList and attaching it to the TTree/TChain is a viable workaround. Or indeed you can use a friend tree as Axel suggests.
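As a rough sketch of the first step of that workaround, the boolean array can be turned into the list of entry numbers to keep. This is plain Python with made-up values; flags_to_entry_numbers is a hypothetical helper name, not a ROOT function.

```python
def flags_to_entry_numbers(flags):
    """Return the indices (entry numbers) whose flag is true."""
    return [i for i, keep in enumerate(flags) if keep]

# Illustrative values only; arr_flg mirrors the 'var > 0.3' selection above.
arr_val = [0.1, 0.9, 0.5, 0.2, 0.7]
arr_flg = [v > 0.3 for v in arr_val]

selected = flags_to_entry_numbers(arr_flg)
print(selected)  # [1, 2, 4]
```

With PyROOT and a single TTree, each of these numbers could then be passed to TEntryList.Enter and the resulting list attached with TTree.SetEntryList before building the dataframe. That part is not shown here, and a TChain may need extra care because entry lists are tracked per tree.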

Also see my errata in “Adding data from an external container to a DataFrame”, post #15 by eguiraud.

