[RDF] Ability to Drop Columns after application of selections

lost_soul_519 · June 2, 2025, 2:59pm

Hello,

I am trying to use RDataFrame for an analysis where I need to apply a selection to the RDF through a function, apply_selection. The selection is complex enough that I have to define some intermediate columns.
However, I do not want these columns to be present in the returned rdf_baseline: different selections define different columns, which pollutes the column list and can cause errors from redefinitions down the line.

def apply_selection(rdf):
    # Intermediate columns that are only needed by the filters below
    rdf_baseline = rdf.Define("Track_st1", "Track_z<1200")
    rdf_baseline = rdf_baseline.Define("Track_st2", "Track_z>1200")
    rdf_baseline = rdf_baseline.Define("Track_r", "Track_x*Track_x + Track_y*Track_y")
    # Chain each step on rdf_baseline (not rdf) so earlier Defines and Filters are kept
    rdf_baseline = rdf_baseline.Filter("Sum(Track_st1)>=2 && Sum(Track_r[Track_st1] < 100)>=2", "NTrackSt1>=2")
    rdf_baseline = rdf_baseline.Filter("Sum(Track_st1)>=2", "NTrackSt1>=2")
    # More conditions based on nontrivial defines...
    report = rdf_baseline.Report()
    return rdf_baseline, report

The alternatives I see, without dropping columns, are to store the new definitions as strings and pass them around as f-strings, or to do everything in a single function with an exceptionally large signature. I am just checking whether there are any other ways around this.
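
For illustration, the string-based alternative could look roughly like the sketch below. It builds the filter expression in plain Python by substituting sub-expressions into a template, so no extra column is ever defined. The names INTERMEDIATES and build_expression are hypothetical, and the sketch assumes the substituted expression is valid for RDataFrame's JIT-compiled Filter strings (the mask indexing in particular has not been checked against a real dataset).

```python
# Hypothetical helper: these names are illustrative, not part of RDataFrame.
# Each value is a C++ sub-expression, parenthesised so it can be embedded safely.
INTERMEDIATES = {
    "st1": "(Track_z < 1200)",
    "r": "(Track_x*Track_x + Track_y*Track_y)",
}


def build_expression(template, intermediates=INTERMEDIATES):
    """Substitute the named sub-expressions into a filter template."""
    return template.format(**intermediates)


baseline_cut = build_expression("Sum({st1}) >= 2 && Sum({r}[{st1}] < 100) >= 2")

# Intended use (requires ROOT, not executed here):
#   rdf_baseline = rdf.Filter(baseline_cut, "NTrackSt1>=2")
```

The trade-off is that every sub-expression is re-evaluated wherever it appears, and the resulting strings become hard to read as the selection grows.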

Thanks.

mczurylo · June 3, 2025, 8:56am

Hi @lost_soul_519,

thank you for your question. You could use Snapshot and specify the columns you want to save, but there is no dedicated functionality for dropping columns; you would need a helper function for that.

If you just want the columns from the original rdf, before the added Defines, you can simply first get a list of those columns using df.GetColumnNames() and use those in your Snapshot call.
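
A minimal sketch of that suggestion, assuming the dataframe passed in as rdf_original is the one from before the extra Defines. The function name snapshot_original_columns and the tree and file names in the usage comment are hypothetical; only GetColumnNames and Snapshot are RDataFrame calls.

```python
def snapshot_original_columns(rdf_original, rdf_selected, tree_name, file_name):
    """Write rdf_selected to file, keeping only the columns of rdf_original.

    rdf_original: the dataframe before any helper Defines were added.
    rdf_selected: the dataframe returned by the selection (with extra columns).
    """
    # GetColumnNames returns a vector of strings; convert to plain Python str.
    keep = [str(name) for name in rdf_original.GetColumnNames()]
    # Snapshot writes only the listed columns to tree_name in file_name.
    return rdf_selected.Snapshot(tree_name, file_name, keep)


# Intended use (requires ROOT, not executed here; names are placeholders):
#   rdf_baseline, report = apply_selection(rdf)
#   snapshot_original_columns(rdf, rdf_baseline, "Events", "selected.root")
```

Note that this writes an output file, so it only removes the helper columns from what is saved, not from the in-memory dataframe.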

Cheers,
Marta

lost_soul_519 · June 3, 2025, 9:22am

Thanks @mczurylo,
But Snapshot would create an output file, which isn’t the intended outcome here. I just need some “transient” variables to hold intermediate values that simplify apply_selection and then go “out of scope” at the end of the function (hence the idea of adding intermediate columns and dropping them before returning the RDF).
For now, I will just calculate the intermediates in a C++ function with a long signature.

Thanks again,

system · June 17, 2025, 9:22am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.