Tree analysis with PyROOT and RDataFrame


I am looking into DataFrames for the first time and have a hard time understanding how it works.

Say I have a tree with branches channel0, channel1, channel2… and each event is an array of floats, e.g.:

channel0 = [9.9, 13.4, 44.5, 12.3…]
channel1 = [1.1, 2.3, 0.5, 13.2…]

I load the tree into a DataFrame and want to perform some analysis and manipulations:

  1. Get the mean of each channel for each event but only consider the first 50 elements of the respective array for the calculation.
  2. Get the minimum value of the respective array for each event.
  3. Subtract the respective mean from this minimum for each channel and event.

How would I do that? Maybe someone could give me a rough outline. I would appreciate it, thank you!
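Purely to pin down what the three steps compute per event (this is not RDataFrame code, and the numbers are made up), here is the logic in plain Python on a single channel's array:

```python
def analyze_event(channel, n_first=50):
    """Per-event logic for one channel:
    1. mean of at most the first n_first samples,
    2. overall minimum of the array,
    3. minimum minus mean."""
    head = channel[:n_first]          # slicing is safe even if len(channel) < n_first
    mean = sum(head) / len(head)      # assumes the array is non-empty
    minimum = min(channel)
    return mean, minimum, minimum - mean

mean, minimum, diff = analyze_event([9.9, 13.4, 44.5, 12.3])
```

Inside RDataFrame the same three quantities would instead be booked as defined columns, one per channel, as sketched in the reply below.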

The ROOT::VecOps namespace has functions for computing means, minima, maxima, etc. on RVecs of data. Additionally there is a Take function that grabs only the first n elements of an array (plus an overload that selects items by a vector of indices, not needed here). I don’t know if you are guaranteed to have 50+ elements in these arrays, but there’s an attempt to protect against that below. Nothing protects against a completely empty array, though. Here’s the sketch:

#initial dataframe
rdf_base = ... #load data into dataframe

#counter to trigger our loop on
counter = rdf_base.Count()

#re-definable node for the loop
rdf = rdf_base

#dictionary for histograms
h = dict()

for n in range(4):
    #get the mean per event, using at most the first 50 elements
    rdf = rdf.Define("channel{n}_mean".format(n=n),
                     "Mean(Take(channel{n}, min(50, (int)channel{n}.size())))".format(n=n))
    #get the minimum per event
    rdf = rdf.Define("channel{n}_min".format(n=n),
                     "Min(channel{n})".format(n=n))
    #subtract the mean from the minimum
    rdf = rdf.Define("channel{n}_min_minus_mean".format(n=n),
                     "channel{n}_min - channel{n}_mean".format(n=n))
    h[str(n)] = rdf.Histo1D(("", "channel{n} min - mean".format(n=n), 100, 0, 40),
                            "channel{n}_min_minus_mean".format(n=n))

#trigger the event loop once for all booked results
n_events = counter.GetValue()

#draw histograms
for key in h:
    h[key].Draw()

Hey thanks a lot that really helps!

But I have a question. Is “rdf” a new column with values that get overwritten every time the loop goes to the next channel?
Because in the end I want to set two thresholds, say threshA and threshB, and look at specific channels, e.g. channel1 and channel2.
Then, IF the minimum value of the event in channel1 OR channel2 is greater than threshA, look at channel3 and channel4 and check IF the minimum value of the event is greater than threshB. Count all events for which this condition holds.
So I guess I would need four new columns to accomplish this?

rdf is a dataframe node. RDataFrame sets up a computation graph (a list of transformations and other actions to perform on the events) before turning them into results of some kind.

rdf is not a column. When you call the Define operation on a node, you get back a new node which has ‘booked’ a new column with the given expression; when the event loop runs, that column will be available in that node and in any node inheriting from it.

The computation graph can be branched, and in separate branches you can define columns with the same name but different code, properties, even data types. However, in the simplest case you create a linear graph, and in a linear graph every defined column should be uniquely named. In the sketch above, the columns get their unique names from formatting the first string in each Define call:

channel0_mean, channel0_min, channel0_min_minus_mean,
channel1_mean, channel1_min, channel1_min_minus_mean,
… and so on up to channel3_min_minus_mean.
In the rdf node after the loop, all of these columns will be available for further use.

If, instead of using a loop to continuously redefine ‘rdf’ as ‘rdf.Define(something, code)’, you put the result of each triplet of Define calls into a dictionary, with each first Define call starting from ‘rdf_base’, then you would NOT be able to use all the columns at once, because the mean, min, and min_minus_mean of each channel would exist in one of 4 separate branches of the computation graph.

Thanks for clarification!
I still have some problems understanding the concept. I still think of the DataFrame as a Frame with columns and rows. Not sure if it is the right way to think about it. I have never heard of the term computation graph.

How can I now retrieve the values in the columns, e.g. the “channel0_min_minus_mean”?
Then, how do I compute: For each event look into channel0_min_minus_mean and see if value > a. If so look into channel1_min_minus_mean and see if value > b. If so add 1 to counter_channel1 and print values in channel1 (from initial dataframe).

Hi @Tim_Buktu ,
the RDataFrame user guide and the tutorials will hopefully clarify most of the concepts and provide useful usage examples.

It depends on what you want to do with them afterwards. There is a method to obtain those values as vectors (Take, see the docs linked above), but if you already know you will do some more processing and eventually extract some statistics from those values it might be simpler/more efficient to do everything within RDF.

If I understand correctly, for example you could do:

count = df.Filter("channel0_min_minus_mean > a && channel1_min_minus_mean > b")\
          .Count().GetValue()

For the printing, it again depends on what you need to do with the printed values. It’s probably simplest to aggregate the values in an array and print the elements of the array afterwards (e.g. with Take, or AsNumpy).
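As a plain-Python illustration of that filter-count-and-collect pattern (the column names follow the thread; the thresholds and event values here are made-up placeholders), the equivalent per-event logic is:

```python
def count_passing(events, a, b):
    """Mimics filtering on channel0_min_minus_mean > a && channel1_min_minus_mean > b,
    counting the surviving events and keeping them around for later printing.
    events: list of dicts holding the already-defined columns."""
    selected = [ev for ev in events
                if ev["channel0_min_minus_mean"] > a and ev["channel1_min_minus_mean"] > b]
    return len(selected), selected

# made-up example events
events = [
    {"channel0_min_minus_mean": 5.0, "channel1_min_minus_mean": 7.0},
    {"channel0_min_minus_mean": 1.0, "channel1_min_minus_mean": 9.0},
    {"channel0_min_minus_mean": 6.0, "channel1_min_minus_mean": 2.0},
]
n, passing = count_passing(events, a=2.0, b=3.0)
for ev in passing:
    print(ev["channel1_min_minus_mean"])
```

In RDF itself, the Filter + Count (and Take or AsNumpy for the values) do exactly this job inside the event loop, without materializing all events in Python first.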


Ok, thank you for your input, that really helped me.

Last question: I tried to return “rdf” at the end of a function, in order to reuse it in other functions, but that did not seem to work. Is it generally possible?

Yes, that should not be a problem, e.g. this works:

import ROOT

def get_df():
    df = ROOT.RDataFrame(10)
    return df.Filter("rdfentry_ % 2 == 0")

if __name__ == "__main__":
    df = get_df()

Feel free to post a minimal, self-contained reproducer of your issue so we can take a look.

