Event mixing with RDF

riccardomanzoni · June 5, 2023, 9:03am

Dear all,

I would like to perform event mixing using RDataFrame.

What I have in mind is to take some quantities from the i-th row and some from the (i+1)-th row to create a fictitious “mixed” event.

In the example below, I want x from the i-th row and y from the (i+1)-th row:

# example from 
# https://root-forum.cern.ch/t/saving-pandas-dataframe-as-ttree-with-rdataframe/42720/2

# Create a pandas dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['x'] = np.array([1, 2, 3])
df['y'] = np.array([4, 5, 6])

# Convert data to a dictionary with numpy arrays
data = {key: df[key].values for key in df.columns}

# Create an RDataFrame from the dictionary of numpy arrays
import ROOT
rdf = ROOT.RDF.MakeNumpyDataFrame(data)

# Print the contents of the dataframe
rdf.Display().Print()

This returns:

+-----+---+---+
| Row | x | y |
+-----+---+---+
| 0   | 1 | 4 |
+-----+---+---+
| 1   | 2 | 5 |
+-----+---+---+
| 2   | 3 | 6 |
+-----+---+---+

My goal is to obtain:

+-----+---+---+
| Row | x | y |
+-----+---+---+
| 0   | 1 | 5 |
+-----+---+---+
| 1   | 2 | 6 |
+-----+---+---+
| 2   | 3 | 4 |
+-----+---+---+

(I wouldn’t mind if the first or last row were dropped, since they sit at the boundaries of the row range.)

Do you have any suggestions on how to achieve this in a smart way?

Thanks!
Riccardo

vpadulan · June 5, 2023, 8:54pm

Dear @riccardomanzoni ,

Let me make an extreme simplification here. The execution of computations in RDataFrame can be boiled down to

for (Long64_t i = 0; i < tree.GetEntries(); ++i) {
    tree.GetEntry(i);        // load entry i for all the columns in use
    run_computations(tree);  // every computation sees only this entry
}

Of course there are many clever things around it, but this is just to give the idea that it traverses the input dataset one entry at a time, for the columns that you need in your application. All the columns are queried with the same entry number, so I don’t see a clear way to implement your use case directly within the existing API and machinery. In principle we could add an action that lags the values of a certain column, so that subsequent calls to the API in the same branch of the computation graph would see the lagged values. I don’t think that will be high priority for the time being.

As a not-so-efficient workaround, you could think about preparing your input dataset before creating the RDataFrame, i.e. having the correctly shifted arrays as input dataset to RDF.
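
A minimal sketch of that workaround, using the x and y arrays from the original post. The helper name mix_with_next and its wrap flag are made up for illustration and are not part of ROOT or numpy. It shifts y by one row before the dataframe is built, either wrapping the first value around to the end or clipping the boundary row:

```python
import numpy as np

def mix_with_next(x, y, wrap=True):
    """Pair x[i] with y[i+1] (hypothetical helper, for illustration only).

    wrap=True  : the last x is paired with the first y (circular shift).
    wrap=False : the last row is dropped instead.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if wrap:
        # np.roll with a shift of -1 moves every element one slot earlier
        return {"x": x.copy(), "y": np.roll(y, -1)}
    # Copies keep the arrays contiguous and independent of the inputs
    return {"x": x[:-1].copy(), "y": y[1:].copy()}

data = mix_with_next(np.array([1, 2, 3]), np.array([4, 5, 6]))
clipped = mix_with_next(np.array([1, 2, 3]), np.array([4, 5, 6]), wrap=False)

# The resulting dictionary can then be passed to RDataFrame exactly as in
# the original post:
# rdf = ROOT.RDF.MakeNumpyDataFrame(data)
```

With wrap=True the y column becomes 5, 6, 4, which matches the table in the question. With wrap=False only the rows (1, 5) and (2, 6) remain. The cost is that the shifted copy of each mixed column has to fit in memory before RDataFrame starts.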

Cheers,
Vincenzo

eguiraud · June 5, 2023, 10:18pm

Hi,

See Event mixing with RDataFrame for an example of a simple sliding window implementation.
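
The code from the linked thread is not reproduced here. As a rough, library-independent sketch of what a sliding window over rows can look like, the hypothetical SlidingPair class below remembers x from the previous row and pairs it with y from the current row. It assumes rows are visited sequentially and in order, so an approach like this would only make sense in a single-threaded event loop:

```python
class SlidingPair:
    """Stateful callable: pairs the previous row's x with the current row's y.

    Hypothetical helper for illustration; not part of ROOT.
    """

    def __init__(self):
        self._prev_x = None
        self._has_prev = False

    def __call__(self, x, y):
        # The first row has no predecessor, so no mixed event is produced
        mixed = (self._prev_x, y) if self._has_prev else None
        self._prev_x = x
        self._has_prev = True
        return mixed

rows = [(1, 4), (2, 5), (3, 6)]  # (x, y) per row, as in the question
window = SlidingPair()
mixed = [m for m in (window(x, y) for x, y in rows) if m is not None]
```

Here mixed contains (1, 5) and (2, 6): x from row i combined with y from row i+1, with the boundary row clipped.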

Cheers,
Enrico
