RDataFrame in PyROOT: using Define() to add a random number to each entry of a column

Hi! I’m working on a function to “smear” my dataset, adding a randomly generated uncertainty to each entry in one or more columns of an RDataFrame.

I have partly managed to do this by importing C++ code via gInterpreter and then using the Define() function, but it applies the same random number to every entry. The number changes if I run the whole code again, so I guess it’s because Define doesn’t run the function element-wise. Here is the relevant code:

ROOT.gInterpreter.ProcessLine('#include "addrand.h"')
df = dataframe.Define("{}_smeared".format(to_smear), "addrand_gauss(time[0])")

The addrand.h file contains this (written in C++):

#include <iostream>
#include <random>
#include <vector>
#include <ctime>

double addrand_gauss(double x, double t_resol=1) {
	std::default_random_engine gen(std::time(nullptr));
	std::normal_distribution<double> nd(0, t_resol); 
	x += nd(gen);
	return x;
}

I did try using ForEach(), which would run the function for each entry, but IPython replies “‘RDataFrame’ object has no attribute ‘ForEach’”. I see in other posts in this forum that PyROOT doesn’t yet support ForEach.

Is there any way I can work around this? I have thought about using the AsNumpy() function, but since my dataset is quite big, it takes extremely long to process.

Any suggestions are very welcome!
Luna

ROOT Version: 6.22/06
Platform: Linux (WSL)
Compiler:
ROOT installed through conda forge

Hi Luna,
Define runs for every entry. I think the problem is this line:

std::default_random_engine gen(std::time(nullptr));

For every entry, that line creates a new generator seeded with the current second. Since many entries are processed per second, they all get the same random sequence, and therefore the same first draw. You should create and seed the generator once, outside of the function.
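One possible way to write this is sketched below. It is not necessarily the exact fix used in this thread: the helper name addrand_engine is hypothetical, and the switch to std::mt19937 seeded from std::random_device is an assumption made for illustration (a single time-based seed would also work, as long as the engine is created only once). The signature of addrand_gauss is unchanged.

```cpp
// addrand.h (sketch): create and seed the generator once, reuse it on every call.
#include <random>

// Hypothetical helper, not in the original header. The function-local static
// engine is constructed and seeded only the first time this is called.
// Assumption: std::random_device is used as the seed source.
inline std::mt19937 &addrand_engine() {
    static std::mt19937 gen(std::random_device{}());
    return gen;
}

// Same signature as the original. The engine is no longer re-created per
// call, so each entry advances the same random sequence and gets a new draw.
inline double addrand_gauss(double x, double t_resol = 1) {
    std::normal_distribution<double> nd(0, t_resol);
    return x + nd(addrand_engine());
}
```

The Python side from the question should not need to change. Note that a single shared engine is not safe to call from several threads at once, so if implicit multi-threading is enabled, declaring the engine thread_local instead of static might be needed.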

Cheers,
Enrico

Hi Enrico,

Thank you! It now works like a dream. I have a small follow-up question: if Define also runs for every entry, what is then the difference between Define and ForEach? Only that the first one makes a new branch and the second one doesn’t?

Yes, more precisely:

Define returns a new dataframe object with that extra column defined. Nothing is computed at the point you call Define, no loop over data is started. When computation eventually starts (the first time one of the RDataFrame results is accessed), Defines are called as needed: for example, if an entry does not pass a Filter, it is immediately skipped and some Defines or other operations might be skipped.

ForEach returns nothing and immediately starts the loop over events.

Cheers,
Enrico
