Using random with RDataFrame

FoxWise · March 18, 2024, 3:35pm

Dear all,

I have noticed that using a C++ random generator with RDataFrame produces somewhat inconsistent output depending on how many columns I define(?).
I don’t understand why.

I am curious if somebody else can reproduce this behaviour…

Is this a bug in my code? Am I using the random generator incorrectly?

Reproducer


import ROOT
import numpy as np
ROOT.EnableImplicitMT()


ROOT.gInterpreter.Declare(
'''
#include "Math/Vector3D.h"
#include <ROOT/RVec.hxx>
#include <vector>
#include <algorithm>
#include <map>
#include <random>

std::random_device rd;
std::mt19937 e{rd()}; // or std::default_random_engine e{rd()};
std::normal_distribution<float> gaus(0., 1000.);

float generateGaussNumber(){
    return gaus(e);
}

''')

print("Working with one column: ")
stds = []
for i in range(10):
    data = ROOT.RDataFrame(10000000).Define("x", "generateGaussNumber()")

    data = data.AsNumpy(["x"])
    std = data["x"].std()
    print(f"STD # {i} : {std:.5f}")
    stds.append( std )
stds = np.array(stds)
print( "On average from 10 measurements I get STD: ", stds.mean(), "+-", stds.std() )


print("I want to define another column: ")
stds = []
for i in range(10):
    data = ROOT.RDataFrame(10000000).Define("x", "generateGaussNumber()")

    # Additionally define three derived columns.
    data = data.Define("y", "x*x")
    data = data.Define("z", "x*x*x")
    data = data.Define("p", "y*y*y*y")

    data = data.AsNumpy(["x", "y", "z","p"])
    std = data["x"].std()
    print(f"STD # {i} : {std:.5f}")
    stds.append( std )
stds = np.array(stds)
print( "On average from 10 measurements I get STD: ", stds.mean(), "+-", stds.std() )

Output

Working with one column: 
STD # 0 : 1028.14478
STD # 1 : 1029.01526
STD # 2 : 1029.93323
STD # 3 : 1029.61365
STD # 4 : 1029.43274
STD # 5 : 1029.18030
STD # 6 : 1028.67773
STD # 7 : 1027.68787
STD # 8 : 1027.85901
STD # 9 : 1027.78064
On average from 10 measurements I get STD:  1028.7325 +- 0.7808765
I want to define another column: 
STD # 0 : 1019.32581
STD # 1 : 1020.55157
STD # 2 : 1020.68372
STD # 3 : 1019.36206
STD # 4 : 1019.64514
STD # 5 : 1018.41907
STD # 6 : 1018.47748
STD # 7 : 1020.12415
STD # 8 : 1019.12787
STD # 9 : 1020.39062
On average from 10 measurements I get STD:  1019.6107 +- 0.77395064

The difference is significant. ~10 sigma.
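
For reference, the "~10 sigma" estimate can be checked directly from the two summary lines above. This is only a rough sketch: it assumes the two sets of runs are independent and that the quoted +- values are the run-to-run scatter. The variable names are illustrative.

```python
import math

# Means and run-to-run scatter copied from the two summary lines of the output.
mean_one, scatter_one = 1028.7325, 0.7808765      # one column
mean_four, scatter_four = 1019.6107, 0.77395064   # four columns
n_runs = 10

diff = mean_one - mean_four

# Convention 1: compare the difference against the combined single-run scatter.
z_single_run = diff / math.hypot(scatter_one, scatter_four)

# Convention 2: compare against the combined standard error of the two means,
# which is smaller than the single-run scatter by sqrt(n_runs).
z_of_means = z_single_run * math.sqrt(n_runs)

print(f"difference: {diff:.2f}")
print(f"relative to single-run scatter: {z_single_run:.1f} sigma")
print(f"relative to error on the means: {z_of_means:.1f} sigma")
```

With either convention the two configurations disagree by far more than their scatter, and both sit well above the nominal width of 1000 passed to normal_distribution.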

Environment

ROOT Version: 6.28/10
Python 3.10.1
OS: Centos 7
g++ (Spack GCC) 12.2.0

FoxWise · March 18, 2024, 3:38pm

Without

ROOT.EnableImplicitMT()

results seem identical and unbiased as well. Is multithreading to blame somehow?

Working with one column: 
STD # 0 : 1000.07404
STD # 1 : 1000.04071
STD # 2 : 999.78540
STD # 3 : 999.90570
STD # 4 : 1000.38776
STD # 5 : 999.66693
STD # 6 : 1000.35681
STD # 7 : 1000.00623
STD # 8 : 999.21265
STD # 9 : 1000.02081
On average from 10 measurements I get STD:  999.9457 +- 0.32273895
I want to define another column: 
STD # 0 : 999.78998
STD # 1 : 999.71466
STD # 2 : 999.71759
STD # 3 : 1000.44550
STD # 4 : 1000.12805
STD # 5 : 1000.19830
STD # 6 : 999.87994
STD # 7 : 1000.06195
STD # 8 : 999.80225
STD # 9 : 1000.08185
On average from 10 measurements I get STD:  999.98206 +- 0.22809874

I am on a machine with 20 cores
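
As a sanity check on these single-threaded numbers: for N independent Gaussian draws, the sample standard deviation fluctuates by roughly sigma / sqrt(2N) (a large-N approximation). The sketch below evaluates that estimate for the values used in the reproducer; the variable names are illustrative.

```python
import math

sigma = 1000.0           # width passed to std::normal_distribution
n_entries = 10_000_000   # entries per RDataFrame in the reproducer

# Large-N approximation for the statistical error on a sample standard deviation.
expected_scatter = sigma / math.sqrt(2 * n_entries)

# Run-to-run scatter reported in the single-threaded output above.
observed_scatter = {"one column": 0.32273895, "four columns": 0.22809874}

print(f"expected scatter: {expected_scatter:.3f}")
for label, value in observed_scatter.items():
    print(f"{label}: observed {value:.3f}, ratio {value / expected_scatter:.2f}")
```

The observed spreads are of the same order as this estimate, which is consistent with the single-threaded results behaving as expected, whereas the multithreaded runs scatter by about 0.78.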

Danilo · March 18, 2024, 6:28pm

Hi,

Thanks for the interesting post. I think the behaviour is not really due to RDF, but to multithreading. The code uses a single global instance of the generator, which all threads access concurrently, and that triggers race conditions.
If you want to generate numbers in parallel with RDF, my advice would be to use an array/vector of generators that you access through the DefineSlot method.

I hope this helps.

Cheers,
D
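
The "one generator per slot" idea can be sketched independently of ROOT. The example below is a pure-Python analogy, not RDataFrame code: plain threads stand in for RDF processing slots, and N_SLOTS, N_PER_SLOT, engines and worker are hypothetical names chosen for illustration. In the actual reproducer the vector of engines would live in the C++ code passed to ROOT.gInterpreter.Declare and be indexed by the slot number.

```python
import math
import random
import threading

N_SLOTS = 4            # hypothetical number of processing slots
N_PER_SLOT = 50_000    # hypothetical number of entries handled by each slot
SIGMA = 1000.0         # same width as in the reproducer

# One independent engine per slot, each with its own seed, so that no
# generator state is ever shared between threads.
seeder = random.SystemRandom()
engines = [random.Random(seeder.getrandbits(64)) for _ in range(N_SLOTS)]

def generate_gauss_number(slot):
    """Draw a Gaussian number from the engine owned by the given slot."""
    return engines[slot].gauss(0.0, SIGMA)

results = [[] for _ in range(N_SLOTS)]

def worker(slot):
    # Each thread only touches its own engine and its own output list.
    out = results[slot]
    for _ in range(N_PER_SLOT):
        out.append(generate_gauss_number(slot))

threads = [threading.Thread(target=worker, args=(s,)) for s in range(N_SLOTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Combine the per-slot outputs and compute the overall standard deviation.
values = [v for chunk in results for v in chunk]
mean = math.fsum(values) / len(values)
sample_std = math.sqrt(math.fsum((v - mean) ** 2 for v in values) / len(values))
print(f"std over {len(values)} draws: {sample_std:.2f}")
```

The design point is that the slot index selects the engine, so concurrent calls never touch the same generator state. Seeding each engine separately keeps the streams from being identical copies of one another.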

FoxWise · March 20, 2024, 2:48pm

Hi Danilo,

Thanks for the hint!

I have fixed the problem for me by using Define() and passing rdfslot_ to the function, e.g.:

    # using std::vector of engines
    data = df.Define("x", "generateGaussNumber(rdfslot_)")

I have tried to use DefineSlot(), but it failed for me…

The documentation says to use it as follows:

int function(unsigned int, double, double);
df.Define("x", function, {"rdfslot_", "column1", "column2"})
df.DefineSlot("x", function, {"column1", "column2"})

Converting this to PyROOT, I tried something similar, but all of the following resulted in a segmentation fault for me:

df.Define("x", "generateGaussNumber", ["rdfslot_"])
df.Define("x", "generateGaussNumber()", ["rdfslot_"])
df.DefineSlot("x", "generateGaussNumber")
df.DefineSlot("x", "generateGaussNumber", [])

Is there any working example of how to use DefineSlot() with pyROOT?

Danilo · March 20, 2024, 2:51pm

Hi,

Thanks for sharing this nice progress!
I see you have a solution for the moment. On our side, we’ll look into the DefineSlot issues in Python. I am sorry you experienced those.

Best,
D
