Dear all,
I have noticed that using a C++ random generator with RDataFrame produces inconsistent output depending on how many columns I define.
I don't understand why.
Can somebody else reproduce this behaviour?
Is it a bug in my code? Am I using the random generator incorrectly?
Reproducer
import ROOT
import numpy as np
ROOT.EnableImplicitMT()
ROOT.gInterpreter.Declare(
'''
#include "Math/Vector3D.h"
#include <ROOT/RVec.hxx>
#include <vector>
#include <algorithm>
#include <map>
#include <random>
std::random_device rd;
std::mt19937 e{rd()}; // or std::default_random_engine e{rd()};
std::normal_distribution<float> gaus(0., 1000.);
float generateGaussNumber(){
    return gaus(e);
}
''')
print("Working with one column: ")
stds = []
for i in range(10):
    data = ROOT.RDataFrame(10000000).Define("x", "generateGaussNumber()")
    data = data.AsNumpy(["x"])
    std = data["x"].std()
    print(f"STD # {i} : {std:.5f}")
    stds.append( std )
stds = np.array(stds)
print( "On average from 10 measurements I get STD: ", stds.mean(), "+-", stds.std() )
print("I want to define another column: ")
stds = []
for i in range(10):
    data = ROOT.RDataFrame(10000000).Define("x", "generateGaussNumber()")
    # Let me additionally add this column.
    data = data.Define("y", "x*x")
    data = data.Define("z", "x*x*x")
    data = data.Define("p", "y*y*y*y")
    data = data.AsNumpy(["x", "y", "z","p"])
    std = data["x"].std()
    print(f"STD # {i} : {std:.5f}")
    stds.append( std )
stds = np.array(stds)
print( "On average from 10 measurements I get STD: ", stds.mean(), "+-", stds.std() )
Output
Working with one column:
STD # 0 : 1028.14478
STD # 1 : 1029.01526
STD # 2 : 1029.93323
STD # 3 : 1029.61365
STD # 4 : 1029.43274
STD # 5 : 1029.18030
STD # 6 : 1028.67773
STD # 7 : 1027.68787
STD # 8 : 1027.85901
STD # 9 : 1027.78064
On average from 10 measurements I get STD: 1028.7325 +- 0.7808765
I want to define another column:
STD # 0 : 1019.32581
STD # 1 : 1020.55157
STD # 2 : 1020.68372
STD # 3 : 1019.36206
STD # 4 : 1019.64514
STD # 5 : 1018.41907
STD # 6 : 1018.47748
STD # 7 : 1020.12415
STD # 8 : 1019.12787
STD # 9 : 1020.39062
On average from 10 measurements I get STD: 1019.6107 +- 0.77395064
The difference is significant: the two averages differ by about 9.1, while the run-to-run spread is only about 0.78, so roughly 10 sigma.
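For reference, this estimate can be rechecked from the numbers printed above with a short sketch like the one below. It uses only the Python standard library. The variable names are introduced here for illustration and are not part of the reproducer, and dividing by the larger of the two run-to-run spreads is just one simple way to express the significance, not the only one.

```python
import statistics

# STD values copied from the "Output" section above.
std_one_column = [
    1028.14478, 1029.01526, 1029.93323, 1029.61365, 1029.43274,
    1029.18030, 1028.67773, 1027.68787, 1027.85901, 1027.78064,
]
std_four_columns = [
    1019.32581, 1020.55157, 1020.68372, 1019.36206, 1019.64514,
    1018.41907, 1018.47748, 1020.12415, 1019.12787, 1020.39062,
]

mean_one = statistics.fmean(std_one_column)
mean_four = statistics.fmean(std_four_columns)

# Population standard deviation (ddof=0), the same convention as numpy's default .std().
spread_one = statistics.pstdev(std_one_column)
spread_four = statistics.pstdev(std_four_columns)

# Difference of the two averages, expressed in units of the larger run-to-run spread.
diff = mean_one - mean_four
n_sigma = diff / max(spread_one, spread_four)

print(f"difference of means: {diff:.3f}")
print(f"in units of run-to-run spread: {n_sigma:.1f}")
```

Using the standard error of the mean instead of the per-run spread would give an even larger number, so the rough "10 sigma" is on the conservative side.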
Environment
ROOT Version: 6.28/10
Python 3.10.1
OS: CentOS 7
g++ (Spack GCC) 12.2.0