ROOT.RDF.MakeNumpyDataFrame producing wrong histo

Chinmay · April 20, 2022, 9:00am


ROOT Version: 6.26.0
Platform: Ubuntu 20.04
Compiler: gcc 10.3.0

Hi. I am using PyROOT to export my .root data to numpy arrays, run ML functions from scikit-learn on them,
and then plot histograms of the resulting numpy arrays by converting them back to
RDataFrames using MakeNumpyDataFrame. However, the histograms differ depending on whether they are
plotted using matplotlib.pyplot (which seems to give the right ones, considering all the numbers)
or using RDataFrame.Histo1D().
I am attaching the PyROOT script here.
rf_script.py (10.4 KB)
I don’t know what is wrong.

eguiraud · April 20, 2022, 10:31am

Hi @Chinmay ,
I cannot tell what is wrong from your description. Do value statistics (e.g. mean and standard deviation) of these arrays differ if you evaluate them with RDataFrame+MakeNumpyDataFrame or directly with numpy? Does df.Display().Print() show values that are different from what print(array) shows?
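
A minimal sketch of such a comparison could look like the following. The helper names numpy_stats and rdf_stats, the column name "x" and the sample values are made up for illustration. The ROOT part assumes ROOT 6.26 with PyROOT available and is left as a comment so that the numpy part runs on its own.

```python
import numpy as np

def numpy_stats(arr):
    """Summary statistics computed directly with numpy."""
    arr = np.asarray(arr)
    return {"mean": float(arr.mean()),
            "min": float(arr.min()),
            "max": float(arr.max())}

def rdf_stats(arr, column="x"):
    """The same statistics, computed after a round trip through RDataFrame."""
    import ROOT  # imported lazily so numpy_stats also works without ROOT
    df = ROOT.RDF.MakeNumpyDataFrame({column: arr})
    df.Display().Print()  # show the first rows as RDataFrame sees them
    return {"mean": df.Mean(column).GetValue(),
            "min": df.Min(column).GetValue(),
            "max": df.Max(column).GetValue()}

# Hypothetical stand-in for one of the score arrays.
scores = np.array([0.1, 0.4, 0.7, 0.9])
print(numpy_stats(scores))
# With ROOT installed, the two dictionaries should agree:
# print(rdf_stats(scores))
```

If the two sets of numbers disagree, the array is already being read incorrectly on the way into RDataFrame, before any histogramming takes place.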

Cheers,
Enrico

Chinmay · April 20, 2022, 11:54am

Actually, the above script requires 2 data files.

The script reads data from .root files into an RDataFrame.
It applies some filters and some defines.
It exports the defined columns to numpy arrays using RDataFrame.AsNumpy.
It trains a RandomForestClassifier with scikit-learn, using the above numpy arrays as input.
To evaluate the performance of the trained model I inspect the classification error etc., which shows that the trained model is good.
To further confirm the performance, I plot histograms of the ‘classification scores’ of signal events and background events, using the numpy score arrays returned by scikit-learn. In one case I plot the histogram using matplotlib.pyplot, and in the other I convert the score arrays
to an RDataFrame using MakeNumpyDataFrame and then use the RDataFrame.Histo1D method. (See lines 153 to 183 of the script.)
The 2 histograms generated with these 2 methods are different.
The one generated with pyplot seems right.
I am attaching the two plots here.
RFM1_root.pdf (22.2 KB)
RFM1_pyplot.pdf (6.5 KB)

The orange histogram in the pyplot-generated plot corresponds to the red one in the ROOT-generated file.

Chinmay · April 20, 2022, 12:35pm

I checked this part, and it turns out it is so.

eguiraud · April 20, 2022, 1:25pm

Alright, can you please provide a minimal reproducer that I can debug? E.g. some numpy arrays in a .npz file that have different values when read into RDF?

Cheers,
Enrico

Chinmay · April 20, 2022, 4:39pm

I can give you the data files on which you can run the above script. They total 30 MB. Where can I share them?
I think the problem has something to do with these statements:

np_signal =  np_target_prob[np_target_test == 1][:,0]
np_bkg = np_target_prob[np_target_test == -1][:,0]

The right-hand sides of the above statements probably return only ‘view’ objects of numpy arrays. I think that when I use MakeNumpyDataFrame on these ‘view’ objects, it doesn’t work as expected.
The same script can be used to train a random forest regressor, following the same steps as described in my 2nd post. I am not facing this problem in that case.
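
The ‘view’ suspicion can be checked with numpy alone. The sketch below uses small made-up arrays in place of the real ones: a two-column probability array of the kind a classifier's predict_proba returns, and labels in {-1, 1}. The boolean mask produces a fresh two-column array, and the [:,0] slice is then a strided view into it rather than a contiguous block of memory.

```python
import numpy as np

# Hypothetical stand-ins for the arrays in the script.
np_target_prob = np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.7, 0.3],
                           [0.4, 0.6]])
np_target_test = np.array([1, -1, 1, 1])

np_signal = np_target_prob[np_target_test == 1][:, 0]

print(np_signal)                        # the values numpy itself reports
print(np_signal.base is not None)       # True means the array is a view
print(np_signal.flags["C_CONTIGUOUS"])  # False means elements are not adjacent in memory
print(np_signal.strides)                # bytes between consecutive elements
```

With float64 data the stride comes out as 16 bytes instead of 8, because the values of the second column sit in between the values of the first.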

eguiraud · April 20, 2022, 5:15pm

Ah, it could very well be. What if you instead do:

np_signal =  np_target_prob[np_target_test == 1][:,0].copy()
np_bkg = np_target_prob[np_target_test == -1][:,0].copy()

?

Chinmay · April 21, 2022, 6:48am

Using copy solves the problem.
Thanks.
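
A plausible explanation of why the copy helps, sketched with numpy only: this assumes that the consumer of the array takes just its start address and length and ignores the strides, which the thread does not confirm for MakeNumpyDataFrame. The helper read_as_contiguous is hypothetical and only simulates such a reader.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def read_as_contiguous(arr):
    """Simulate a reader that ignores strides: take len(arr) adjacent
    elements starting at the first element of arr."""
    return as_strided(arr, shape=arr.shape, strides=(arr.itemsize,)).copy()

# Hypothetical two-column scores; column 0 plays the role of np_signal.
prob = np.array([[0.9, 0.1],
                 [0.7, 0.3],
                 [0.4, 0.6]])
view = prob[:, 0]    # strided view: 0.9, 0.7, 0.4
fixed = view.copy()  # contiguous copy with the same values

print(read_as_contiguous(view))   # values from both columns get mixed in
print(read_as_contiguous(fixed))  # matches the intended values
print(fixed.flags["C_CONTIGUOUS"])
```

Under that assumption the strided view is read as 0.9, 0.1, 0.7, i.e. interleaved with the other column, while the copy is read correctly. np.ascontiguousarray(view) is an alternative to .copy() that only copies when the input is not already contiguous.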
