Hi. I am using PyROOT to read my .root data into numpy arrays, apply ML functions from scikit-learn,
and then plot histograms of the resulting numpy arrays by converting them back to
RDataFrames with MakeNumpyDataFrame. However, the histograms come out different
when plotted with matplotlib.pyplot (these seem to be the right ones, considering all the numbers)
and when plotted with RDataFrame.Histo1D().
I am attaching the PyROOT script here. rf_script.py (10.4 KB)
I don’t know what is wrong.
Hi @Chinmay ,
I cannot tell what is wrong from your description. Do value statistics (e.g. mean and standard deviation) of these arrays differ if you evaluate them with RDataFrame+MakeNumpyDataFrame or directly with numpy? Does df.Display().Print() show values that are different from what print(array) shows?
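A minimal sketch of the check suggested above, assuming the scores are available as a 1-D numpy array. The helper names numpy_stats and rdf_stats and the column name "score" are made up for illustration and are not taken from rf_script.py. RDataFrame's StdDev is documented as the unbiased estimate, so the numpy side uses ddof=1 to compare like with like. ROOT is imported inside rdf_stats so that the numpy part runs without a ROOT installation.

```python
import numpy as np


def numpy_stats(arr):
    """Mean and sample standard deviation computed directly with numpy."""
    arr = np.asarray(arr, dtype=np.float64)
    return float(arr.mean()), float(arr.std(ddof=1))


def rdf_stats(arr, column="score"):
    """Same statistics after a round trip through RDataFrame (needs ROOT)."""
    import ROOT  # imported here so numpy_stats works without ROOT installed

    # The array is passed as-is on purpose: copying it first could hide
    # exactly the discrepancy this check is meant to expose.
    data = {column: arr}
    df = ROOT.RDF.MakeNumpyDataFrame(data)
    mean = df.Mean(column)
    std = df.StdDev(column)
    # df.Display().Print() could be called here to compare values by eye.
    return mean.GetValue(), std.GetValue()
```

If numpy_stats(scores) and rdf_stats(scores) disagree for the same array, the values are already different inside the RDataFrame, before any histogram binning comes into play.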
1. The script reads in data from .root files in the form of an RDataFrame.
2. It applies some filters and some defines.
3. It exports the defined columns to numpy arrays using RDataFrame.AsNumpy.
4. It trains the RandomForestClassifier from scikit-learn with the above numpy arrays as input.
5. To evaluate the performance of the trained model, I inspect the classification error etc., which shows that
the trained model is good.
To further confirm the performance, I am plotting histograms of the ‘classification scores’ of signal events and background events. For this I am using the numpy score arrays returned by scikit-learn. In one case I plot the histogram using matplotlib.pyplot, and in the other I convert the score arrays
to an RDataFrame using MakeNumpyDataFrame and then use the RDataFrame.Histo1D method. (Check the script from line number 153 to 183.)
The two histograms generated with these two methods are different.
The one generated with pyplot seems right.
I am attaching the two plots here. RFM1_root.pdf (22.2 KB) RFM1_pyplot.pdf (6.5 KB)
The orange histogram in the pyplot-generated plot corresponds to the red histogram in the ROOT-generated file.
The right-hand sides of the above statements probably return only ‘view’ objects of numpy arrays. I think that when I use MakeNumpyDataFrame on these ‘view’ objects, it does not work as expected.
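A small numpy-only sketch of how such a ‘view’ can be recognised and replaced by a contiguous copy. It assumes the statements in question take one column out of a 2-D score array, as one would with the output of predict_proba. That is an assumption, and the array proba below is made-up stand-in data. Whether MakeNumpyDataFrame really ignores the strides of a view is likewise an assumption here; the naive variable only illustrates what any consumer would read if it did.

```python
import numpy as np


def describe(arr):
    """Report whether an array is a view and how its memory is laid out."""
    return {
        "is_view": arr.base is not None,
        "c_contiguous": bool(arr.flags["C_CONTIGUOUS"]),
        "strides": arr.strides,
    }


# Made-up stand-in for an (n_events, n_classes) score matrix.
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

# Column slice: a strided view into proba, not a separate buffer.
signal_view = proba[:, 1]

# Independent buffer holding the same values back to back in memory.
signal_copy = np.ascontiguousarray(signal_view)

# What a reader would get if it started at the view's first element and
# walked through memory while ignoring the strides (hypothetical behaviour).
naive = proba.ravel()[1:1 + len(signal_view)]

print(describe(signal_view))
print(describe(signal_copy))
print(signal_view, naive)
```

If describe() reports c_contiguous as False for a score array, passing np.ascontiguousarray(scores) to MakeNumpyDataFrame instead is a cheap thing to try, and comparing the resulting histogram with the pyplot one would show whether the view was the cause.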
The same script can be used to train a RandomForestRegressor, which follows the same steps as described in my 2nd post. I am not facing this problem in that case.