ROOT.RDF.MakeNumpyDataFrame producing wrong histo


ROOT Version: 6.26.0
Platform: Ubuntu 20.04
Compiler: gcc 10.3.0


Hi. I am using PyROOT to export the data in my .root files to numpy arrays, run ML functions from scikit-learn on them, and then plot histograms of the resulting numpy arrays by converting them back to RDataFrames with MakeNumpyDataFrame. However, the histograms differ depending on whether they are plotted with matplotlib.pyplot (which seems to give the right ones, judging by the numbers) or with RDataFrame.Histo1D().
I am attaching the PyROOT script here.
rf_script.py (10.4 KB)
I don’t know what is wrong.

Hi @Chinmay ,
I cannot tell what is wrong from your description. Do value statistics (e.g. mean and standard deviation) of these arrays differ if you evaluate them with RDataFrame+MakeNumpyDataFrame or directly with numpy? Does df.Display().Print() show values that are different from what print(array) shows?

Cheers,
Enrico

Actually, the script above requires 2 data files.

  1. The script reads in data from .root files in the form of an RDataFrame
  2. Applies some filters and some defines
  3. Exports the defined columns to numpy arrays using RDataFrame.AsNumpy
  4. Trains a RandomForestClassifier using scikit-learn with the above numpy arrays as input
  5. To evaluate the performance of the trained model I inspect the classification error etc., which shows that the trained model is good.
  6. To further confirm the performance, I plot histograms of the ‘classification scores’ of signal events and background events, using the numpy score arrays returned by scikit-learn. In one case I plot the histogram with matplotlib.pyplot; in the other I convert the score arrays to an RDataFrame using MakeNumpyDataFrame and then use the RDataFrame.Histo1D method (see lines 153 to 183 of the script).
    The 2 histograms generated with these 2 methods are different.
    The one generated with pyplot seems right.
    I am attaching the two plots here.
    RFM1_root.pdf (22.2 KB)
    RFM1_pyplot.pdf (6.5 KB)

The orange histogram in the pyplot-generated plot corresponds to the red one in the ROOT-generated file.

I checked this, and the values do indeed differ.

Alright, can you please provide a minimal reproducer that I can debug? E.g. some numpy arrays in a .npz file that have different values when read into RDF?

Cheers,
Enrico
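
A reproducer of the kind requested above could be packaged with NumPy's np.savez, which writes several named arrays into a single .npz archive. The sketch below is only an illustration: the array values and the file name reproducer.npz are made up, and it stores the inputs to the indexing expression rather than its result, because arrays returned by np.load are freshly allocated and no longer carry the memory layout of the originals.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the arrays in the script (not the real data).
np_target_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
np_target_test = np.array([1, -1, 1, -1])

# Hypothetical file name; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "reproducer.npz")
np.savez(path, np_target_prob=np_target_prob, np_target_test=np_target_test)

# Whoever receives the file can load the arrays back by name.
with np.load(path) as archive:
    restored_prob = archive["np_target_prob"]
    restored_test = archive["np_target_test"]

# Re-apply the indexing after loading, so the layout matches the script.
np_signal = restored_prob[restored_test == 1][:, 0]
print(np_signal)  # [0.9 0.7]
```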

I can give you the data files on which you can run the script above. They total 30 MB. Where can I share them?
I think the problem has something to do with these statements:

np_signal = np_target_prob[np_target_test == 1][:,0]
np_bkg = np_target_prob[np_target_test == -1][:,0]

The right-hand sides of the statements above probably return only ‘view’ objects of numpy arrays. I think that when I use MakeNumpyDataFrame on these ‘view’ objects, it does not work as expected.
The same script can also be used to train RandomForestRegression, following the same steps as described in my 2nd post, and I am not facing this problem in that case.
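
The ‘view’ suspicion can be checked with numpy alone. In the sketch below (made-up values, not the real scores) the boolean mask produces a new 2-D array, and the [:,0] slice is then a strided view into it, so consecutive elements are not adjacent in memory. The last lines show what a consumer that ignored the strides would read from the same buffer; that this is what happened inside MakeNumpyDataFrame is an assumption, not something established in this thread.

```python
import numpy as np

# Made-up stand-ins for the arrays in the script (not the real data).
np_target_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
np_target_test = np.array([1, -1, 1, -1])

selected = np_target_prob[np_target_test == 1]  # boolean mask: a new 2-D array
np_signal = selected[:, 0]                      # column slice: a view into `selected`

print(np_signal)                            # [0.9 0.7]
print(np.shares_memory(np_signal, selected))  # True
print(np_signal.flags["C_CONTIGUOUS"])      # False
print(np_signal.strides)                    # (16,) for float64: every other value

# What reading the first len(np_signal) values straight from the buffer gives
# (illustrative assumption about a stride-unaware reader).
misread = selected.ravel()[: np_signal.size]
print(misread)                              # [0.9 0.1]
```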

Ah, it could very well be. What if you instead do:

np_signal = np_target_prob[np_target_test == 1][:,0].copy()
np_bkg = np_target_prob[np_target_test == -1][:,0].copy()

?

Using copy solves the problem.
Thanks.
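
For anyone hitting the same issue, the fix can be wrapped in a small helper. This is a minimal sketch with made-up values; contiguous_column is a hypothetical name, not part of the original script, and np.ascontiguousarray is used as an alternative to .copy() that only copies when the input is not already contiguous.

```python
import numpy as np


def contiguous_column(prob, labels, label, column=0):
    """Select the rows where labels == label and return one column
    as a C-contiguous 1-D array (hypothetical helper)."""
    return np.ascontiguousarray(prob[labels == label][:, column])


# Made-up stand-ins for the arrays in the script (not the real data).
np_target_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
np_target_test = np.array([1, -1, 1, -1])

np_signal = contiguous_column(np_target_prob, np_target_test, 1)
np_bkg = contiguous_column(np_target_prob, np_target_test, -1)

print(np_signal, np_signal.flags["C_CONTIGUOUS"])  # [0.9 0.7] True
print(np_bkg, np_bkg.flags["C_CONTIGUOUS"])        # [0.2 0.4] True

# With ROOT installed, the arrays could then be handed over as in the thread,
# e.g. (the column name "score" is an assumption):
# df = ROOT.RDF.MakeNumpyDataFrame({"score": np_signal})
```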
