BDT output from xml file does not correspond to histogram from TMVA.root

David_Vannerom · June 10, 2021, 1:25pm

Dear all,

I want to retrieve the classifier value of a trained BDT over a signal data set. To do this, I use pyROOT the following way:

ROOT.TMVA.Tools.Instance()
reader = ROOT.TMVA.Reader("!Color:!Silent")
cosTheta = array.array('f', [0])
reader.AddVariable("cosTheta", cosTheta)
# ... add other variables

reader.BookMVA("BDT method", "dataset/weights/TMVAClassification_BDT_general.weights.xml")

for i in range(len(events)):
    cosTheta[0] = np.cos(Theta[i])
    # ... fill other variables
    bdtOutput = reader.EvaluateMVA("BDT method")

This works fine in the sense that it returns a value for the BDT classifier, but the histogram that I then get looks nothing like what was produced when making the xml file (i.e. in the TMVA GUI). I produced the xml file by running tutorials/tmva/TMVAClassification.C, using half the signal sample to train, the other half to test. I am now trying to get the BDT output for all events in this very same signal sample. I was expecting the result to look exactly the same but it’s completely different, in particular it’s not a smooth distribution: it has several peaks.
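One thing worth ruling out with this pattern: `AddVariable` is handed the `array.array` buffer itself, so the loop has to write into that same buffer (`cosTheta[0] = ...`) rather than rebind the name to a new object. The sketch below uses only the standard library to show the difference. `FakeReader` is a hypothetical stand-in written for this illustration, not part of ROOT.

```python
import array


class FakeReader:
    """Hypothetical stand-in that keeps a reference to the buffer it was
    given instead of copying the value (as a pointer-based reader would)."""

    def __init__(self):
        self._buffers = {}

    def add_variable(self, name, buf):
        self._buffers[name] = buf  # keep the buffer object itself

    def current_value(self, name):
        return self._buffers[name][0]


reader = FakeReader()
cos_theta = array.array('f', [0.0])
reader.add_variable("cosTheta", cos_theta)

# Correct: write into the registered buffer; the reader sees the new value.
cos_theta[0] = 0.5
seen_after_inplace = reader.current_value("cosTheta")

# Pitfall: rebinding the name creates a new buffer the reader never sees.
cos_theta = array.array('f', [-0.25])
seen_after_rebind = reader.current_value("cosTheta")

print(seen_after_inplace, seen_after_rebind)
```

The loop in the post already writes in place, so this is a sanity check rather than a diagnosis.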

Thank you for the help!

ROOT Version: 6.24/00
Built for linuxx8664gcc on May 21 2021, 23:47:00
From heads/latest-stable@v6-24-00-1-ge6a04a86cb

David_Vannerom · June 11, 2021, 2:54pm

Any idea? Anyone? My problem is that I don’t get the same distribution of the bdt output when I compare what I get from the TMVA GUI when training/testing the BDT, and when I apply it on the same data set using the xml file.

moneta · June 11, 2021, 3:12pm

Hi,

This is strange. Are you using the same data sets with the Reader as for the output produced during training and examined with the GUI?
If you can add some macros and a data file showing this problem, that would be helpful too.

Cheers

Lorenzo

David_Vannerom · June 11, 2021, 3:46pm

So for instance, I tried with only 2 variables named cosTheta and conf15. The distribution of the BDT output on the train tree and the test tree look very similar. I paste here the test tree to give you an idea. Now, when I add the bdtOutput to my data tree from reading the xml file using the reader, I only ever get two values, and I therefore get this other histogram.

The way I retrieve the bdtOutput is as described above, with the recommended method using the reader. These are the same variables that were used to make the xml file, in the same order, etc.
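A quick way to double-check the "same variables, same order" claim is to read the variable list out of the weights file and compare it with the names passed to `AddVariable`. This is a minimal sketch under an assumption about the file layout: a `<Variables>` block containing `<Variable>` elements with `VarIndex` and `Expression` attributes. The embedded XML is a hypothetical, heavily trimmed fragment, not the actual weights file from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment standing in for a weights file.
# Assumption: variables are listed as <Variable> elements carrying
# "VarIndex" and "Expression" attributes inside a <Variables> block.
WEIGHTS_XML = """<MethodSetup Method="BDT::BDT">
  <Variables NVar="2">
    <Variable VarIndex="0" Expression="cosTheta" Type="F"/>
    <Variable VarIndex="1" Expression="conf_15m" Type="F"/>
  </Variables>
</MethodSetup>"""


def training_variables(xml_text):
    """Return the variable expressions in training order."""
    root = ET.fromstring(xml_text)
    variables = root.findall("./Variables/Variable")
    variables.sort(key=lambda v: int(v.get("VarIndex")))
    return [v.get("Expression") for v in variables]


# Names in the order they are passed to reader.AddVariable(...)
reader_variables = ["cosTheta", "conf_15m"]

expected = training_variables(WEIGHTS_XML)
print("weights file:", expected)
print("match:", expected == reader_variables)
```

For a real file, the same function could be fed the text of the `.weights.xml` file instead of the inline string.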

David_Vannerom · June 14, 2021, 9:37am

No-one has ever seen such a problem before?

moneta · June 14, 2021, 10:19am

Hi,
Something has gone wrong, but it is difficult to say what without looking at your code and your input file.
Can you please post them? You could send them to me privately in case you cannot share them publicly.

Lorenzo

David_Vannerom · June 14, 2021, 1:28pm

Hello Lorenzo,

Thanks for your response. My events are stored in a hdf5 file. So I use pyROOT to loop over all events in the file, recover the bdtOutput and add it to the file as an extra dictionary:

import h5py
import matplotlib.pyplot as plt
import numpy as np
from modules.plot import *
from modules.utils import *
from modules.DecayWidths import FullWidth
import math
from tqdm import tqdm
import sys, argparse
import ROOT
import array

# Define input arguments
parser = argparse.ArgumentParser(description='Plot the selected quantity for the selected particle')
parser.add_argument('-i', '--inputfile', type=str, default='190606_L7_995.h5', help='Name of the file')
args = parser.parse_args()

inputfile = args.inputfile

f = h5py.File(inputfile,'a')

ROOT.TMVA.Tools.Instance()
# Create the Reader object
reader = ROOT.TMVA.Reader( "!Color:!Silent" )
# Create a set of variables and declare them to the reader
# - the variable names MUST correspond in name and type to those given in the weight file(s) used
cosTheta = array.array('f',[0])
conf_15m = array.array('f',[0])

reader.AddVariable("cosTheta", cosTheta)
reader.AddVariable("conf_15m", conf_15m)

# Book the MVA methods
reader.BookMVA("BDT method","dataset/weights/TMVAClassification_BDT_cosTheta_conf15.weights.xml")

### Conditions
# Condition basic from MVA
cdt_basic = (f['cascade0TaupedeDict']['Energy'] > 0) & (f['cascade1TaupedeDict']['Energy'] > 0) & (f['millipedeDict']['E_cascade0_15m'] > 0) & (f['millipedeDict']['E_cascade1_15m'] > 0) & (f['cascade0TaupedeDict']['Energy'] < 1000) & (f['cascade1TaupedeDict']['Energy'] < 1000) & (f['millipedeDict']['E_tot'] < 5000) & (f['taupedeDict']['chiSquared']/f['taupedeDict']['chiSquared_dof'] < 200) & (f['taupedeDict']['bestFitLength'] > 0) & (f['taupedeDict']['bestFitLength'] < 800)
# Cascade not nan
cdt_nan = np.logical_not(np.isnan(f['cascade0Dict']['Energy'])) & np.logical_not(np.isnan(f['cascade1Dict']['Energy']))
# This is "no condition" (always True)
cdt0 = np.ones(len(f['neutrinoDict']['Energy']), dtype=bool)
# Final condition
cdt = cdt_nan & cdt_basic

# Get variables from file
f_Zenith = f['cascade0TaupedeDict']['Zenith'][cdt]
f_E0_millipede_15m = f['millipedeDict']['E_cascade0_15m'][cdt]
f_E1_millipede_15m = f['millipedeDict']['E_cascade1_15m'][cdt]
f_Etot_millipede = f['millipedeDict']['E_tot'][cdt]

# Declare new container for the MVA output
bdtOutput = []

# Loop through events in h5 file
for i in range(len(f['cascade0TaupedeDict']['Energy'][cdt])):
    cosTheta[0] = np.cos(f_Zenith[i])
    conf_15m[0] = (f_E0_millipede_15m[i]+f_E1_millipede_15m[i])/f_Etot_millipede[i]

    print(cosTheta[0],conf_15m[0])
    print(reader.EvaluateMVA("BDT method"))
    bdtOutput.append(reader.EvaluateMVA("BDT method"))

# Remove a previous result, if any, before writing the new one
if 'bdtOutput' in f:
    del f['bdtOutput']
f.create_dataset('bdtOutput', data=bdtOutput)
f.close()
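The selection in this script bounds `E_tot` only from above (`< 5000`), so an event with `E_tot` equal to zero would pass the cuts and give a non-finite `conf_15m`. Whether that happens in this file is not known. The sketch below, with made-up toy numbers and a hypothetical helper `build_inputs`, shows one way to flag such events before they reach the reader.

```python
import math


def build_inputs(zenith, e0, e1, e_tot):
    """Compute (cosTheta, conf_15m) per event.

    Returns the finite rows plus the indices of events whose inputs
    are not finite, so they can be inspected separately.
    """
    rows, bad = [], []
    for i, (z, a, b, t) in enumerate(zip(zenith, e0, e1, e_tot)):
        cos_theta = math.cos(z)
        # Guard the division: plain Python raises on division by zero.
        conf = (a + b) / t if t != 0 else float("inf")
        if math.isfinite(cos_theta) and math.isfinite(conf):
            rows.append((cos_theta, conf))
        else:
            bad.append(i)
    return rows, bad


# Toy values for illustration only (not taken from the h5 file).
rows, bad = build_inputs(
    zenith=[0.0, math.pi, 1.0],
    e0=[10.0, 5.0, 1.0],
    e1=[10.0, 5.0, 1.0],
    e_tot=[20.0, 40.0, 0.0],
)
print("usable events:", rows)
print("events with non-finite inputs:", bad)
```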

David_Vannerom · June 14, 2021, 1:32pm

Here is the xml file (in txt format just for the upload). Unfortunately, I cannot upload the hdf5 file containing the data here because the forum does not accept this format.
TMVAClassification_BDT_cosTheta_conf15.weights.txt (1.6 MB)

David_Vannerom · June 15, 2021, 8:32am

I can also show you the output of the code snippet I pasted two comments ago:

-0.46815961599349976 1.0
-0.008333333333333333
-0.24187009036540985 0.9829396605491638
-0.013888888888888888
-0.9035863876342773 1.0
-0.008333333333333333
0.12428360432386398 0.6378335952758789
-0.013888888888888888
-0.7843955159187317 0.1967046558856964

The two values on the same row are cosTheta and conf_15m, and the next line gives the corresponding bdtOutput. You can see that there are only ever two different values for the bdtOutput, and that it seems to depend on whether conf_15m is equal to 1.0 or not. I looked at my code and compared it against the few examples I’ve found online and it really seems correct. I correctly recover the cosTheta and conf_15m values, but somehow the bdtOutput does not vary correctly as a function of these two variables.
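The observation can be checked mechanically by grouping the events by their output value. The sketch below does this for the four complete events in the printout above. It only restates what the printout shows and does not explain why the output collapses to two values.

```python
from collections import defaultdict

# (cosTheta, conf_15m, bdtOutput) triples copied from the printout above.
events = [
    (-0.46815961599349976, 1.0, -0.008333333333333333),
    (-0.24187009036540985, 0.9829396605491638, -0.013888888888888888),
    (-0.9035863876342773, 1.0, -0.008333333333333333),
    (0.12428360432386398, 0.6378335952758789, -0.013888888888888888),
]

# Collect the conf_15m values seen for each distinct BDT output.
by_output = defaultdict(list)
for cos_theta, conf, bdt in events:
    by_output[bdt].append(conf)

for bdt, confs in sorted(by_output.items()):
    print(f"bdtOutput={bdt:.6f}  n={len(confs)}  conf_15m={confs}")

distinct_outputs = sorted(by_output)
```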

moneta · June 15, 2021, 10:25am

Hi,
You can post a link to the hdf5 file if you cannot attach it, for example using CERNBox or sharebox.
From what I see, it looks like the BDT may not be well trained.
I would need to see, and be able to run, the macro used for training the BDT, together with the input training data.

Lorenzo

moneta · June 15, 2021, 10:42am

Hi,
I have tested using the new RReader interface, which will be easier to use from Python, see
https://root.cern/doc/master/classTMVA_1_1Experimental_1_1RReader.html

and I am getting different values. Here is the code example:

#include <iostream>
#include "TMVA/RReader.hxx"

using namespace TMVA::Experimental;

void example_RReader() {

   // Load the trained method from the weights file
   RReader model("TMVAClassification_BDT_cosTheta_conf15.weights.xml");

   // Names of the input variables stored in the weights file
   auto variables = model.GetVariableNames();

   // Evaluate single events given as {cosTheta, conf_15m}
   auto prediction = model.Compute({-0.46815961599349976, 1.0});
   std::cout << "Single-event inference: " << prediction[0] << "\n\n";

   prediction = model.Compute({0.24187009036540985, 0.9829396605491638});
   std::cout << "Single-event inference: " << prediction[0] << "\n\n";

   prediction = model.Compute({0.12428360432386398, 0.6378335952758789});
   std::cout << "Single-event inference: " << prediction[0] << "\n\n";
}

David_Vannerom · June 15, 2021, 10:53am

Hello,

Here are the links to the TMVA macro and the h5 data file:

https://cernbox.cern.ch/index.php/s/rTd9rJ9HYvZ2RSi
https://cernbox.cern.ch/index.php/s/g51UmS6CEjowCTC

The RReader solution is interesting, how should I use this in pyROOT? You’re saying it’s easier than the standard Reader?

moneta · June 15, 2021, 11:59am

Hi,
Thanks for the file. Here is a Python example of the code above:

import ROOT
import numpy as np

model = ROOT.TMVA.Experimental.RReader("TMVAClassification_BDT_cosTheta_conf15.weights.xml")

x = np.array([[-0.46815961599349976, 1.0 ],
              [0.24187009036540985, 0.9829396605491638],
              [0.12428360432386398, 0.6378335952758789]],
             dtype='float32')
shape = np.array([3,2])
input = ROOT.TMVA.Experimental.RTensor('float')(x,shape)

prediction = model.Compute(input)
print (prediction)

There is also this tutorial, in C++: tutorials/tmva/tmva003_RReader.C

Lorenzo

David_Vannerom · June 15, 2021, 9:04pm

Hello Lorenzo,

I’ve tried the other method but got the exact same result: I only ever get two values for the bdt output, whatever the loaded variables. Have you had any luck on your side?

Thanks a lot for your help!
David

moneta · June 16, 2021, 9:47am

Hi David,

But what is the output if you run the code I have posted above?

David_Vannerom · June 16, 2021, 11:59am

Hello Lorenzo,

This is the output with your code:

{ 0.0692261, -0.102214, -0.105236 }

But when I use it with the data from my h5 file, I get a segmentation fault, and then it prints a large vector with all the outputs that, again, are only these two numbers I ever get.

moneta · June 16, 2021, 12:53pm

Hello,
Those are exactly the same values I get too. So the question is why you are getting different values {-0.008333333333333333, -0.013888888888888888, -0.013888888888888888} when using the old TMVA::Reader.
I will also check using the Reader.

Lorenzo

moneta · June 17, 2021, 4:00pm

Hi,
I have also tested using the Reader class, with this code:

import ROOT
import numpy as np
import array

x = np.array([[-0.46815961599349976, 1.0 ],
              [0.24187009036540985, 0.9829396605491638],
              [0.12428360432386398, 0.6378335952758789]],
             dtype='float32')

cosTheta = array.array('f',[0])
conf_15m = array.array('f',[0])

reader = ROOT.TMVA.Reader( "!Color:!Silent" )

reader.AddVariable("cosTheta", cosTheta)
reader.AddVariable("conf_15m", conf_15m)

methodName = "BDT method"
weightfile = "TMVAClassification_BDT_cosTheta_conf15.weights.xml"

reader.BookMVA( methodName, weightfile )


for i in range(0,3):
    cosTheta[0] = x[i,0]
    conf_15m[0] = x[i,1]
    print(cosTheta[0],conf_15m[0])
    val = reader.EvaluateMVA( "BDT method")
    print (val)

and I am getting the same values as before:

(-0.46815961599349976, 1.0)
                         : Rebuilding Dataset Default
0.0692261175229
(0.24187009036540985, 0.9829396605491638)
-0.102214014319
(0.12428360432386398, 0.6378335952758789)
-0.105235559162

so I don’t see any problem.

Lorenzo
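To make the comparison in this thread explicit, the numbers quoted in the posts can be checked against each other with a tolerance. This is a small sketch using only values copied from the posts. The tolerance `rel_tol=1e-5` is an arbitrary choice that allows for the rounding in the printed RReader output.

```python
import math

# Standalone tests with the same weights file, as quoted in the posts.
rreader_values = [0.0692261, -0.102214, -0.105236]
reader_values = [0.0692261175229, -0.102214014319, -0.105235559162]

# Same-input pairs: (standalone Reader value, value from the h5 loop).
# The middle test point is left out because its cosTheta sign differs
# between the two posts.
same_input_pairs = [
    (0.0692261175229, -0.008333333333333333),  # (-0.468..., 1.0)
    (-0.105235559162, -0.013888888888888888),  # (0.124..., 0.637...)
]

readers_agree = all(
    math.isclose(a, b, rel_tol=1e-5)
    for a, b in zip(rreader_values, reader_values)
)
loop_reproduced = all(
    math.isclose(standalone, loop, rel_tol=1e-5)
    for standalone, loop in same_input_pairs
)

print("Reader and RReader agree:", readers_agree)
print("h5 loop reproduces the standalone values:", loop_reproduced)
```

With these numbers the two readers agree with each other, while the values from the h5 loop differ for the same inputs, which is the open question at the end of the thread.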
