K-S probability is always 1 in BDT training and testing?

Hi!

When I run training and testing with TMVACrossValidation.C, I always get a K-S probability of exactly 1, so I think I am doing something wrong. Could you check the script? I have also attached the corresponding log file.
Please let me know what I need to change.
TMVACrossValidation.C (5.2 KB)

Processing TMVACrossValidation.C...
create data set info dataset
DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree track of type Signal with 31613 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree track of type Background with 66836 events
<HEADER> Factory                  : You are running ROOT Version: 6.20/02, Mar 15, 2020
                         : 
                         : _/_/_/_/_/ _|      _|  _|      _|    _|_|   
                         :    _/      _|_|  _|_|  _|      _|  _|    _| 
                         :   _/       _|  _|  _|  _|      _|  _|_|_|_| 
                         :  _/        _|      _|    _|  _|    _|    _| 
                         : _/         _|      _|      _|      _|    _| 
                         : 
                         : ___________TMVA Version 4.2.1, Feb 5, 2015
                         : 
                         : Building event vectors for type 2 Signal
                         : Dataset[dataset] :  create input formulas for tree track
                         : Building event vectors for type 2 Background
                         : Dataset[dataset] :  create input formulas for tree track
<HEADER> DataSetFactory           : [dataset] : Number of events in input trees
                         : Dataset[dataset] :     Signal     requirement: "ppgFitchi2<100 && ppbarInvMass>1.8 && ppbarInvMass<4.5"
                         : Dataset[dataset] :     Signal          -- number of events passed: 6208   / sum of weights: 6208 
                         : Dataset[dataset] :     Signal          -- efficiency             : 0.196375
                         : Dataset[dataset] :     Background requirement: "ppgFitchi2<100 && ppbarInvMass>1.8 && ppbarInvMass<4.5"
                         : Dataset[dataset] :     Background      -- number of events passed: 4831   / sum of weights: 4831 
                         : Dataset[dataset] :     Background      -- efficiency             : 0.0722814
                         : Dataset[dataset] :  you have opted for interpreting the requested number of training/testing events
                         :  to be the number of events AFTER your preselection cuts
                         : 
                         : Dataset[dataset] :  you have opted for interpreting the requested number of training/testing events
                         :  to be the number of events AFTER your preselection cuts
                         : 
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 6207
                         : Signal     -- testing events             : 1
                         : Signal     -- training and testing events: 6208
                         : Dataset[dataset] : Signal     -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.196375
                         : Background -- training events            : 4830
                         : Background -- testing events             : 1
                         : Background -- training and testing events: 4831
                         : Dataset[dataset] : Background -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.0722814
                         : 
<HEADER> DataSetInfo              : Correlation matrix (Signal):
                         : ----------------------------------------------------------------
                         :                shwrI pCosInppc ppbarLabAng ppgFitchi2 ppmFitchi2
                         :       shwrI:  +1.000    -0.001      -0.027     +0.007     +0.017
                         :   pCosInppc:  -0.001    +1.000      -0.004     +0.008     +0.022
                         : ppbarLabAng:  -0.027    -0.004      +1.000     +0.009     -0.010
                         :  ppgFitchi2:  +0.007    +0.008      +0.009     +1.000     -0.016
                         :  ppmFitchi2:  +0.017    +0.022      -0.010     -0.016     +1.000
                         : ----------------------------------------------------------------
<HEADER> DataSetInfo              : Correlation matrix (Background):
                         : ----------------------------------------------------------------
                         :                shwrI pCosInppc ppbarLabAng ppgFitchi2 ppmFitchi2
                         :       shwrI:  +1.000    +0.099      -0.125     +0.037     +0.006
                         :   pCosInppc:  +0.099    +1.000      -0.118     +0.047     -0.025
                         : ppbarLabAng:  -0.125    -0.118      +1.000     -0.013     -0.011
                         :  ppgFitchi2:  +0.037    +0.047      -0.013     +1.000     +0.014
                         :  ppmFitchi2:  +0.006    -0.025      -0.011     +0.014     +1.000
                         : ----------------------------------------------------------------
<HEADER> DataSetFactory           : [dataset] :  
                         : 
                         : 
                         : 
                         : ========================================
                         : Processing folds for method BDTG
                         : ========================================
                         : 
<HEADER> Factory                  : Booking method: BDTG_fold1
                         : 
<HEADER> BDTG_fold1               : #events: (reweighted) sig: 2733 bkg: 2733
                         : #events: (unweighted) sig: 3033 bkg: 2433
                         : Training 1000 Decision Trees ... patience please
                         : Elapsed time for training with 5466 events: 2.4 sec         
<HEADER> BDTG_fold1               : [dataset] : Evaluation of BDTG_fold1 on training sample (5466 events)
                         : Elapsed time for evaluation of 5466 events: 0.405 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG_fold1.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG_fold1 for Classification performance
                         : 
<HEADER> BDTG_fold1               : [dataset] : Evaluation of BDTG_fold1 on testing sample (5571 events)
                         : Elapsed time for evaluation of 5571 events: 0.409 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG_fold1
                         : 
<HEADER> BDTG_fold1               : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
                         : 
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : dataset       BDTG_fold1     : 0.971
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
                         : Testing efficiency compared to training efficiency (overtraining check)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet              MVA              Signal efficiency: from test sample (from training sample) 
                         : Name:                Method:          @B=0.01             @B=0.10            @B=0.30   
                         : -------------------------------------------------------------------------------------------------------------------
                         : dataset              BDTG_fold1     : 0.846 (0.879)       0.925 (0.935)      0.968 (0.971)
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: BDTG_fold2
                         : 
<HEADER> BDTG_fold2               : #events: (reweighted) sig: 2785.5 bkg: 2785.5
                         : #events: (unweighted) sig: 3174 bkg: 2397
                         : Training 1000 Decision Trees ... patience please
                         : Elapsed time for training with 5571 events: 2.49 sec         
<HEADER> BDTG_fold2               : [dataset] : Evaluation of BDTG_fold2 on training sample (5571 events)
                         : Elapsed time for evaluation of 5571 events: 0.415 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG_fold2.class.C
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG_fold2 for Classification performance
                         : 
<HEADER> BDTG_fold2               : [dataset] : Evaluation of BDTG_fold2 on testing sample (5466 events)
                         : Elapsed time for evaluation of 5466 events: 0.409 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG_fold2
                         : 
<HEADER> BDTG_fold2               : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
                         : 
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : dataset       BDTG_fold2     : 0.969
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
                         : Testing efficiency compared to training efficiency (overtraining check)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet              MVA              Signal efficiency: from test sample (from training sample) 
                         : Name:                Method:          @B=0.01             @B=0.10            @B=0.30   
                         : -------------------------------------------------------------------------------------------------------------------
                         : dataset              BDTG_fold2     : 0.846 (0.876)       0.926 (0.937)      0.961 (0.964)
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
<HEADER> Factory                  : Booking method: BDTG
                         : 
                         : Reading weightfile: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_BDTG_fold1.weights.xml
                         : Reading weightfile: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : Reading weight file: dataset/weights/TMVACrossValidation_BDTG_fold2.weights.xml
                         : 
                         : 
                         : ========================================
                         : Folds processed for all methods, evaluating.
                         : ========================================
                         : 
<HEADER> Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
<HEADER>                          : Transformation, Variable selection : 
                         : Input : variable 'shwrI' <---> Output : variable 'shwrI'
                         : Input : variable 'pCosInppc' <---> Output : variable 'pCosInppc'
                         : Input : variable 'ppbarLabAng' <---> Output : variable 'ppbarLabAng'
                         : Input : variable 'ppgFitchi2' <---> Output : variable 'ppgFitchi2'
                         : Input : variable 'ppmFitchi2' <---> Output : variable 'ppmFitchi2'
<HEADER> TFHandler_Factory        :    Variable           Mean           RMS   [        Min           Max ]
                         : --------------------------------------------------------------------------
                         :       shwrI:       3.4209       2.0218   [       1.0000       13.000 ]
                         :   pCosInppc:   0.00029079      0.52483   [     -0.99956      0.99959 ]
                         : ppbarLabAng:       91.978       48.062   [       0.0000       179.58 ]
                         :  ppgFitchi2:       29.362       28.332   [     0.016129       99.980 ]
                         :  ppmFitchi2:       125.16       1079.5   [   1.2375e-06       9999.0 ]
                         : --------------------------------------------------------------------------
                         : Ranking input variables (method unspecific)...
<HEADER> IdTransformation         : Ranking result (top variable is best ranked)
                         : ------------------------------------
                         : Rank : Variable    : Separation
                         : ------------------------------------
                         :    1 : ppbarLabAng : 4.335e-01
                         :    2 : ppgFitchi2  : 4.179e-01
                         :    3 : shwrI       : 3.474e-02
                         :    4 : pCosInppc   : 2.653e-02
                         :    5 : ppmFitchi2  : 1.065e-03
                         : ------------------------------------
                         : Elapsed time for training with 11037 events: 2.15e-06 sec         
<HEADER> BDTG                     : [dataset] : Evaluation of BDTG on training sample (11037 events)
                         : Elapsed time for evaluation of 11037 events: 0.632 sec       
                         : Creating xml weight file: dataset/weights/TMVACrossValidation_BDTG.weights.xml
                         : Creating standalone class: dataset/weights/TMVACrossValidation_BDTG.class.C
<WARNING> <WARNING>                : MakeClassSpecificHeader not implemented for CrossValidation
<WARNING> <WARNING>                : MakeClassSpecific not implemented for CrossValidation
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: BDTG for Classification performance
                         : 
<HEADER> BDTG                     : [dataset] : Evaluation of BDTG on testing sample (11037 events)
                         : Elapsed time for evaluation of 11037 events: 0.629 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: BDTG
                         : 
<HEADER> BDTG                     : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> TFHandler_BDTG           :    Variable           Mean           RMS   [        Min           Max ]
                         : --------------------------------------------------------------------------
                         :       shwrI:       3.4209       2.0218   [       1.0000       13.000 ]
                         :   pCosInppc:   0.00029079      0.52483   [     -0.99956      0.99959 ]
                         : ppbarLabAng:       91.978       48.062   [       0.0000       179.58 ]
                         :  ppgFitchi2:       29.362       28.332   [     0.016129       99.980 ]
                         :  ppmFitchi2:       125.16       1079.5   [   1.2375e-06       9999.0 ]
                         : --------------------------------------------------------------------------
                         : 
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : dataset       BDTG           : 0.970
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
                         : Testing efficiency compared to training efficiency (overtraining check)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet              MVA              Signal efficiency: from test sample (from training sample) 
                         : Name:                Method:          @B=0.01             @B=0.10            @B=0.30   
                         : -------------------------------------------------------------------------------------------------------------------
                         : dataset              BDTG           : 0.847 (0.847)       0.925 (0.925)      0.965 (0.965)
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
<HEADER> Dataset:dataset          : Created tree 'TestTree' with 11037 events
                         : 
<HEADER> Dataset:dataset          : Created tree 'TrainTree' with 11037 events
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html
                         : Evaluation done.
Summary for method BDT
	Fold 0: ROC int: 0.970872, BkgEff@SigEff=0.3: 0.968
	Fold 1: ROC int: 0.96947, BkgEff@SigEff=0.3: 0.961
==> Wrote root file: TMVA.root
==> TMVACrossValidation is done!
(int) 0


Regards
Souvik


Please see the overtraining check plot also.

Hello @souvik,

thanks for reaching out! How are you performing the K-S test in your code?

Cheers,
Monica

Thanks for your quick response.
I followed the thread from here: here
I changed one line, setting FoldFileOutput=True in my cvOptions, and then compared the training and testing distributions for each fold. I have attached the plots for both folds.


I see that the K-S probability is no longer always equal to 1. Is your issue solved?

Yes, provided these are the actual K-S probabilities that should be checked between training and testing.
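
For anyone who finds this thread later, here is a rough sketch of why the combined output can give a K-S probability of exactly 1 while the per-fold outputs do not. In the log above, the combined BDTG method is evaluated on 11037 events for both the "training" and the "testing" sample, and the overtraining table shows identical efficiencies, e.g. 0.847 (0.847). That suggests the two samples being compared hold the same events, and two identical samples always have K-S distance 0. The code below is plain Python and does not use ROOT. The function names are my own, and the probability uses the asymptotic Kolmogorov series on unbinned values, which is an assumption and not necessarily what the TMVA plotting macro computes on its histograms.

```python
import math
import random


def ks_distance(sample_a, sample_b):
    """Two-sample K-S distance: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every entry equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max


def ks_probability(sample_a, sample_b):
    """Asymptotic K-S probability that both samples share one parent distribution."""
    n, m = len(sample_a), len(sample_b)
    lam = ks_distance(sample_a, sample_b) * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series below is numerically 1 in this range and does not converge at 0.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


# Toy classifier responses (hypothetical numbers, not taken from the log).
random.seed(1)
train = [random.gauss(0.0, 1.0) for _ in range(2000)]
test_same = list(train)                                       # same events twice
test_indep = [random.gauss(0.0, 1.0) for _ in range(2000)]    # independent sample
test_shift = [random.gauss(0.5, 1.0) for _ in range(2000)]    # shifted response

print(ks_probability(train, test_same))   # 1.0, because the K-S distance is 0
print(ks_probability(train, test_indep))  # some value between 0 and 1
print(ks_probability(train, test_shift))  # very small, the shapes clearly differ
```

With per-fold output files, each fold has a genuinely separate training and testing sample, so the comparison corresponds to the second or third case rather than the first.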

Hi, I have a question. If I divide my data into 2 folds when training and testing with TMVACrossValidation.C, I get two .xml files. Do I need to use both of these .xml files when running TMVACrossValidationApplication.C? That is, do I have to take the average of them?

You should use the file datasetcv/weights/TMVACrossValidation_BDTG.weights.xml.

Let me add @moneta to the loop; maybe he can help further.

Monica

But in my case the data is divided into two folds, so I will have two .xml files, as mentioned earlier.

Yes, the first script, TMVACrossValidation.C, should also generate the file datasetcv/weights/TMVACrossValidation_BDTG.weights.xml
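
To illustrate why no manual averaging should be needed: in the log above, booking the combined BDTG method reads both fold weight files, so the combined file appears to act as a wrapper around the per-fold models. Note also that in that log the combined file was written to dataset/weights/TMVACrossValidation_BDTG.weights.xml, so the directory depends on the dataloader name used in the script. The sketch below is a toy illustration in plain Python and not TMVA's API. All names, scores, and the event-number routing rule are hypothetical, and how the real method selects or combines fold outputs depends on the cross-validation options.

```python
from statistics import mean


def make_combined_classifier(fold_models, fold_of_event=None):
    """Wrap several per-fold models in one callable.

    fold_models   : list of callables mapping an event (dict) to a score.
    fold_of_event : optional callable returning the index of the model to use
                    for an event; if None, the fold scores are averaged.
    """
    def classify(event):
        if fold_of_event is not None:
            return fold_models[fold_of_event(event)](event)
        return mean(model(event) for model in fold_models)
    return classify


# Hypothetical stand-ins for the two trained fold models.
def fold1_model(event):
    return 0.8 if event["ppgFitchi2"] < 30 else 0.2


def fold2_model(event):
    return 0.6 if event["ppgFitchi2"] < 30 else 0.4


models = [fold1_model, fold2_model]

# Variant A: route each event to one fold model (here by event-number parity).
routed = make_combined_classifier(models, lambda ev: ev["event_id"] % 2)

# Variant B: average the responses of all fold models.
averaged = make_combined_classifier(models)

event = {"event_id": 7, "ppgFitchi2": 10.0}
print(routed(event))    # response of the second fold model
print(averaged(event))  # mean of both fold responses
```

Either way, the application script only needs to be pointed at the single combined weight file; the per-fold files are picked up through it.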

okay. Thanks.