Discrepancy between BDTG overtraining plots produced by TMVAClassification and the BDTG distribution on the analysis sample

Dear TMVA experts,
I'm using a BDTG in my analysis to classify two different populations of events.
I'm a bit confused by the BDTG output I get. The distribution observed with the overtraining check (run by TMVA on the test and training samples) and the distribution I observe on my analysis signal and background samples are quite different; see the attached pictures: check_MVA.pdf (17.0 KB)
overtrain_BDTG_madgraph

I would like to know if this is somehow expected and, if not, to get some hints on how to explain that effect.
Best regards,
Hugues Lattaud
PS: I'm using ROOT 6.08

Hi,

It's difficult to do a comparison when the plots use different axes and binning, but I think they look largely the same. The behaviour of the background is similar (sharply peaked at -1 with a small bump around 0.8). Likewise for the signal, except for the shape of the peak at 0.8. However, for some reason TMVA seems to cut it off at 0.85, so we don't see how that develops.

Do you handle event weights differently in the two cases? If you used explicit event weights as input to TMVA and use the same input data to generate "check_mva.pdf", they should be weighted the same way as well.

Cheers,
Kim

Hi Kim,
The event weights are the same in both cases.
What I'm a bit concerned about is the double peak in the signal region, visible in both signal and data in "check_mva.pdf". This structure does not appear in the overtraining plots. Also, the signal seems to peak around 0.5 instead of 0.8 in the overtraining plots. That's what I would like to know: should I be concerned by this?

When you say that TMVA cuts at 0.85, is that a parameter that can be tuned, or just a feature of the training of such an MVA (BDTG)?
Thanks a lot for your answers.
Cheers,
Hugues

Hi,

Indeed, for some reason I missed this on my first look. And indeed the behaviour is unexpected; I don't have a clear idea of what the cause could be.

Could you clarify whether there is a difference between the TMVA training and test sample and your analysis signal and background samples? Could it be that these two datasets have different distributions due to e.g. small sample size?

How are you producing the check_MVA plot? What are your BDTG parameters?

EDIT: Unfortunately the cutoff point for the overtraining plots is hard-coded.

Cheers,
Kim

Hi,
For test and training, events are picked randomly from the analysis dataset.
The background is estimated from data and the signal from MC. In the analysis, I estimate the shape of the BDTG output in pT bins of the event's leading photon. For example:


is the BDTG output for one of my pT bins. To produce it I follow this recipe:
Float_t NHiso_photon, Photoniso_photon, E1_3_photon, E2_2_photon, E2_5_photon,
        E5_5_photon, R9_photon, hadTowOverEm, etawidth_photon, phiwidth_photon,
        sigmaietaieta_photon;
TMVA::Reader *Access_weight = new TMVA::Reader( "!Color:!Silent" );

Access_weight->AddVariable( "R9_photon",              &R9_photon );
Access_weight->AddVariable( "hadTowOverEm",           &hadTowOverEm );
Access_weight->AddVariable( "etawidth_photon",        &etawidth_photon );
Access_weight->AddVariable( "phiwidth_photon",        &phiwidth_photon );
Access_weight->AddVariable( "sigmaietaieta_photon",   &sigmaietaieta_photon );
Access_weight->AddVariable( "NHiso_photon",           &NHiso_photon );
Access_weight->AddVariable( "E2_5_photon/E5_5_photon", &E2_5_photon );
Access_weight->AddVariable( "E2_2_photon/E5_5_photon", &E2_2_photon );
Access_weight->AddVariable( "E1_3_photon/E5_5_photon", &E1_3_photon );
Access_weight->AddVariable( "Photoniso_photon",       &Photoniso_photon );
Access_weight->BookMVA( "BDTG", "/dataset/weights/TMVAClassification_BDTG.weights.xml" );

And in the event loop :
NHiso_photon         = (Float_t) sig_var_NHiso_photon;
Photoniso_photon     = (Float_t) sig_var_Photoniso_photon;
E1_3_photon          = (Float_t)(sig_var_E1_3_photon/sig_var_E5_5_photon);
E2_2_photon          = (Float_t)(sig_var_E2_2_photon/sig_var_E5_5_photon);
E2_5_photon          = (Float_t)(sig_var_E2_5_photon/sig_var_E5_5_photon);
R9_photon            = (Float_t) sig_var_R9_photon;
hadTowOverEm         = (Float_t) sig_var_hadTowOverEm;
etawidth_photon      = (Float_t) sig_var_etawidth_photon;
phiwidth_photon      = (Float_t) sig_var_phiwidth_photon;
sigmaietaieta_photon = (Float_t) sig_var_sigmaietaieta_photon;
Double_t mvaValue = Access_weight->EvaluateMVA( "BDTG" );

I do that for the signal and then for the background. All variables used are ID variables widely used to identify photons across the CMS collaboration.
Concerning the sample size in each bin: even in the highly populated bins the distributions are shifted and show a strange behaviour; the bin shown above is one of them.

Let me know if you need more information,
Thanks for your help,
Cheers,
Hugues

Hi again Kim, all

Maybe what I've sent was confusing. Just to clarify my question:
I've produced a plot where I compare the BDTG output from TMVA.root (automatically created by the TMVAClassification executable) and the output from the TMVA::Reader class:

The two histograms have exactly the same binning and are produced on the same events (signal MC).
If I use the standalone class to produce the output, I get the same results as with the TMVA::Reader class.
My question is why the distributions are not the same.
Any help is welcome.

Thanks a lot ,
Cheers,
Hugues

Hi,

Sorry for the delayed response. Could you post the configuration string you use for the training, i.e. what you pass to factory->BookMethod(...), and the contents of the following tags from the XML weight file:

<GeneralInfo>
<Options>
<Variables>
<Spectators>
<Classes>
<Transformations>

?

This should help pin down where the difference lies. Thinking a little more about the problem, the only "obvious" difference I can think of right now is that there may be transformations applied to the input variables that are not reflected in the application code.

E.g. in the tutorial example there is a variable defined as the sum of two TTree branches, var1+var2. In application code such expressions must be computed manually.
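As a minimal sketch of the two sides (using the tutorial's var1/var2 branch names; the pointer names and loop are illustrative, not your code):

```cpp
// Training side: the DataLoader evaluates the formula itself,
// straight from the TTree branches.
dataloader->AddVariable( "var1+var2", 'F' );

// Application side: TMVA::Reader cannot evaluate formulas, it only reads
// the bound variable. The label must match the training expression exactly,
// and the value must be computed by hand in the event loop.
Float_t myvar1plus2;
reader->AddVariable( "var1+var2", &myvar1plus2 );

tree->SetBranchAddress( "var1", &var1 );
tree->SetBranchAddress( "var2", &var2 );
for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
   tree->GetEntry( i );
   myvar1plus2 = var1 + var2;   // manual evaluation of the formula
   Double_t mva = reader->EvaluateMVA( "BDTG" );
}
```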

Cheers,
Kim

Hi,
here is the configuration string:
// Boosted Decision Trees
if (Use["BDTG"]) // Gradient Boost
   factory->BookMethod( dataloader, TMVA::Types::kBDT, "BDTG",
                        "!H:!V:NTrees=500:MinNodeSize=3.0%:BoostType=Grad:Shrinkage=0.10:UseBaggedBoost:BaggedSampleFraction=0.5:nCuts=20:MaxDepth=2" );

and here are the requested tags from the XML weight file:
</GeneralInfo>

<Option name="V" modified="Yes">False</Option>
<Option name="VerbosityLevel" modified="No">Default</Option>
<Option name="VarTransform" modified="No">None</Option>
<Option name="H" modified="Yes">False</Option>
<Option name="CreateMVAPdfs" modified="No">False</Option>
<Option name="IgnoreNegWeightsInTraining" modified="No">False</Option>
<Option name="NTrees" modified="Yes">500</Option>
<Option name="MaxDepth" modified="Yes">2</Option>
<Option name="MinNodeSize" modified="Yes">3.0%</Option>
<Option name="nCuts" modified="Yes">20</Option>
<Option name="BoostType" modified="Yes">Grad</Option>
<Option name="AdaBoostR2Loss" modified="No">quadratic</Option>
<Option name="UseBaggedBoost" modified="Yes">True</Option>
<Option name="Shrinkage" modified="Yes">1.000000e-01</Option>
<Option name="AdaBoostBeta" modified="No">5.000000e-01</Option>
<Option name="UseRandomisedTrees" modified="No">False</Option>
<Option name="UseNvars" modified="No">3</Option>
<Option name="UsePoissonNvars" modified="No">True</Option>
<Option name="BaggedSampleFraction" modified="Yes">5.000000e-01</Option>
<Option name="UseYesNoLeaf" modified="No">True</Option>
<Option name="NegWeightTreatment" modified="No">pray</Option>
<Option name="Css" modified="No">1.000000e+00</Option>
<Option name="Cts_sb" modified="No">1.000000e+00</Option>
<Option name="Ctb_ss" modified="No">1.000000e+00</Option>
<Option name="Cbb" modified="No">1.000000e+00</Option>
<Option name="NodePurityLimit" modified="No">5.000000e-01</Option>
<Option name="SeparationType" modified="No">giniindex</Option>
<Option name="RegressionLossFunctionBDTG" modified="No">huber</Option>
<Option name="HuberQuantile" modified="No">7.000000e-01</Option>
<Option name="DoBoostMonitor" modified="No">False</Option>
<Option name="UseFisherCuts" modified="No">False</Option>
<Option name="MinLinCorrForFisher" modified="No">8.000000e-01</Option>
<Option name="UseExclusiveVars" modified="No">False</Option>
<Option name="DoPreselection" modified="No">False</Option>
<Option name="SigToBkgFraction" modified="No">1.000000e+00</Option>
<Option name="PruneMethod" modified="No">nopruning</Option>
<Option name="PruneStrength" modified="No">0.000000e+00</Option>
<Option name="PruningValFraction" modified="No">5.000000e-01</Option>
<Option name="SkipNormalization" modified="No">False</Option>
<Option name="nEventsMin" modified="No">0</Option>
<Option name="UseBaggedGrad" modified="No">False</Option>
<Option name="GradBaggingFraction" modified="No">5.000000e-01</Option>
<Option name="UseNTrainEvents" modified="No">0</Option>
<Option name="NNodesMax" modified="No">0</Option>
<Variable VarIndex="0" Expression="R9_photon" Label="R9_photon" Title="R9_photon" Unit="" Internal="R9_photon" Type="D" Min="1.67512193e-01" Max="1.00000000e+00"/>
<Variable VarIndex="1" Expression="hadTowOverEm" Label="hadTowOverEm" Title="hadTowOverEm" Unit="" Internal="hadTowOverEm" Type="D" Min="0.00000000e+00" Max="7.99988434e-02"/>
<Variable VarIndex="2" Expression="etawidth_photon" Label="etawidth_photon" Title="etawidth_photon" Unit="" Internal="etawidth_photon" Type="D" Min="2.09414517e-03" Max="1.13418885e-01"/>
<Variable VarIndex="3" Expression="phiwidth_photon" Label="phiwidth_photon" Title="phiwidth_photon" Unit="" Internal="phiwidth_photon" Type="D" Min="2.60956842e-03" Max="8.05389047e-01"/>
<Variable VarIndex="4" Expression="sigmaietaieta_photon" Label="sigmaietaieta_photon" Title="sigmaietaieta_photon" Unit="" Internal="sigmaietaieta_photon" Type="D" Min="3.68645269e-05" Max="1.19999303e-02"/>
<Variable VarIndex="5" Expression="NHiso_photon" Label="NHiso_photon" Title="NHiso_photon" Unit="" Internal="NHiso_photon" Type="D" Min="0.00000000e+00" Max="1.15462128e+02"/>
<Variable VarIndex="6" Expression="E2_5_photon/E5_5_photon" Label="E2_5_photon/E5_5_photon" Title="E2_5_photon/E5_5_photon" Unit="" Internal="E2_5_photon_D_E5_5_photon" Type="D" Min="7.60071158e-01" Max="1.00000012e+00"/>
<Variable VarIndex="7" Expression="E2_2_photon/E5_5_photon" Label="E2_2_photon/E5_5_photon" Title="E2_2_photon/E5_5_photon" Unit="" Internal="E2_2_photon_D_E5_5_photon" Type="D" Min="3.64616752e-01" Max="1.00000000e+00"/>
<Variable VarIndex="8" Expression="E1_3_photon/E5_5_photon" Label="E1_3_photon/E5_5_photon" Title="E1_3_photon/E5_5_photon" Unit="" Internal="E1_3_photon_D_E5_5_photon" Type="D" Min="2.23410890e-01" Max="9.85555470e-01"/>
<Variable VarIndex="9" Expression="Photoniso_photon" Label="Photoniso_photon" Title="Photoniso_photon" Unit="" Internal="Photoniso_photon" Type="D" Min="0.00000000e+00" Max="1.49973087e+01"/>
<Class Name="Signal" Index="0"/>
<Class Name="Background" Index="1"/>

<Transformations NTransformations="0"/>

Let me know if you need any other information.
Thanks for your help,
Cheers,
Hugues

How do you define these variables in your DataLoader and in your application script?

E2_5_photon/E5_5_photon,
E2_2_photon/E5_5_photon,
E1_3_photon/E5_5_photon

Cheers,
Kim

Hi,
in the DataLoader they are defined as below:
dataloader->AddVariable( "E2_5_photon/E5_5_photon", 'D' );
dataloader->AddVariable( "E2_2_photon/E5_5_photon", 'D' );
dataloader->AddVariable( "E1_3_photon/E5_5_photon", 'D' );

In my application :
Float_t E1_3_photon, E2_2_photon, E2_5_photon;

Access_weight->AddVariable( "E2_5_photon/E5_5_photon", &E2_5_photon );
Access_weight->AddVariable( "E2_2_photon/E5_5_photon", &E2_2_photon );
Access_weight->AddVariable( "E1_3_photon/E5_5_photon", &E1_3_photon );

E1_3_photon = (Float_t)(sig_var_E1_3_photon/sig_var_E5_5_photon);
E2_2_photon = (Float_t)(sig_var_E2_2_photon/sig_var_E5_5_photon);
E2_5_photon = (Float_t)(sig_var_E2_5_photon/sig_var_E5_5_photon);
Double_t mvaValue = Access_weight->EvaluateMVA( "BDTG" );

Cheers,
Hugues

Hi,

Thanks for being so forthcoming with information, and sorry for missing that you had already posted the declaration of the variables.

I can't find anything problematic with the information so far, and I also double-checked the output of TMVAClassification and TMVAClassificationApplication. There the results are consistent, so it seems to be something particular to your setup.

And, just to be clear: yes, this is concerning. Internally, TMVA uses the same evaluation path for both the training/testing evaluation and the evaluation through TMVA::Reader. As such I suspect that there is some mismatch in the input distributions of the two runs, but

  1. we have checked the most common error, feeding the wrong data to the reader, and
  2. the classifier is applied to the same data.

Could there be a mismatch in the weighting? For example, if you weight the input trees in the DataLoader but not when using the reader. (However, this should only affect the counts in the bins of the final output histogram, not the output score itself…)
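For reference, a sketch of where the weight would enter on each side (the branch name evtWeight is hypothetical; adapt it to your tree):

```cpp
// Training side: per-event weights are declared once on the DataLoader
// ("evtWeight" is a hypothetical branch/expression name).
dataloader->SetSignalWeightExpression( "evtWeight" );
dataloader->SetBackgroundWeightExpression( "evtWeight" );

// Application side: TMVA::Reader ignores weights entirely; the score
// depends only on the input variables. The weight matters only when
// filling your own histogram, i.e. it changes bin contents, not scores.
Double_t score = Access_weight->EvaluateMVA( "BDTG" );
hist->Fill( score, evtWeight );
```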

Maybe if you could provide a minimal working example that reproduces the problem that could be easier to investigate.

Cheers,
Kim

Hi Kim,
I've worked out a small example that reproduces the issue.
In the archive I've attached, you'll find two ROOT files, signal and background, and two directories, test_and_training and Application.

In signal.root the events are weighted; the branch with the weight is called evtWeightTotA. I'm not sure our problem is related to that, but I wanted to point it out in case it is useful information.

In test_and_training you'll find TMVAClassification.cpp, the executable, and the weight file in dataset/weight/; the BDTG has already been trained with all the events available in the signal and background ROOT files. To recompile it, you might have to modify the makefile depending on your distribution.
To run it, just do: ./TMVAClassification BDTG

In Application, you'll find Analyze.cpp/h and the executable, which is where the event loop is and where the reader is called. This application runs on the same ROOT files used for the test/training of the BDTG.
To run it, just do ./Analyze; it will create a ROOT file, TEST_off_all_dataset.root, inside which you'll find two histograms, Sig_all and Bkg_all, that contain the shapes we're interested in.

the archive is available from my google drive here : https://drive.google.com/open?id=14UhLxv2na6UVl4uWCLWo9bqmIeV7RU7W

Let me know if something goes wrong or if you need more information to run it.
Again thanks a lot for your patience and your help,
Cheers,
Hugues

Hi,

Many thanks for the example. After prodding at this a bit, it seems that applying the BDT through the TMVA::Reader interface to the Train and Test data of TMVA.root yields the same results as during training.

This indicates to me that the distributions of the two inputs are different, and that it is then expected that the output distributions can vary. It is up to you to decide whether this is acceptable for your given situation.
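One way to check this yourself is to compare against the scores TMVA already stored: assuming the default TMVA.root layout with a DataLoader named "dataset", the TestTree contains the input variables plus a BDTG branch with the evaluated score (a sketch, not a drop-in snippet):

```cpp
// TMVA.root keeps the evaluated classifier for every test event, so the
// stored score can be compared directly with a re-evaluation through
// TMVA::Reader on the very same events.
TFile *f        = TFile::Open( "TMVA.root" );
TTree *testTree = (TTree*)f->Get( "dataset/TestTree" );

Float_t storedBDTG;
testTree->SetBranchAddress( "BDTG", &storedBDTG );

for (Long64_t i = 0; i < testTree->GetEntries(); ++i) {
   testTree->GetEntry( i );
   // ... fill the reader's bound variables from this same entry, then:
   Double_t reEval = Access_weight->EvaluateMVA( "BDTG" );
   // storedBDTG and reEval should agree up to float precision
}
```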

Cheers,
Kim

Hi Kim,
Thanks a lot for looking into that,
I'm a bit confused: the events stored in TMVA.root after the training/testing are supposed to be events from signal.root and background.root (split between the ones used for testing and training), right?
Does that mean that TMVAClassification modifies the distribution of the input variables before feeding the BDT?
In my understanding it shouldn't do that; am I correct?
Does this mean I am doing something wrong?
cheers,
Hugues

Hi,

After some closer investigation, I think it is related to the fact that sig_var_R9_photon is never assigned a value, e.g. via sig_tree->SetBranchAddress("R9_photon", &sig_var_R9_photon);
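For completeness, a sketch of the pattern that makes sure every reader variable is actually filled each event (only the missing R9_photon hookup is shown; the other nine variables follow the same pattern):

```cpp
// Without SetBranchAddress the variable keeps whatever it was initialised
// with (or garbage), so the reader silently evaluates on a constant input.
Double_t sig_var_R9_photon = 0.;
sig_tree->SetBranchAddress( "R9_photon", &sig_var_R9_photon );

for (Long64_t i = 0; i < sig_tree->GetEntries(); ++i) {
   sig_tree->GetEntry( i );                 // fills sig_var_R9_photon
   R9_photon = (Float_t)sig_var_R9_photon;  // copy into the reader's Float_t
   // ... same for the other variables ...
   Double_t mvaValue = Access_weight->EvaluateMVA( "BDTG" );
}
```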

Cheers,
Kim

Hi Kim,
Thank you for your patience; I did not spot that. What a silly mistake.
The shapes are indeed well reproduced now that all the variables are filled properly.
Sorry to have bothered you with that.
Cheers,
Hugues