Dear MVA experts,
I’m currently trying to use CrossValidation for a classification task. However, in the output file of the tutorial code here: https://root.cern/doc/master/TMVACrossValidation_8C.html, I found that the information in the output TrainTree and TestTree is identical. When I look at the overtraining check plot from TMVAGui, the training and testing parts simply overlap with each other.
I’m wondering how to deal with this issue if I want to use the CrossValidation class.
Many thanks in advance!
I would like to second this question. I believe that the testing and training samples show up as identical in the plot because there are no separate “training” and “testing” samples on which to evaluate. But this raises the question: are there built-in or recommended ways to check for overtraining with CrossValidation within ROOT?
Also note that section 3.2 in the TMVA Users Guide provides an overview of this subject, including some discussion of performance evaluation.
If a separate, external test set is not appropriate, one can evaluate each fold separately to investigate the property you are interested in.
When using cross validation in TMVA, a separate model is trained for each fold and saved, e.g. in dataset/weights/YourModelName_Fold1.xml. These can then be used with the TMVA::Reader to get the individual fold outputs (it is beneficial here to split on some external quantity, e.g. an event ID, so that the input to each fold is easily recreated).
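To illustrate, here is a minimal sketch of booking one fold’s weight file with TMVA::Reader. The variable names (var1, var2), the method name, and the weight-file path are placeholders for whatever your training actually used; adjust them to your setup.

```cpp
// Sketch: evaluating one fold's model with TMVA::Reader.
// Assumptions: the model was trained on two variables "var1" and "var2",
// and the per-fold weight file exists at the path below.
#include "TMVA/Reader.h"

void evaluateFold() {
   float var1 = 0.f, var2 = 0.f;

   TMVA::Reader reader("!Color:!Silent");
   // Variables must be registered in the same order as during training.
   reader.AddVariable("var1", &var1);
   reader.AddVariable("var2", &var2);

   // Book the per-fold weight file written by CrossValidation.
   reader.BookMVA("BDT_Fold1", "dataset/weights/YourModelName_Fold1.xml");

   // Fill var1/var2 from your event (e.g. inside a TTree loop), then:
   double mvaValue = reader.EvaluateMVA("BDT_Fold1");
   // mvaValue is the fold-1 classifier response for this event.
}
```

If you split on an event ID, you can use the same expression offline to route each event to the fold that did not train on it, which gives you a genuine out-of-sample response per event.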
The solution I arrived at is to use
FoldFileOutput=True in the
CrossValidation constructor and then compare the training and testing distributions in each fold. Unfortunately, this suppresses some useful output due to a bug which I reported here, though it hasn’t received any attention yet; thus I currently have to train twice: once to get the full printed output and once to get the distributions for each fold.
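For reference, a sketch of how the option can be passed; everything here follows the TMVACrossValidation tutorial linked above, and the DataLoader setup, method, and file names are placeholders:

```cpp
// Sketch: requesting per-fold output files from TMVA::CrossValidation.
// Assumption: "dataloader" is already configured and the spectator
// "eventID" was added with dataloader->AddSpectator(...).
#include "TFile.h"
#include "TMVA/CrossValidation.h"

void runCV(TMVA::DataLoader *dataloader) {
   TFile *outputFile = TFile::Open("cv_output.root", "RECREATE");

   TMVA::CrossValidation cv(
      "TMVACrossValidation", dataloader, outputFile,
      "!V:!Silent:AnalysisType=Classification"
      ":NumFolds=5"
      ":FoldFileOutput=True"  // write one output file per fold
      ":SplitExpr=int(fabs([eventID]))%int([NumFolds])");

   cv.BookMethod(TMVA::Types::kBDT, "BDT", "NTrees=100");
   cv.Evaluate();

   outputFile->Close();
}
```

With FoldFileOutput=True, each per-fold file contains its own TrainTree and TestTree, so the overtraining check in TMVAGui can be run fold by fold.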
Great that you found a solution.
A fix for the problem you reported is currently under review.