Dear MVA experts,
I’m currently trying to use CrossValidation for a classification task. However, in the output file of the tutorial code here: https://root.cern/doc/master/TMVACrossValidation_8C.html, I found that the information in the output TrainTree and TestTree is identical. When I look at the overtraining check plot from TMVAGui, the training and testing parts simply overlap with each other.
I’m wondering how to deal with this issue if I want to use the CrossValidation class.
Many thanks in advance!
I would like to second this question. I believe that the testing and training samples show up as identical in the plot because there are no separate “training” and “testing” samples on which to evaluate. But this raises the question: are there built-in or recommended ways to check for overtraining with CrossValidation within ROOT?
Also note that section 3.2 in the TMVA Users Guide provides an overview of this subject, including some discussion of performance evaluation.
If a separate, external test set is not appropriate, one can evaluate each fold separately to investigate the property you are interested in.
When using cross validation in TMVA, a separate model is trained for each fold and saved, e.g. in dataset/weights/YourModelName_Fold1.xml. These can then be used with the TMVA::Reader to get the individual fold outputs (it is beneficial here to split on some external quantity, e.g. an event ID, so that the input to each fold is easily recreated).
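To illustrate, here is a minimal sketch of booking one fold’s weight file with TMVA::Reader. The variable names (var1, var2), the method name, and the weight-file path are placeholders for whatever your training actually used; adjust them to your setup.

```cpp
// Sketch: evaluating one fold's model with TMVA::Reader.
// Assumptions: the model was trained on two variables "var1" and "var2",
// and the per-fold weight file exists at the path below.
#include "TMVA/Reader.h"

void evaluateFold() {
   float var1 = 0.f, var2 = 0.f;

   TMVA::Reader reader("!Color:!Silent");
   // Variables must be registered in the same order as during training.
   reader.AddVariable("var1", &var1);
   reader.AddVariable("var2", &var2);

   // Book the per-fold weight file written by CrossValidation.
   reader.BookMVA("BDT_Fold1", "dataset/weights/YourModelName_Fold1.xml");

   // Fill var1/var2 from your event (e.g. inside a TTree loop), then:
   double mvaValue = reader.EvaluateMVA("BDT_Fold1");
   // mvaValue is the fold-1 classifier response for this event.
}
```

If you split on an event ID, you can use the same expression offline to route each event to the fold that did not train on it, which gives you a genuine out-of-sample response per event.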
The solution I arrived at is to use
FoldFileOutput=True in the
CrossValidation constructor and then compare the training and testing distributions in each fold. Unfortunately, this suppresses some useful output due to a bug which I reported here, though it hasn’t received any attention yet; thus I currently have to train twice: once to get the full printed output and once to get the distributions for each fold.
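For reference, a sketch of how the option can be passed; everything here follows the TMVACrossValidation tutorial linked above, and the DataLoader setup, method, and file names are placeholders:

```cpp
// Sketch: requesting per-fold output files from TMVA::CrossValidation.
// Assumption: "dataloader" is already configured and the spectator
// "eventID" was added with dataloader->AddSpectator(...).
#include "TFile.h"
#include "TMVA/CrossValidation.h"

void runCV(TMVA::DataLoader *dataloader) {
   TFile *outputFile = TFile::Open("cv_output.root", "RECREATE");

   TMVA::CrossValidation cv(
      "TMVACrossValidation", dataloader, outputFile,
      "!V:!Silent:AnalysisType=Classification"
      ":NumFolds=5"
      ":FoldFileOutput=True"  // write one output file per fold
      ":SplitExpr=int(fabs([eventID]))%int([NumFolds])");

   cv.BookMethod(TMVA::Types::kBDT, "BDT", "NTrees=100");
   cv.Evaluate();

   outputFile->Close();
}
```

With FoldFileOutput=True, each per-fold file contains its own TrainTree and TestTree, so the overtraining check in TMVAGui can be run fold by fold.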
Great that you found a solution.
A fix for the problem you reported is currently under review.