My question is simple: is the validation sample that PyKeras uses to compute val_loss (val_accuracy) at each epoch (in order to save only the best model and/or for early stopping) the test sample defined in the TMVA::DataLoader, or is it a subsample of the training sample?
I don’t understand whether this sample is selected by TMVA or by Keras, and by which criteria.
Yes, the Keras interface uses the TMVA test set as the Keras validation set.
Do note that the final ROC score output by TMVA at the end of EvaluateAllMethods also uses the test data for evaluating performance. This should be fine if your model complexity is low (a small or heavily regularised model). Since the same data has already been used for early stopping and best-model selection, you’d have to evaluate on separate data to get unbiased estimates of performance.
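To make the "separate data" point concrete, here is a minimal sketch in plain Python (not TMVA code) of a three-way split of event indices into disjoint training, validation and hold-out samples. The function name `three_way_split`, the percentages and the seed are all illustrative assumptions; how you actually set aside the hold-out events depends on how you fill your DataLoader.

```python
import random


def three_way_split(n_events, pct_train=60, pct_val=20, seed=42):
    """Return disjoint index lists (train, validation, holdout).

    Hypothetical helper for illustration only; percentages are integers
    so that the sample sizes are computed without floating-point rounding.
    """
    indices = list(range(n_events))
    # Shuffle with a private, seeded generator so the split is reproducible.
    random.Random(seed).shuffle(indices)

    n_train = n_events * pct_train // 100
    n_val = n_events * pct_val // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # Everything left over is never seen during training or model selection.
    holdout = indices[n_train + n_val:]
    return train, val, holdout


train, val, holdout = three_way_split(1000)
print(len(train), len(val), len(holdout))
```

Only the hold-out sample is untouched by both the weight updates and the early-stopping decision, so it is the one to use for a final performance number.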
Is it possible that this has changed? I stumbled upon this line in the training output when using TMVA with Keras:
Split TMVA training data in 5865462 training events and 1466365 validation events
So it seems that the TMVA training data is split into training and validation samples, and the TMVA test data is kept as an independent evaluation sample? If so, does anyone know whether it is possible to specify the training/validation ratio?
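As a quick sanity check, the two numbers in that log line can be compared against a fixed validation fraction. The sketch below is plain Python, not TMVA code; `split_counts` is a hypothetical helper, and it assumes the validation count is obtained by truncating `n_total * fraction` to an integer.

```python
def split_counts(n_total, val_fraction):
    """Split n_total events into (n_train, n_val) for a given fraction.

    Hypothetical helper: assumes the validation count is truncated
    towards zero, which may differ from what TMVA does internally.
    """
    n_val = int(n_total * val_fraction)
    return n_total - n_val, n_val


# Numbers copied from the log line quoted above.
n_train_log = 5865462
n_val_log = 1466365
n_total = n_train_log + n_val_log

# Fraction of the TMVA training data that ended up in the validation sample.
observed_fraction = n_val_log / n_total
print(f"observed validation fraction: {observed_fraction:.4f}")

# Compare with what a 20% split would give under the truncation assumption.
print(split_counts(n_total, 0.2))
```

Under that assumption a 20% fraction reproduces both counts in the log, which is consistent with the training data being split 80/20.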
It is controlled by the ValidationSize=xx% option in