Can someone explain what the number of folds that you specify for hyperparameter optimisation (HPO) in TMVA actually refers to?
From my understanding (which could well be wrong):
HPO should work with cross validation (CV) implemented internally. The HPO method should choose a set of hyperparameters, perform CV with a given number of folds (n_folds), then take the average figure of merit (FoM) over the n_folds and return that as the ‘score’ for that particular set of hyperparameters, in order to avoid overtraining. If CV weren’t used here, we could end up tuning our hyperparameters to the particular test/training split we happen to have chosen. The HPO method would then vary the hyperparameters slightly and repeat the same process, eventually giving you back the optimised parameters: the ones that give the best averaged FoM.
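The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of fold-averaged scoring inside a grid scan, not TMVA code: the names `k_fold_indices`, `cv_score`, `grid_search` and `toy_fom`, the threshold "classifier" and the toy data are all made up for the example.

```python
import random
from statistics import mean

def k_fold_indices(n_events, n_folds, seed=0):
    """Shuffle event indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cv_score(params, data, n_folds, train_and_score):
    """Average the figure of merit of one hyperparameter set over all folds."""
    folds = k_fold_indices(len(data), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, evaluate on fold i.
        train = [data[j] for k, fold in enumerate(folds) if k != i for j in fold]
        test = [data[j] for j in test_idx]
        scores.append(train_and_score(params, train, test))
    return mean(scores)

def grid_search(grid, data, n_folds, train_and_score):
    """Return the candidate with the best fold-averaged figure of merit."""
    return max(grid, key=lambda p: cv_score(p, data, n_folds, train_and_score))

def toy_fom(params, train, test):
    """Hypothetical FoM: accuracy of a fixed threshold cut (no real training)."""
    correct = sum(int(x > params["cut"]) == label for x, label in test)
    return correct / len(test)

# Toy data: events with x > 0.6 are labelled as signal (1), the rest as background (0).
data = [(i / 100, int(i / 100 > 0.6)) for i in range(100)]
grid = [{"cut": 0.3}, {"cut": 0.6}, {"cut": 0.9}]
best = grid_search(grid, data, n_folds=5, train_and_score=toy_fom)
print(best)
```

Each candidate gets a single score, the mean over the folds, so this scheme returns one set of hyperparameters regardless of how many folds are used.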
But in the TMVA implementation, you specify the number of folds, and then you get that many different sets of hyperparameter values back. I don’t understand what the NumFolds setting controls here, as I thought all the CV should go on behind the scenes?
Thank you for your response.
So if the TMVA HPO implementation doesn’t do nested cross validation, I assume then that it performs it only once.
So, do you mean that it performs cross validation for each choice of hyperparameters? If so, why does it return multiple results?
I double-checked the implementation, and TMVA does perform nested cross validation; it was just a bit hidden, so I didn’t realise at first. However, it currently does the HPO part with fixed parameter limits.
This investigation also reminded me that it’s on our to-do list to open up the specification so that one can set the parameter limits.
Okay - so I assume then that this NumFolds() option refers to the number of folds in the outer cross validation loop, seeing as that is the number of sets of hyperparameter values that I get back.
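To make the outer/inner distinction concrete, here is a minimal sketch of generic nested cross validation in plain Python. It is not the TMVA implementation: `make_folds`, `split`, `nested_cv` and `toy_fom` are hypothetical names, and the threshold cut and toy data are assumptions for the example. It shows why one set of hyperparameters comes back per outer fold.

```python
import random
from statistics import mean

def make_folds(items, n_folds, seed=0):
    """Shuffle a copy of items and deal it into n_folds disjoint folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def split(folds, held_out):
    """Return (train, test), where test is the held-out fold."""
    train = [x for k, fold in enumerate(folds) if k != held_out for x in fold]
    return train, folds[held_out]

def nested_cv(data, grid, n_outer, n_inner, fom):
    """Return one (best_params, outer_score) pair per outer fold."""
    results = []
    outer = make_folds(data, n_outer)
    for i in range(n_outer):
        outer_train, outer_test = split(outer, i)
        # Inner loop: tune on the outer training part only.
        inner = make_folds(outer_train, n_inner, seed=i + 1)

        def inner_score(params):
            return mean(fom(params, *split(inner, j)) for j in range(n_inner))

        best = max(grid, key=inner_score)
        # Outer loop: judge the tuned parameters on events the tuning never saw.
        results.append((best, fom(best, outer_train, outer_test)))
    return results

def toy_fom(params, train, test):
    """Hypothetical FoM: accuracy of a fixed threshold cut (no real training)."""
    correct = sum(int(x > params["cut"]) == label for x, label in test)
    return correct / len(test)

data = [(i / 100, int(i / 100 > 0.6)) for i in range(100)]
grid = [{"cut": 0.3}, {"cut": 0.6}, {"cut": 0.9}]
results = nested_cv(data, grid, n_outer=5, n_inner=4, fom=toy_fom)
for params, score in results:
    print(params, score)
```

With `n_outer=5` the function returns five results, one per outer fold, while `n_inner` only affects how each of those five is tuned.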
How then would I change the underlying number of folds (of the inner loop)?
Hi, currently the default is to do a grid scan of the hyperparameters. However, since you cannot currently specify the search space as a user, you also cannot control “the number of folds” in the inner loop.
I have one more question, about the figure of merit that the hyperparameter optimisation can be told to optimise. Online, I have found mention of ROC Integral and Separation. What does separation refer to: separation between test and training sets? Separation between signal and background? Or something else…
Also is there a way to specify the number of signal and background events to be used without specifying the number for testing and training?
I can only find an options string for the PrepareTrainingAndTestTree function of the data loader of the kind “nSignal_test=1000:nSignal_train=1000:nBackground_test=1000:nBackground_train=1000”. Swapping to just “nSignal=” doesn’t work, and I don’t understand how specifying separate numbers for testing and training is compatible with cross validation. Surely the split between the two categories just depends on the number of folds in the internal loop?
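For reference, the arithmetic behind that last point can be sketched as follows for a plain k-fold split. This is generic k-fold bookkeeping, not a statement about how TMVA treats the test/train options; `fold_sizes` is a hypothetical helper and the event count is made up.

```python
def fold_sizes(n_events, n_folds):
    """Number of test events in each fold for a plain k-fold split."""
    base, extra = divmod(n_events, n_folds)
    # The first `extra` folds take one additional event each.
    return [base + 1 if i < extra else base for i in range(n_folds)]

n_events, n_folds = 2000, 5
for test_size in fold_sizes(n_events, n_folds):
    # Every event not in the test fold is used for training in that round.
    print(f"train={n_events - test_size} test={test_size}")
```

In such a scheme only the total number of events and the number of folds are free choices; the per-round test and training sizes follow from them.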