
Overtraining/Overfitting in BDTs for HEP and the wider community


I’ve been looking into TMVA alongside some of the less HEP-specific multivariate analysis tools (primarily BDTs, e.g. AdaBoost and gradient-boosting implementations such as XGBoost). In the wider, non-HEP community, whenever overtraining/overfitting is discussed it is usually in the context of learning curves (e.g. performance as a function of the number of trees in XGBoost),

where overtraining is said to occur when performance on the test sample starts to get worse even though performance on the training sample keeps improving. So in the example above, beyond 30 trees we would say we are overtraining. At about 5 trees the performance on test and train is very similar, but since performance is still improving for both, we are undertraining, i.e. more training will improve both test and training performance.
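The learning-curve picture described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn’s `GradientBoostingClassifier` as a stand-in for XGBoost and log loss as the metric; the dataset and all parameter choices are illustrative, not taken from the thread.

```python
# Trace train/test log loss as a function of the number of trees,
# i.e. the learning curve used to diagnose over/undertraining.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)

# staged_predict_proba yields the model's predictions after each
# additional tree, so one curve point per ensemble size.
train_curve = [log_loss(y_tr, p) for p in clf.staged_predict_proba(X_tr)]
test_curve = [log_loss(y_te, p) for p in clf.staged_predict_proba(X_te)]

# Training loss keeps falling; the test loss typically flattens or turns
# back up once the ensemble starts to overfit. The turning point marks
# where overtraining (in the ML sense) begins.
best_n_trees = min(range(len(test_curve)), key=test_curve.__getitem__) + 1
print(best_n_trees)
```

Plotting `train_curve` and `test_curve` against the tree index reproduces the kind of figure discussed above.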

However, in TMVA and HEP in general, overtraining is mainly used to describe any difference at all between test and training performance, with plots such as


very commonly being used to judge it. This practice doesn’t seem to be common at all outside of HEP; I struggled to find a single example of it that wasn’t TMVA. In the wider community a difference between training and test performance (as measured, say, by the absolute gap between the two lines in a learning curve) is accepted as long as the performance on the independent test sample is well behaved (usually established via learning curves or cross-validation); performance on the training sample is generally considered a poor metric of performance and isn’t used.

A recent talk on modern TMVA by Stefan Wunsch seemed to suggest that this is a HEP thing, but the context wasn’t 100% clear to me (https://indico.cern.ch/event/773049/contributions/3476171/attachments/1936050/3208338/CHEP_2019__Machine_Learning_with_ROOT_TMVA.pdf). Demanding that the test and train samples show similar shapes also seems to me to favour undertraining.

I guess I don’t have any particular questions; maybe just to start some discussion about HEP-specific trends/conventions.


Indeed there’s no specific question, so it’s hard to give an answer. :smiley:
I maybe saw an implicit question, though:
“Is HEP wrongly measuring overtraining?”
To that I would say:
I don’t think so. Whereas the ML community chooses one metric (e.g. log loss) and plots it vs. training epochs, HEP snapshots the classifier response at the last point of that graph. If the distributions match, you can infer that training and test will yield the same performance. If they don’t, you are in the regime of overtraining. The way to visualise this just developed differently.
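The HEP-style snapshot described here can be sketched as comparing the distribution of the classifier response on the training and test samples, per class, e.g. with a two-sample Kolmogorov–Smirnov test (which is what the TMVA overtraining-check plot reports). This is a self-contained sketch assuming scikit-learn and scipy; the data and settings are illustrative.

```python
# HEP-style overtraining check: compare the response ("BDT score")
# distributions of train vs test with a two-sample KS test.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Compare the score distributions separately for each true class,
# as in the TMVA overtraining-check plot (signal and background).
for cls in (0, 1):
    s_tr = clf.predict_proba(X_tr[y_tr == cls])[:, 1]
    s_te = clf.predict_proba(X_te[y_te == cls])[:, 1]
    stat, p = ks_2samp(s_tr, s_te)
    # A small KS statistic / non-tiny p-value means the two response
    # shapes are compatible, i.e. the snapshot check passes.
    print(cls, round(stat, 3), round(p, 3))
```

Note this is a single snapshot at a fixed number of trees, rather than a curve over training iterations.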

Hi Stephan,

No, I agree: if training and test show the same response, that is explicitly not overtraining, so I would never say that HEP is wrongly measuring overtraining. But HEP’s statement seems to go further, treating the reverse (“test/train disagree = overtraining”) as true, whereas in the ML community this doesn’t follow directly from the log-loss curves.

I guess it comes down to a definition of overtraining.

By using a snapshot of the response at one point, HEP effectively uses: “any difference in performance between test and train is overtraining”.

The ML community, meanwhile, seems to use: “if the performance on the independent test sample gets worse with increased training, it’s overtraining”.
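The ML-style definition lends itself directly to early stopping: keep adding trees until the loss on a held-out validation sample stops improving. A minimal sketch, assuming scikit-learn’s built-in early stopping (the parameters and data are illustrative):

```python
# Early stopping as the operational form of the "ML definition":
# stop training once the validation loss stops improving.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# validation_fraction holds out part of the training data;
# n_iter_no_change stops once the held-out loss has not improved
# for that many consecutive iterations.
clf = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

# Number of trees actually grown before stopping (at most 500).
print(clf.n_estimators_)
```

Under this definition, training is stopped at (or before) the point where the test curve turns upward, rather than by demanding that train and test responses agree.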

So passing the HEP check implies passing the ML one, but not vice versa. Maybe it’s less about “is it overtraining?” and more about “it’s overtraining, but it doesn’t matter for the performance on the independent test/validation samples”.