Hello, I have a question about the interpretation of the K-S test.
First of all, as I understand from reading online, it is a test that determines whether two samples come from the same parent distribution. What does that mean exactly? And is it good or bad if they do or don't?
Secondly, I read that the K-S test is used to check for overtraining. How do I do that? If the probability (for sig/bg) is close to zero, does that mean the method is overtrained? Why is that?
When we have trained an ML method on a training data set, we can evaluate the method on those same training samples to get some output, and that output will have some distribution. We can then run the trained method on a set of points that were left out of training (usually called the test set) to get another set of outputs.
If we compare the two outputs, their distributions should be similar; if not, the method is probably overfitting. (In the opposite case of underfitting, the performance will be equally bad on both the training and the test sets.)
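The comparison above can be sketched with SciPy's two-sample K-S test. This is only an illustration, not your actual setup: the `train_scores` and `test_scores` arrays here are simulated stand-ins for whatever your classifier actually outputs on the two sets.

```python
# Minimal sketch: two-sample K-S test on training vs. test outputs.
# The score arrays are simulated placeholders for real classifier outputs.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical classifier scores on the training and test sets,
# drawn here from the same parent distribution.
train_scores = rng.beta(2, 5, size=2000)
test_scores = rng.beta(2, 5, size=1000)

stat, p_value = ks_2samp(train_scores, test_scores)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A large p-value means the two score distributions are statistically compatible, i.e. this test alone gives no evidence of overtraining; a p-value close to zero means the training and test outputs look different, which is the overfitting warning sign.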
This is the underlying reasoning. I'm a bit fuzzy on the details of the K-S test itself, so you'll have to interpret this according to your situation.
Thank you for your response. I have one more question, though. Where does the boundary of overtraining lie? What I mean is: when are the two distributions acceptably similar (so we know that no over/undertraining has happened), and when are they not?
This I cannot help you with, unfortunately. But basically, I think you want the difference to be "statistically insignificant": if you can't tell whether your data came from the training or the test set, you're fine. Otherwise you'll have to motivate why the difference is acceptable.
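One common (but arbitrary) way to make this concrete is a significance-level cut on the K-S p-value, e.g. alpha = 0.05. The sketch below is purely illustrative: the two "test" samples are simulated, one from the same parent distribution as the training sample and one shifted, to show how the decision rule behaves in each case.

```python
# Hedged sketch of a p-value decision rule; alpha = 0.05 is a common
# but arbitrary choice that you would need to justify for your analysis.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=5000)
test_same = rng.normal(0.0, 1.0, size=2000)     # same parent distribution
test_shifted = rng.normal(0.3, 1.0, size=2000)  # shifted: distinguishable

alpha = 0.05
for label, test in [("same parent", test_same), ("shifted", test_shifted)]:
    _, p = ks_2samp(train, test)
    verdict = "compatible, fine" if p > alpha else "different, investigate"
    print(f"{label}: p = {p:.3g} -> {verdict}")
```

With samples this large, the shifted case yields a p-value far below any reasonable alpha, while the same-parent case typically does not; that is the "can't tell which set the data came from" criterion expressed as a number.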