Overtraining test obtained from TMVA

Dear Sir,

I have a question about the KS test used for the overtraining check.
From this topic, it seems the KS test value should be close to 0.5 to avoid overtraining.

In my case, the KS test value for both signal and background is always smaller than 0.2; see the attached picture.
I trained with a BDT. May I ask how I can avoid overtraining, and how the KS test result should be understood here? Thanks!

[Attached image: Selection_070]

Best,
Jung

Hi Jung!

Indeed, the KS test should return about 0.5 on average for identically distributed data. To my understanding, however, the KS test is rather weak, in that it requires a large sample size to be effective.

In my quick test with two slightly separated normal distributions for signal and background, using a shallow BDT with 800 trees, the KS test started being informative at a sample size of around 100,000.
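To get a feeling for this sample-size dependence without running a full training, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions, not the setup of the quick test above: it applies an unbinned two-sample KS test with the asymptotic p-value formula directly to two Gaussian samples, whereas TMVA's overtraining plot compares binned classifier-output histograms, so the numbers will not match TMVA's. The function names (`kolmogorov_prob`, `ks_two_sample`, `mean_p_value`) and the shift of 0.05 are hypothetical choices made for this example.

```python
import math
import random


def kolmogorov_prob(lam):
    """Asymptotic Kolmogorov survival function:
    Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    if lam < 0.2:
        # Q is 1 to within rounding here, and the series converges slowly.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-15:
            break
    return min(max(total, 0.0), 1.0)


def ks_two_sample(a, b):
    """Return (D, p): the maximum distance between the two empirical CDFs
    and the asymptotic p-value for 'both samples share one distribution'."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step both empirical CDFs past x (this also handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    return d, kolmogorov_prob(lam)


def mean_p_value(n, shift, trials, seed=1):
    """Average KS p-value over `trials` pairs of Gaussian samples of size n,
    where the second sample is displaced by `shift`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(shift, 1.0) for _ in range(n)]
        total += ks_two_sample(a, b)[1]
    return total / trials


if __name__ == "__main__":
    # The same small shift is invisible at small n and obvious at large n.
    for n in (100, 1000, 10000, 100000):
        print(n, mean_p_value(n, shift=0.05, trials=10))
```

With `shift=0.0` the p-values scatter over the whole range from 0 to 1 and average roughly 0.5, so a single value below 0.5 is not by itself a sign of overtraining. With a small non-zero shift, the average p-value only drops clearly once the samples are large.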

To reduce overtraining, either use more data or regularise your BDT with e.g. simpler trees (MaxDepth=1 or 2), large MinNodeSize, and large nCuts. (I assume you already know this, judging from your table.)

You can also look into bagged boosting (UseBaggedBoost and BaggedSampleFraction) and feature subsampling with UseRandomisedTrees and UseNvars.
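As a rough sketch of how these options could be combined, the snippet below only assembles a TMVA-style option string in plain Python, so it runs without ROOT. The helper name `bdt_option_string` is hypothetical and all values are illustrative assumptions, not tuned recommendations; the option names are the ones mentioned above plus NTrees.

```python
def bdt_option_string(**overrides):
    """Build a colon-separated TMVA option string for a regularised BDT.
    All default values below are illustrative assumptions."""
    options = {
        "NTrees": 800,                # number of trees, as in the quick test above
        "MaxDepth": 2,                # shallow trees
        "MinNodeSize": "5%",          # minimum fraction of training events per node
        "nCuts": 20,                  # grid points scanned per variable (illustrative)
        "UseBaggedBoost": True,       # train each tree on a random subsample
        "BaggedSampleFraction": 0.5,  # relative size of that subsample
        "UseRandomisedTrees": False,  # set True for feature subsampling
        "UseNvars": 2,                # variables tried per split if randomised
    }
    options.update(overrides)

    parts = ["!H", "!V"]  # no help text, not verbose
    for name, value in options.items():
        if isinstance(value, bool):
            # Boolean flags are written as 'Name' or '!Name'.
            parts.append(name if value else "!" + name)
        else:
            parts.append(f"{name}={value}")
    return ":".join(parts)


if __name__ == "__main__":
    print(bdt_option_string())
    # In PyROOT the string would then be passed along the lines of
    #   factory.BookMethod(dataloader, ROOT.TMVA.Types.kBDT, "BDT",
    #                      bdt_option_string(MaxDepth=1))
    # assuming `factory` and `dataloader` have already been set up.
```

Keeping the options in a dictionary makes it easy to scan a few settings (for example MaxDepth=1 versus 2) and compare the train/test response distributions for each.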

Cheers,
Kim

Dear Kim,

Thanks a lot for the reply! I got it.

Best,
Jung