Hi,
I am trying to reproduce the Kolmogorov-Smirnov test values that are given in “(4b) Classifier Output Distributions (test and training samples superimposed)”. So far I have tried two approaches, both of which fail to reproduce the values from (4b), i.e.
Signal: 0.052
Background: 0.236
My two approaches yield 0.2621, 0.4442 and 0.268, 0.446 for signal and background, respectively. These approaches are:
(1) I prepare sorted lists of all classifier values obtained from the “TestTree” and “TrainTree” in the TMVA output file, split these lists by their classID (signal / background), and pass each matching pair of arrays to TMath::KolmogorovTest . This yields the probabilities for signal (background): 0.2621 (0.4442). To my understanding, this is the most precise way to calculate the Kolmogorov probability.
(2) I utilize TH1::KolmogorovTest and vary the binning in order to observe a convergence:
#include <iostream>
#include <string>

#include "TFile.h"
#include "TH1D.h"
#include "TTree.h"

using namespace std;

const string base = "tmvaOutput_LL_2012/";
const string methodName = "Fisher";
const size_t newBins = 1e4;

void KSTest(TFile *f, bool rebin) {
  // Retrieve the training histograms that TMVA wrote to the output file.
  string histBase = base + "Method_" + methodName + "/" + methodName +
                    "/MVA_" + methodName + "_";
  string trainHist_S = histBase + "Train_S";
  string trainHist_B = histBase + "Train_B";
  TH1D *sigTrain = (TH1D *)f->Get(trainHist_S.c_str());
  TH1D *bkgTrain = (TH1D *)f->Get(trainHist_B.c_str());

  // Remember the original axis ranges so all histograms share them.
  size_t sigBins = sigTrain->GetNbinsX();
  Double_t sigLeft = sigTrain->GetBinLowEdge(1);
  Double_t sigRight = sigTrain->GetBinLowEdge(sigBins + 1);
  size_t bkgBins = bkgTrain->GetNbinsX();
  Double_t bkgLeft = bkgTrain->GetBinLowEdge(1);
  Double_t bkgRight = bkgTrain->GetBinLowEdge(bkgBins + 1);

  if (rebin) {
    // Refill the training histograms from the TrainTree with a finer
    // binning. Note: base already ends in '/', so no extra slash is needed.
    TTree *train = (TTree *)f->Get((base + "TrainTree").c_str());
    sigBins = bkgBins = newBins;
    sigTrain = new TH1D("sigTrain", "sigTrain", sigBins, sigLeft, sigRight);
    bkgTrain = new TH1D("bkgTrain", "bkgTrain", bkgBins, bkgLeft, bkgRight);
    train->Project("sigTrain", methodName.c_str(), "classID==0");
    train->Project("bkgTrain", methodName.c_str(), "classID==1");
  }

  // Fill the test histograms with the same binning as the training ones.
  TTree *test = (TTree *)f->Get((base + "TestTree").c_str());
  TH1D *sigTest = new TH1D("sigTest", "sigTest", sigBins, sigLeft, sigRight);
  TH1D *bkgTest = new TH1D("bkgTest", "bkgTest", bkgBins, bkgLeft, bkgRight);
  test->Project("sigTest", methodName.c_str(), "classID==0");
  test->Project("bkgTest", methodName.c_str(), "classID==1");

  cout << "Sig.: " << sigTrain->KolmogorovTest(sigTest) << '\n';
  cout << "Bkg.: " << bkgTrain->KolmogorovTest(bkgTest) << endl;

  // sigTest->DrawNormalized();
  // sigTrain->DrawNormalized("same,E1");
  // bkgTest->DrawNormalized();
  // bkgTrain->DrawNormalized("same,E1");
}
This latter approach converges for a fine-grained binning (n = 1e5 bins) and eventually yields the probabilities for signal (background): 0.268 (0.446).
The plots that I obtain by using the default binning in this latter approach look very similar to those of (4b), but the Kolmogorov probabilities differ strongly (n = 40, sig. = 0.6222 / bkg. = 0.862). I am quite sure that the way I calculate this probability is wrong and that TMVA is doing a good job, but I am eager to find the mistake I am making.
Thank you for your help.