How does TMVA calculate the Kolmogorov probability?

denvercoder9 · February 2, 2018, 2:39pm

Hi,

I am trying to reproduce the values of the Kolmogorov-Smirnov test that is given in “(4b) Classifier Output Distributions (test and training samples superimposed)”. So far I tried two approaches that both fail to reproduce the values from (4b), i.e.

Signal: 0.052
Background: 0.236

My two approaches yield 0.2621, 0.4442 and 0.268, 0.446 for signal and background, respectively. These approaches are:

(1) I prepare a sorted list of all classifier values that I obtain from the “TestTree” and “TrainTree” in the TMVA output file. I split these lists w.r.t. their classID (signal / background) and fill these as pairs of two in TMath::KolmogorovTest . This yields the probabilities for signal (background): 0.2621 (0.4442). Due to my understanding this is the most precise way to calculate the Kolmogorov probability.

(2) I utilize TH1::KolmogorovTest and alter the binning in order to observe a convergence:

const string base = "tmvaOutput_LL_2012/";
const string methodName = "Fisher";
const size_t newBins = 1e4;

void KSTest(TFile *f, bool rebin) {
  TH1D *sigTrain, *bkgTrain;
  size_t sigBins, bkgBins;
  Double_t sigLeft, sigRight, bkgLeft, bkgRight;

  string histBase = base + "Method_" + methodName + "/" + methodName + "/MVA_" +
                    methodName + "_";
  string trainHist_S = histBase + "Train_S";
  string trainHist_B = histBase + "Train_B";

  sigTrain = (TH1D *)f->Get(trainHist_S.c_str());
  bkgTrain = (TH1D *)f->Get(trainHist_B.c_str());

  sigBins = sigTrain->GetNbinsX();
  sigLeft = sigTrain->GetBinLowEdge(1);
  sigRight = sigTrain->GetBinLowEdge(sigBins + 1);

  bkgBins = bkgTrain->GetNbinsX();
  bkgLeft = bkgTrain->GetBinLowEdge(1);
  bkgRight = bkgTrain->GetBinLowEdge(bkgBins + 1);

  if (rebin) {
    string trainTreeName = base + "/TrainTree";
    TTree *train = (TTree *)f->Get(trainTreeName.c_str());

    sigBins = bkgBins = newBins;
    sigTrain = new TH1D("sigTrain", "sigTrain", sigBins, sigLeft, sigRight);
    bkgTrain = new TH1D("bkgTrain", "bkgTrain", bkgBins, bkgLeft, bkgRight);
    train->Project("sigTrain", methodName.c_str(), "classID==0");
    train->Project("bkgTrain", methodName.c_str(), "classID==1");
  }

  string testTreeName = base + "/TestTree";
  TTree *test = (TTree *)f->Get(testTreeName.c_str());

  TH1D *sigTest = new TH1D("sigTest", "sigTest", sigBins, sigLeft, sigRight);
  TH1D *bkgTest = new TH1D("bkgTest", "bkgTest", bkgBins, bkgLeft, bkgRight);
  test->Project("sigTest", methodName.c_str(), "classID==0");
  test->Project("bkgTest", methodName.c_str(), "classID==1");

  cout << "Sig.: " << sigTrain->KolmogorovTest(sigTest) << '\n';
  cout << "Bkg.: " << bkgTrain->KolmogorovTest(bkgTest) << endl;

  //  sigTest->DrawNormalized();
  //  sigTrain->DrawNormalized("same,E1");
  //  bkgTest->DrawNormalized();
  //  bkgTrain->DrawNormalized("same,E1");
}

This latter approach converges for a fine-granular binning (n = 1e5 bins) and eventually yields the probabilities for signal (background): 0.268 (0.446).

The plots that I obtain by using the default binning in this latter approach looks very similar to those of (4b), but the Kolmogorov probability differs strongly (n = 40, sig. = 0.6222 / bkg. = 0.862). I am very sure that the way I calculate this probability is wrong and TMVA is doing a good job, but somehow I am eager to find the flaw that I am doing.

Thank you for your help.

Axel · February 5, 2018, 5:12pm

@moneta maybe you can help here?

denvercoder9 · February 5, 2018, 6:20pm

Thanks for helping me out. In the meantime I found how the value is calculated. It uses the ‘X’ option of TH1::KolmogorovTest. I think that was done for performance considerations? Anyhow, I pasted a small example in my /afs/cern.ch/work/n/nmeinert/public/ks_test, including a sample root file such that you can test it. If you have access you can run root -l ks_test.C (code given below). It shows a significant difference between using this ‘X’ option (ks_test1()) and approach (1) from my first post (ks_test2()). Is my implementation correct?

The output:

> root -l ks_test.C 
root [0] 
Processing ks_test.C...
KS-TEST #1
 * TH1::KolmogorovTest(...): 0.622333
 * TH1::KolmogorovTest(..., "X"): 0.049

KS-TEST #2
 * TMath::KolmogorovTest(...): 0.262144
 * [low res.] TH1::KolmogorovTest(...): 0.622333
 * [low res.] TH1::KolmogorovTest(..., "X"): 0.047
 * [high res.] TH1::KolmogorovTest(...): 0.267935
root [1] .q

The file ks_test.C

constexpr const char METHOD[] = "Fisher";

TH1 *getTestHistFromFile(TFile *f) {
  return (TH1 *)f->Get("tmvaOutput_LL_2012/Method_Fisher/Fisher/MVA_Fisher_S");
}

TH1 *getTrainHistFromFile(TFile *f) {
  return (TH1 *)f->Get(
      "tmvaOutput_LL_2012/Method_Fisher/Fisher/MVA_Fisher_Train_S");
}

TTree *getTestTreeFromFile(TFile *f) {
  return (TTree *)f->Get("tmvaOutput_LL_2012/TestTree");
}

TTree *getTrainTreeFromFile(TFile *f) {
  return (TTree *)f->Get("tmvaOutput_LL_2012/TrainTree");
}

string getMethodName() { return "Fisher"; }

void ks_test1(TFile *f) {
  TH1 *htest = getTestHistFromFile(f);
  TH1 *htrain = getTrainHistFromFile(f);
  cout << " * TH1::KolmogorovTest(...): " << htest->KolmogorovTest(htrain)
       << '\n';
  cout << " * TH1::KolmogorovTest(..., \"X\"): "
       << htest->KolmogorovTest(htrain, "X") << endl;
}

void ks_test2(TFile *f) {
  TTreeReader trainReader(getTrainTreeFromFile(f));
  TTreeReader testReader(getTestTreeFromFile(f));

  const string method = METHOD;
  auto fill = [&method](TTreeReader &reader, vector<Double_t> &values,
                        bool signal) {
    TTreeReaderValue<Float_t> classifier(reader, method.c_str());
    TTreeReaderValue<Int_t> type(reader, "classID");
    while (reader.Next()) {
      if (signal ? *type == 0 : *type != 0) {
        values.push_back(static_cast<Double_t>(*classifier));
      }
    }
    sort(values.begin(), values.end());
    reader.Restart();
  };
  vector<Double_t> test, train;
  fill(testReader, test, true);
  fill(trainReader, train, true);

  cout << " * TMath::KolmogorovTest(...): "
       << TMath::KolmogorovTest(test.size(), &test[0], train.size(), &train[0],
                                "")
       << endl;

  TH1 *href = getTestHistFromFile(f);
  auto bins = href->GetNbinsX();
  auto left = href->GetBinLowEdge(1);
  auto right = href->GetBinLowEdge(bins + 1);

  TH1 *htest = new TH1I("htest", "htest", bins, left, right);
  TH1 *htrain = new TH1I("htrain", "htrain", bins, left, right);
  testReader.GetTree()->Project("htest", METHOD, "classID==0");
  testReader.Restart();
  trainReader.GetTree()->Project("htrain", METHOD, "classID==0");
  trainReader.Restart();
  cout << " * [low res.] TH1::KolmogorovTest(...): "
       << htest->KolmogorovTest(htrain) << '\n';
  cout << " * [low res.] TH1::KolmogorovTest(..., \"X\"): "
       << htest->KolmogorovTest(htrain, "X") << endl;

  TH1 *htest_high = new TH1I("htest_high", "htest_high", 1e5, left, right);
  TH1 *htrain_high = new TH1I("htrain_high", "htrain_high", 1e5, left, right);
  testReader.GetTree()->Project("htest_high", METHOD, "classID==0");
  testReader.Restart();
  trainReader.GetTree()->Project("htrain_high", METHOD, "classID==0");
  trainReader.Restart();
  cout << " * [high res.] TH1::KolmogorovTest(...): "
       << htest_high->KolmogorovTest(htrain_high) << '\n';
}

void ks_test() {
  TFile *f = TFile::Open("tmvaOutput_LL_2012.root", "READ");

  cout << "KS-TEST #1\n";
  ks_test1(f);

  cout << '\n';

  cout << "KS-TEST #2\n";
  ks_test2(f);
}

moneta · February 6, 2018, 9:30am

Hi,

It is normal that can be a significant different between option “X” and standard KS tests.
It is known that for histograms (i.e. binned data) the KS test returns a non-uniform p-value distribution for the null, this is due to the fact that the KS test should be applied to unbinned data and not histograms.
In particular (as mentioned in Note3 of https://root.cern.ch/doc/v608/classTH1.html#aeadcf087afe6ba203bcde124cfabbee4 ) the obtained p-value will be biased to greater values (as it is in your case)
Using Option “X”, the p-value is re-calibrated using pseudo-experiments, where the distribution of the KS test statistics for the null hypothesis is constructed.

Lorenzo

denvercoder9 · February 6, 2018, 10:17am

Dear Lorenzo,
thank you for your help. I got your point that obtaining the KS probability from binned data introduces a bias, but from my numbers it seems that the bias from the sampling approach is much worse. (assuming that my way to calculate the unbiased value is correct)

moneta · February 6, 2018, 2:38pm

How are you sure that the p-value obtained from sampling the histogram is biased ?
Did you try to obtain the correct un-binned KS probability ?

Lorenzo

denvercoder9 · February 6, 2018, 2:44pm

yes. I showed these numbers above.

moneta · February 6, 2018, 3:07pm

Hi,

For a bias you have to show the distribution of the obtained p-values. I don’t understand what you can deduce from just the number above.

Lorenzo

denvercoder9 · February 6, 2018, 3:23pm

Dear Lorenzo,
sorry for this confusion; my problem is the following:
I calculated the unbiased value:

unbiased: 0.262144

whereas TMVA (that uses the TH1::KolmogorovTest(..., "X")) yields:

TMVA: 0.049

i.e. O(10) less. Out of curiosity I calculated TH1::KolmogorovTest(..., "") with a binning of n=40 (that is the default value for TMVA):

binned data (n = 40): 0.622333

of course this value is biased, but nevertheless, way more precise than the value from TMVA. In order to check these numbers I increased the binning from n = 40...1e5 and observed a nice convergence:

binned data (n = 1e5): 0.267935

i.e. very close to the unbiased value.

And finally my question: how come that the TMVA approach is much worse than the unbiased or even the naive approach using two histograms with default binning (no ‘X’ option)? What am I doing wrong?

moneta · February 6, 2018, 5:13pm

Thank you very much !
I understand now clearly your number.
The problem could be then with the “X” option, that maybe is buggy or it does not work for this case. It is something to test

Lorenzo

denvercoder9 · February 6, 2018, 5:19pm

Dear Lorenzo,

thank you for your patience.

I agree. Furthermore, I don’t understand why TMVA does not use the unbiased approach. I think it is not very time consuming and the complete trees are saved in the root file.

moneta · February 6, 2018, 5:24pm

Hi ,

I fully agree that TMVA should use the unbinned data, since they are available. It should anyway be faster than generating toys.
We will change this

Lorenzo

Andrej-CMS · March 2, 2020, 4:21pm

@moneta
Coming back to this issue.
Has the calculation of the KS probability in TMVAGUI ever been changed?

If I look at the version v610, there still seem to be the calcuation with the “X” option: https://root.cern/doc/v610/mvas_8cxx_source.html

moneta · March 2, 2020, 4:46pm

Hi,

The calculation of the KS probability is still done with the X option. It is true it is something we need to revise and eventually improve

Lorenzo

moneta · March 3, 2020, 4:10pm

Hi,
I have found a bug in TH1::KolmogorovTest when using the "X’ option. One of the histogram wa snot re-generated causing a bias in the distribution of the distance. This caused a op-value significantly bias on the lower side, as this post indicates.

Attached is a figure obtained from histogram drawn from uniform distribution. You would expect flat p-values, instead for TH1::KolmogorovTest you see a slightly bias topwards 1, that is expected (top plot), while for `TH1::KolmogorovTest(option=‘X’) a strong bias to lower values (middle plot).
The lower plot shows the corrected result

Lorenzo

p-values.pdf (19.5 KB)

Andrej-CMS · March 6, 2020, 11:11am

Hi @moneta,

Thanks a lot for making the study! These are good news.
Are there any plans on releasing a newer version of TMVA with this patch?

Kind regards,
Andrej

moneta · March 6, 2020, 12:23pm

Hi,

I am planning to add the patch to the new 6.20.02 that will be released in a 1-2 weeks

Lorenzo