TMath::KolmogorovTest with different sized arrays

prophecy · January 12, 2009, 1:21pm

I have been playing with the KS test for a while now and ran into a problem where the binned KS test always returned zero (or very close to zero). After reading some of the documentation I decided to switch to the unbinned KS test in the TMath class, but I got basically the same results. Looking at the code, I see that if the two input arrays have different sizes, the loop runs over the length of the shorter array and not over the whole of both datasets. Is this accurate? If so, you are not covering the entire CDF range, only a small portion of it. For example, if one dataset has 1000 entries and the other has 10000, you only see 1/10 of the larger dataset's CDF, which is probably just the lower-left tail, where the KS test is not sensitive.

moneta · January 12, 2009, 1:57pm

Hi,
Which ROOT version are you using? The TMath function was improved about a year ago for the case of different array sizes, see

root.cern.ch/root/htmldoc/TMath. … ogorovTest

Regards,

Lorenzo

prophecy · January 12, 2009, 2:13pm

  *******************************************
  *                                         *
  *        W E L C O M E  to  R O O T       *
  *                                         *
  *   Version   5.20/00      24 June 2008   *
  *                                         *
  *  You are welcome to visit our Web site  *
  *          http://root.cern.ch            *
  *                                         *
  *******************************************

ROOT 5.20/00 (trunk@24524, De 11 2008, 01:52:00 on linux)

CINT/ROOT C/C++ Interpreter version 5.16.29, Jan 08, 2008

I have been reading the current documentation. The problem I describe is in the current version of ROOT. It is possible that not many people compare datasets with such different numbers of events, so the problem is less pronounced for them. The current implementation builds a running integral to estimate the difference between the two CDFs. I think that the only real way to get an accurate KS test is to compute the full CDF of each distribution and then scan across the entire CDF. This is much slower than what is implemented now but fixes the problems that I mention.

Justace

moneta · January 12, 2009, 2:41pm

I have not observed the problem you mention, even with data sets of very different sizes.
I have not understood what you mean by computing the full CDF.

The test creates the empirical CDFs and then scans across their whole range.
What do you mean by full CDF?
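
To make the point about scanning the whole range concrete, here is a minimal sketch of a two-sample KS distance computed by walking two sorted arrays together. It is only an illustration and not the ROOT source: `ksDistance` is a hypothetical helper name, and whether `TMath::KolmogorovTest` is written exactly this way is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample KS distance D = max_x |F_a(x) - F_b(x)| for two sorted samples.
// ksDistance is a hypothetical name used for illustration, not a ROOT function.
double ksDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t ia = 0, ib = 0;  // entries consumed so far from a and b
    double dmax = 0.0;

    // Walk both samples in order of increasing x.
    while (ia < a.size() && ib < b.size()) {
        const double x = std::min(a[ia], b[ib]);
        // Consume every entry equal to x from both samples (handles ties).
        while (ia < a.size() && a[ia] == x) ++ia;
        while (ib < b.size() && b[ib] == x) ++ib;
        // Empirical CDFs evaluated at x are ia/na and ib/nb.
        dmax = std::max(dmax, std::fabs(ia / na - ib / nb));
    }
    // The loop ends when one sample is exhausted. From there on its CDF is 1
    // and the other CDF only rises toward 1, so the difference cannot grow.
    return dmax;
}
```

The loop stops as soon as either array runs out, but that is not the same as looking at only a fraction of the CDF range: the index into each array advances according to the data values, so with 1000 and 10000 entries drawn from similar distributions both indices reach the upper tail at about the same time. After one sample is exhausted, the remaining steps can only reduce the difference, so the maximum has already been seen.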

Are you sure your data are compatible? I would also check visually, using for example a QQ plot (TGraphQQ).

Lorenzo

prophecy · January 12, 2009, 3:42pm

After some investigation it seems that I have misread the code in ROOT. I am now double-checking.

Justace