Timing of kNN classification in TMVA

pnef · October 29, 2013, 11:00pm

Dear experts,

I’m trying to use the k-nearest neighbor algorithm to perform a signal vs background classification based on two input variables. I book the method in TMVA as follows:

factory.BookMethod( TMVA.Types.kKNN,"KNN20","H:nkNN=20:ScaleFrac=0.8:UseKernel=F:UseWeight=F:Trim=True:BalanceDepth=6" )

While this works well on a small number of events, I’m running into troubles when using a sizeable dataset of say O(1M) events. In particular, I find that creating the kd-tree takes a long time and ends with a segfault as follows:

--- KNN20                    : Creating kd-tree with 1738910 events
--- ModulekNN                : <Fill> Will trim all event types to 503886 events
--- ModulekNN                : <Fill> Erased 731138 events
--- ModulekNN                : Computing scale factor for 1d distributions: (ifrac, bottom, top) = (80%, 10%, 90%)
--- ModulekNN                : Optimizing tree for 2 variables with 1007772 values
--- ModulekNN                : <Fill> Class 1 has   503886 events
--- ModulekNN                : <Fill> Class 2 has   503886 events
--- KNN20                    : End of training
--- KNN20                    : Elapsed time for training with 1738910 events: 2.49e+03 sec
--- KNN20                    : Create MVA output for classification on training sample
--- KNN20                    : Evaluation of KNN20 on training sample (1738910 events)
Segmentation fault           : [.......................] (0%, time left: unknown)

When testing the time it needs for training (i.e. creating the optimized kd-tree) for a few different number of events, I find that the time grows much faster than the promised O(logN) (or worst case O(N)): see attached plot.

Did anybody observe a similar behaviour? Is this due to a misconfiguration on my side or a problem in the implementation?
Did anybody manage to run a kNN classification on O(1M) events without running into these problem?

Im running ROOT Version 5.34/07 and TMVA Version 4.1.4.

Thank you for your help,

Pascal
timing_lin.pdf (13.1 KB)

pnef · November 24, 2013, 11:02pm

For reference – in case somebody runs into the same problem:

After some more digging, I noticed that the problem only happens when the input variables in the kNN discrimination take discrete values. In my case, the two variables I’m using are continuous but show spikes at discrete values, causing many entries to have 0 “distance” from each other.

Rustem Ospanov, the author of the kNN implementation in TMVA, confirmed that the construction of the kd-tree is based on the assumption of continuously distributed variables. He proposed that I train the kNN discriminator on a subsample of events where the variables are continuous and handle the discrete cases separately. In my particular case, this work-around works well.