Dear experts,
I’m trying to use the k-nearest neighbor algorithm to perform a signal vs background classification based on two input variables. I book the method in TMVA as follows:
factory.BookMethod( TMVA.Types.kKNN,"KNN20","H:nkNN=20:ScaleFrac=0.8:UseKernel=F:UseWeight=F:Trim=True:BalanceDepth=6" )
While this works well on a small number of events, I’m running into troubles when using a sizeable dataset of say O(1M) events. In particular, I find that creating the kd-tree takes a long time and ends with a segfault as follows:
--- KNN20 : Creating kd-tree with 1738910 events
--- ModulekNN : <Fill> Will trim all event types to 503886 events
--- ModulekNN : <Fill> Erased 731138 events
--- ModulekNN : Computing scale factor for 1d distributions: (ifrac, bottom, top) = (80%, 10%, 90%)
--- ModulekNN : Optimizing tree for 2 variables with 1007772 values
--- ModulekNN : <Fill> Class 1 has 503886 events
--- ModulekNN : <Fill> Class 2 has 503886 events
--- KNN20 : End of training
--- KNN20 : Elapsed time for training with 1738910 events: 2.49e+03 sec
--- KNN20 : Create MVA output for classification on training sample
--- KNN20 : Evaluation of KNN20 on training sample (1738910 events)
Segmentation fault : [.......................] (0%, time left: unknown)
When testing the time it needs for training (i.e. creating the optimized kd-tree) for a few different number of events, I find that the time grows much faster than the promised O(logN) (or worst case O(N)): see attached plot.
Did anybody observe a similar behaviour? Is this due to a misconfiguration on my side or a problem in the implementation?
Did anybody manage to run a kNN classification on O(1M) events without running into these problem?
Im running ROOT Version 5.34/07 and TMVA Version 4.1.4.
Thank you for your help,
- Pascal
timing_lin.pdf (13.1 KB)