Bradley-Fayyad-Reina algorithm implementation using TMVA

PRISHIta123 · January 8, 2020, 2:18pm

I would like to try implementing the BFR algorithm for cluster analysis. It partitions the data points into three sets:
1. The retained set (RS): data points that are not recognized as belonging to any cluster and need to be retained in the buffer;
2. The discard set (DS): data points that are close enough to a cluster to be absorbed into it, and that can be discarded once the cluster's summary statistics have been updated;
3. The compression set (CS): summary statistics for groups of points that lie close to each other but not close enough to any existing cluster to be absorbed into it.
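
To make "summary statistics" concrete, here is a minimal sketch in plain Python (not TMVA code). It assumes the usual BFR bookkeeping of a point count plus per-dimension sum and sum of squares; the class name `ClusterSummary` and its methods are hypothetical names chosen for illustration.

```python
class ClusterSummary:
    """Summary of a group of points: count, per-dimension sum, sum of squares."""

    def __init__(self, dim):
        self.n = 0                      # number of points absorbed so far
        self.total = [0.0] * dim        # per-dimension sum of coordinates
        self.total_sq = [0.0] * dim     # per-dimension sum of squared coordinates

    def add(self, point):
        """Absorb one point; the raw point can be discarded afterwards."""
        self.n += 1
        for i, x in enumerate(point):
            self.total[i] += x
            self.total_sq[i] += x * x

    def centroid(self):
        """Per-dimension mean, recovered from the stored sums."""
        return [s / self.n for s in self.total]

    def variance(self):
        """Per-dimension population variance: E[x^2] - (E[x])^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.total, self.total_sq)]
```

Because only these three quantities are stored, absorbing a point costs O(d) time and memory does not grow with the number of points assigned to the cluster.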

Each data point is then assigned to one of these sets on the basis of its local Mahalanobis distance from the center of each cluster with respect to its sample covariance matrix.
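
As a rough illustration of that assignment step, the sketch below computes the Mahalanobis distance under the simplifying assumption that each cluster's covariance matrix is diagonal (one variance per dimension), which is how BFR is commonly presented. The function names and the `threshold` parameter are hypothetical, and the code is plain Python rather than anything from TMVA.

```python
import math


def mahalanobis(point, centroid, variance):
    """Mahalanobis distance assuming a diagonal covariance matrix.

    Each variance entry is assumed to be strictly positive.
    """
    return math.sqrt(sum((x - c) ** 2 / v
                         for x, c, v in zip(point, centroid, variance)))


def assign(point, clusters, threshold):
    """Return the index of the nearest cluster, or None if none is close enough.

    `clusters` is a list of (centroid, variance) pairs. A result of None means
    the point is not absorbed into any cluster and would stay in the buffer.
    """
    best_index, best_dist = None, float("inf")
    for index, (centroid, variance) in enumerate(clusters):
        dist = mahalanobis(point, centroid, variance)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index if best_dist <= threshold else None
```

The choice of `threshold` is left open here; it would have to be tuned for the data at hand.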

It would be helpful if I could receive some direction as to how I could get started with this using the TMVA functions.

kialbert · January 27, 2020, 3:38pm

Hi,

We’re happy to hear that you are interested in extending TMVA! The first order of business would be to figure out how to fit this into the TMVA model (maybe @moneta could provide some insight?). It could be a little problematic to incorporate into TMVA (maybe it’s possible to implement it as a regression task?).

For incorporation into TMVA you need to implement a new MethodNameGoesHere (e.g. MethodBFRCluster).

The framework provides access to the training and test sets as specified in the DataLoader. With these you can split and format the data as you wish, e.g. into three internal sets.

You can have a look at MethodCrossValidation for a small-ish implementation of a Method.

Cheers,
Kim

PRISHIta123 · January 27, 2020, 4:33pm

Thanks for the reply @kialbert. I think the BFR clustering method would be useful, even though k-means clustering is already available, because it can be faster on large datasets. It would be very helpful if you could put me in touch with @moneta to discuss the steps involved in the implementation. Also, thanks a lot for pointing out the TMVA functions that I could use for this task.