# Algorithm for Machine Learning

Good morning, I hope you can help me because I’m just starting to study machine learning. I simulated two samples, A and B. I’m interested in sample A and I want to reject sample B. I plotted sample A and sample B as a function of distance from a point, and I see that from distance 0 up to 1 km there is more A than B, while beyond 1 km there is more B than A. I want to write an algorithm based on this observation. How should I do it? Thank you.

Hi,
if that’s all the information you have about your samples, then your best algorithm is one that classifies everything as A before 1km and everything as B after 1km, and you don’t need machine learning to find that algorithm.
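As a minimal sketch of such a fixed cut (the function name and the example distances are made up for illustration; only the 1 km threshold comes from the question):

```python
# A fixed distance cut: no training involved.
# THRESHOLD_KM is the 1 km value read off the plot in the question;
# classify_by_distance and the example distances are hypothetical.
THRESHOLD_KM = 1.0


def classify_by_distance(distance_km, threshold_km=THRESHOLD_KM):
    """Return "A" for distances below the threshold, "B" otherwise."""
    return "A" if distance_km < threshold_km else "B"


distances_km = [0.2, 0.8, 1.0, 1.5, 3.0]
labels = [classify_by_distance(d) for d in distances_km]
print(labels)  # ['A', 'A', 'B', 'B', 'B']
```

Whether a sample at exactly 1 km counts as A or B is an arbitrary choice here; the sketch assigns it to B.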

A typical machine learning problem that sounds similar to what you describe is to distinguish between two types of stochastic variables, A and B, where the probability distributions of A and B, or the parameters of those distributions, are unknown. For low-dimensional distributions, or for multi-dimensional stochastic variables with little correlation between dimensions, you might want to train a boosted decision tree (BDT) classifier, a system that learns to “distinguish” samples coming from probability distribution A or B. For complex, high-dimensional, highly correlated data a neural network is usually the go-to algorithm (it’s another kind of trainable classifier). These are supervised learning algorithms, i.e. they require that you show the algorithm many samples of both the A and B distributions and tell it which is which, so it can “learn” to classify correctly.
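To make the supervised case concrete, here is a rough sketch using scikit-learn's `GradientBoostingClassifier` as the BDT. The toy data (two Gaussian blobs with two features each) is an assumption made up for illustration and stands in for real simulated samples with known labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for labelled simulation output (hypothetical numbers):
# two features per sample, class A (label 0) and class B (label 1).
n = 500
features_a = rng.normal(loc=[0.5, 1.0], scale=0.4, size=(n, 2))
features_b = rng.normal(loc=[1.5, 2.0], scale=0.6, size=(n, 2))
X = np.vstack([features_a, features_b])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Hold out part of the data to check the classifier on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Boosted decision trees: the labels y_train are what makes this supervised.
bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
bdt.fit(X_train, y_train)

accuracy = bdt.score(X_test, y_test)
prob_b = bdt.predict_proba(X_test)[:, 1]  # estimated probability of class B
print(f"test accuracy: {accuracy:.2f}")
```

Cutting on `prob_b` rather than on the hard prediction lets you trade efficiency for A against rejection of B.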

Another option is to fit parametric probability distributions to the data and then use this knowledge to draw conclusions about the shape of distributions A and B and the probability that a sample belongs to A or B (you can look e.g. at Gaussian mixture models as a simple such system, which assumes a multivariate Gaussian distribution for A and B). This is an unsupervised learning algorithm: it does not require that you tell it whether a sample comes from A or B.
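A small sketch of the mixture-model idea with scikit-learn's `GaussianMixture`, assuming a single made-up feature drawn from two Gaussians (the numbers are illustrative, not taken from the samples in the question):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Unlabelled toy data (hypothetical): one feature, two overlapping populations.
population_1 = rng.normal(loc=0.5, scale=0.2, size=500)
population_2 = rng.normal(loc=2.0, scale=0.5, size=500)
X = np.concatenate([population_1, population_2]).reshape(-1, 1)

# Fit a two-component Gaussian mixture; no labels are passed to fit().
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)

means = np.sort(gmm.means_.ravel())        # fitted component means
responsibilities = gmm.predict_proba(X)    # per-sample component probabilities
print("fitted means:", means)
```

The fit does not know which component corresponds to A and which to B; that association has to be made afterwards, for example by comparing the fitted means with what the simulation predicts.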

In any case I’m not sure the ROOT forum is the best place for this kind of question: a dedicated ML forum or mailing list would be full of experts on the topic.

Hope this helps,
Enrico

Hi, I have Sample A and Sample B.
For each sample I have 4 types of particles: particle 1, particle 2, particle 3 and particle 4.

The information I have (from simulation) is:

1. In Sample B, particles are distributed on an area larger than in Sample A
2. For distances larger than 1 km, particle densities in Sample B are higher than particle densities in Sample A
3. Particle 3 and 4 densities in Sample B are ALWAYS higher than in sample A
4. Particle 1 and 2 densities in Sample A are higher than particle 1 and 2 densities in Sample B up to a distance “d”, and lower beyond this distance
5. The number of particle 3 is lower than the number of particle 2 in Sample A, but the number of particle 3 is similar to the number of particle 3 in Sample B

Do you know how to write an algorithm (and, if possible, some simple ML code)?

Hi,
I’m still not sure whether you want to do classification or clustering.
In general and as a first step you should not write an algorithm yourself but you can use a library like scikit-learn which implements many common algorithms and try them out on your dataset.
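For instance, trying a few scikit-learn classifiers on the same labelled dataset could look roughly like this. The four toy features are a made-up stand-in (you could think of them as the densities of the four particle types), and the choice of the three classifiers is just an illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Hypothetical labelled dataset: 4 features per sample, classes 0 (A) and 1 (B).
n = 300
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n, 4)),
    rng.normal(loc=1.0, scale=1.0, size=(n, 4)),
])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=15),
    "boosted decision trees": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate on the same data.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```

Comparing cross-validated scores like this gives a first impression of which family of algorithms suits the data before investing in tuning any of them.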

The kind of information you need to decide which algorithm to use is not necessarily the specifics of how the data is distributed, but “meta” information like the kind I mentioned in my answer: do you know what kind of probability distribution would fit the data? Do you have labels for each data point? What is the dimensionality of the data? What would the input to your algorithm look like? 10 fairly uncorrelated real numbers? An image? 1000 highly correlated real numbers? Et cetera.
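Some of these questions can be answered by a quick look at the data before choosing anything. A rough sketch with NumPy, on a made-up feature matrix `X` and label array `y` (both hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dataset: 1000 samples, 3 real-valued features.
n_samples = 1000
base = rng.normal(size=n_samples)
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=n_samples),  # strongly correlated with column 0
    rng.normal(size=n_samples),               # independent of the others
])
y = rng.integers(0, 2, size=n_samples)        # one label per sample (0 = A, 1 = B)

# Dimensionality: number of samples and number of features per sample.
n_rows, n_features = X.shape

# Do we have one label per data point? (Needed for supervised learning.)
has_labels = y is not None and len(y) == n_rows

# Pairwise linear correlation between features (columns of X).
corr = np.corrcoef(X, rowvar=False)

print(f"{n_rows} samples, {n_features} features, labels: {has_labels}")
print(np.round(corr, 2))
```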