I was just wondering whether work is being done to enable multiclass capability for more methods, in particular Fisher Discriminant Analysis. From this tutorial it looks like only a handful of the methods have multiclass support. Are these the only methods with multiclass support, or is there somewhere a list of the methods that do (or don't)?

Methods involving linear combinations of variables, such as FDA, are appealing in that what they do appears more transparent than methods involving hidden layers. From here it looks like multiclass FDA exists.

Hi,
I am not aware of a full list of the methods that support multi-class classification. Maybe @kialbert, who recently worked on improving the multiclass support in TMVA, knows this better.
However, if a method does not support multi-class, it should give you a warning message when you book it.
The Fisher discriminant, for example, does not support multi-class. As you pointed out in the reference you provided, it could be extended to support it. This is perhaps something that external users could contribute; if you would like to try doing it and create a PR for it, we would be grateful.

So in principle, multiclass Fisher/Linear Discriminant Analysis (LDA) should be achievable with Function Discriminant Analysis (FDA). However, when I try simulating an LDA with FDA as described on page 89 of the TMVA User's Guide, I appear to get NaN coefficients. Because of this I have fallen back to a one-versus-rest approach with binary LDA classifiers, as done in the attached macro.

One thing I am uncertain of, though, is whether the option "NormMode" should be set to "None" to match the behavior of the Multiclass method, or whether weights should be set explicitly for the class samples, especially if the class samples vary in size. Could you or @kialbert comment on this? The underlying question is whether Multiclass does some sort of balancing of the samples added from the different class trees that isn't done in TMVA binary classification.

To exactly replicate the default weighting of the multiclass version, you can use NormMode=None and manually scale each input tree by n_0/n_i, where n_0 is the number of events in the first class and n_i is the number of events in the current class/tree (assuming there is one class per input tree).

Just to clarify from your response: are any of these equivalent to the default? Assume chain_0 is the chain for the one signal class with n_0 events and every other chain_i is a background class with n_i events:

(1 & 2) With NormMode=None or NormMode=EqualNumEvents:

(1) and (2) are equivalent (and equivalent to TMVA NormMode=NumEvents).

(3), (4) and (5) are equivalent.

The difference between (1, 2) and (3, 4, 5) is that in the former the sum of the event weights for a class is equal to 1, while in the latter the average event weight is approximately equal to 1 (in the case of very unbalanced classes this does not hold).

Interesting! Not what I was expecting. In case (3), why do the weights (1/n_1, 1/n_2) have the same effect as the weights (n_0/n_1, n_0/n_2) in case (5)? The single signal tree of n_0 events is weighted the same in both cases.

And so the default multiclass behavior would be replicated in cases (1) and (2), yes?

So if I'm understanding correctly, x1, x2, x3, x4, x5 are supposed to correspond one-to-one with the cases (1, 2, 3, 4, 5) that I outlined above. Then is x4, corresponding to (4), the default behavior in multiclass? If that is correct, is it the right approach for very unbalanced classes, or is x1, corresponding to (1), more reasonable? How do they differ for training purposes with a Fisher discriminant analysis?

From our discussions I have updated the macro I shared earlier to reflect case (1).

If case (4) is what I should be using, I have to explicitly set the weights on the background samples to (n_0/n_1, n_0/n_2) and keep the signal sample weight at its default value of 1, while setting NormMode=None. Correct?

To my understanding they are functionally equivalent, but there could be numerical issues that a particular algorithm is sensitive to. For example, if the event weights are small and the algorithm needs to square them, the numbers might underflow.

I am not familiar enough with FDA/LDA to be able to say which of these applies here.