Fisher discriminant TMVA

Hello everybody!

I have a question which is a mixture of programming and statistics.

I have a program that generates events from two two-dimensional Gaussians: one is the signal, the other the background. I need an MVA technique to separate the two classes of events, and I have decided to use a Fisher discriminant with the TMVA package.

I think the implementation in my code is correct, but I include it here so it can be checked:

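    #include "TFile.h"
    #include "TMVA/DataLoader.h"
    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    // Output file for the TMVA evaluation results
    TFile f("fisher.root", "RECREATE");
    TMVA::Factory* factory = new TMVA::Factory("TMVAanalysis", &f, "");
    TMVA::DataLoader* dataloader = new TMVA::DataLoader("data");

    // sgl and bkg are the signal and background TNtuples
    dataloader->AddSignalTree(sgl);
    dataloader->AddBackgroundTree(bkg);

    // Input variables, read as floats
    dataloader->AddVariable("x", 'F');
    dataloader->AddVariable("y", 'F');

    // Book a Fisher discriminant with default options, then train, test and evaluate
    factory->BookMethod(dataloader, TMVA::Types::kFisher, "Fisher", "");
    factory->TrainAllMethods();
    factory->TestAllMethods();
    factory->EvaluateAllMethods();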

sgl and bkg are TNtuples that contain the generated x and y values for signal and background, respectively.

I get the following output:

<HEADER> DataSetInfo              : [data] : Added class "Signal"
                         : Add Tree sgl of type Signal with 50000 events
<HEADER> DataSetInfo              : [data] : Added class "Background"
                         : Add Tree bkg of type Background with 50000 events
<HEADER> Factory                  : Booking method: Fisher
                         : 
<HEADER> Factory                  : Train all methods
<HEADER> DataSetFactory           : [data] : Number of events in input trees
                         : 
                         : 
                         : Dataset[data] : Weight renormalisation mode: "EqualNumEvents": renormalises all event classes ...
                         : Dataset[data] :  such that the effective (weighted) number of events in each class is the same 
                         : Dataset[data] :  (and equals the number of events (entries) given for class=0 )
                         : Dataset[data] : ... i.e. such that Sum[i=1..N_j]{w_i} = N_classA, j=classA, classB, ...
                         : Dataset[data] : ... (note that N_j is the sum of TRAINING events
                         : Dataset[data] :  ..... Testing events are not renormalised nor included in the renormalisation factor!)
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 25000
                         : Signal     -- testing events             : 25000
                         : Signal     -- training and testing events: 50000
                         : Background -- training events            : 25000
                         : Background -- testing events             : 25000
                         : Background -- training and testing events: 50000
                         : 
<HEADER> DataSetInfo              : Correlation matrix (Signal):
                         : ------------------------
                         :                x       y
                         :       x:  +1.000  +0.496
                         :       y:  +0.496  +1.000
                         : ------------------------
<HEADER> DataSetInfo              : Correlation matrix (Background):
                         : ------------------------
                         :                x       y
                         :       x:  +1.000  +0.398
                         :       y:  +0.398  +1.000
                         : ------------------------
<HEADER> DataSetFactory           : [data] :  
                         : 
<HEADER> Factory                  : [data] : Create Transformation "I" with events from all classes.
                         : 
<HEADER>                          : Transformation, Variable selection : 
                         : Input : variable 'x' <---> Output : variable 'x'
                         : Input : variable 'y' <---> Output : variable 'y'
<HEADER> TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :        x:     1.9871     2.1206   [   -0.99690     6.9825 ]
                         :        y:     1.9934     2.1265   [   -0.99390     6.9700 ]
                         : -----------------------------------------------------------
                         : Ranking input variables (method unspecific)...
<HEADER> IdTransformation         : Ranking result (top variable is best ranked)
                         : --------------------------
                         : Rank : Variable  : Separation
                         : --------------------------
                         :    1 : x         : 9.971e-01
                         :    2 : y         : 9.967e-01
                         : --------------------------
<HEADER> Factory                  : Train method: Fisher for Classification
                         : 
<HEADER> Fisher                   : Results for Fisher coefficients:
                         : -----------------------
                         : Variable:  Coefficient:
                         : -----------------------
                         :        x:       -1.316
                         :        y:       -1.326
                         : (offset):       +5.258
                         : -----------------------
                         : Elapsed time for training with 50000 events: 0.0171 sec         
<HEADER> Fisher                   : [data] : Evaluation of Fisher on training sample (50000 events)
                         : Elapsed time for evaluation of 50000 events: 0.00868 sec       
                         : Creating xml weight file: data/weights/TMVAanalysis_Fisher.weights.xml
                         : Creating standalone class: data/weights/TMVAanalysis_Fisher.class.C
<HEADER> Factory                  : Training finished
                         : 
                         : Ranking input variables (method specific)...
<HEADER> Fisher                   : Ranking result (top variable is best ranked)
                         : ----------------------------
                         : Rank : Variable  : Discr. power
                         : ----------------------------
                         :    1 : y         : 7.878e-01
                         :    2 : x         : 7.867e-01
                         : ----------------------------
<HEADER> Factory                  : === Destroy and recreate all methods via weight files for testing ===
                         : 
                         : Reading weight file: data/weights/TMVAanalysis_Fisher.weights.xml
<HEADER> Factory                  : Test all methods
<HEADER> Factory                  : Test method: Fisher for Classification performance
                         : 
<HEADER> Fisher                   : [data] : Evaluation of Fisher on testing sample (50000 events)
                         : Elapsed time for evaluation of 50000 events: 0.00865 sec       
<HEADER> Factory                  : Evaluate all methods
<HEADER> Factory                  : Evaluate classifier: Fisher
                         : 
<HEADER> Fisher                   : [data] : Loop over test events and fill histograms with classifier response...
                         : 
<HEADER> TFHandler_Fisher         : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :        x:     1.9936     2.1256   [   -0.99962     6.9937 ]
                         :        y:     1.9981     2.1285   [   -0.99477     6.9724 ]
                         : -----------------------------------------------------------
                         : 
                         : Evaluation results ranked by best signal efficiency and purity (area)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet       MVA                       
                         : Name:         Method:          ROC-integ
                         : data          Fisher         : 1.000
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
                         : Testing efficiency compared to training efficiency (overtraining check)
                         : -------------------------------------------------------------------------------------------------------------------
                         : DataSet              MVA              Signal efficiency: from test sample (from training sample) 
                         : Name:                Method:          @B=0.01             @B=0.10            @B=0.30   
                         : -------------------------------------------------------------------------------------------------------------------
                         : data                 Fisher         : 1.000 (1.000)       1.000 (1.000)      1.000 (1.000)
                         : -------------------------------------------------------------------------------------------------------------------
                         : 
<HEADER> Dataset:data             : Created tree 'TestTree' with 50000 events
                         : 
<HEADER> Dataset:data             : Created tree 'TrainTree' with 50000 events
                         : 
<HEADER> Factory                  : Thank you for using TMVA!
                         : For citation information, please visit: http://tmva.sf.net/citeTMVA.html

Again I think there are no mistakes.
From this output I get the Fisher coefficients, which give the plane defining the axis that maximizes the separation: z = 5.258 - 1.316x - 1.326y.
To obtain the axis in two dimensions I took the intersection with the plane z=0, which gives y = 3.97 - (0.99*x). Is that right?
I attach the plot here:
es_7.pdf (636.8 KB)
From this plot, I think it is quite clear that the axis is not the right one.
Where does my reasoning fail? Thanks in advance!
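
As a quick arithmetic check of the line quoted in the question, here is a minimal sketch in plain C++ (no ROOT needed) that solves z = 0 for y from the three coefficients in the log. The struct and function names are invented for illustration and are not part of TMVA. One geometric fact that may be relevant when reading the plot: the coefficient vector (-1.316, -1.326) is perpendicular to the z = 0 line, so that line is where the discriminant changes sign, while z itself varies along the coefficient vector.

```cpp
#include <cassert>

// Linear discriminant z = offset + wx*x + wy*y.
// The numbers are meant to be copied by hand from the TMVA log.
struct FisherCoeffs {
    double wx;
    double wy;
    double offset;
};

// A line in the (x, y) plane written as y = intercept + slope * x.
struct Line {
    double slope;
    double intercept;
};

// Value of the discriminant at a point.
double evaluate(const FisherCoeffs& c, double x, double y) {
    return c.offset + c.wx * x + c.wy * y;
}

// Solve z = 0 for y. Requires wy != 0.
Line zeroContour(const FisherCoeffs& c) {
    assert(c.wy != 0.0);
    return Line{-c.wx / c.wy, -c.offset / c.wy};
}

// Slope dy/dx of the coefficient vector (wx, wy), which is
// perpendicular to the z = 0 line. Requires wx != 0.
double normalSlope(const FisherCoeffs& c) {
    assert(c.wx != 0.0);
    return c.wy / c.wx;
}
```

Calling zeroContour on FisherCoeffs{-1.316, -1.326, 5.258} gives a slope of about -0.992 and an intercept of about 3.965, which agrees with y = 3.97 - (0.99*x); normalSlope gives about 1.008 for the perpendicular direction.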

Hi,

to my understanding your interpretation is correct. Could you try running it with a different seed and see whether the results differ?

Another thing to try would be a 1d example with a more pronounced overlap between the two distributions.

What I could imagine is that the algorithm gets confused in regions with low density. Since there is basically no overlap between the two distributions, many lines would be approximately equally good.

Cheers,
Kim

Hi kialbert,

thanks for your answer. I tried changing the seed, but nothing changes. I also tried moving the distributions closer together, but again nothing changes. Of course the line moves, but it is always too close to the red area and the slope remains the same.

A friend of mine wrote his own code to perform the Fisher analysis, and the result is perfect, so I do not think the data are the problem. Maybe something is wrong in my implementation with TMVA.
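
For comparison, a hand-written Fisher analysis in two dimensions can be sketched roughly as follows. This is not the friend's code (which is not shown in the thread) and not TMVA's implementation; it is the textbook form w = W^-1 (mean_S - mean_B), with W the summed within-class scatter matrix, and with the offset chosen so that z = 0 at the midpoint between the two class means. That offset convention, the Point and Fisher2D types, and all function names are assumptions made for illustration. The overall scale of the coefficients may differ from what TMVA prints, so only the direction and the z = 0 line should be compared.

```cpp
#include <cassert>
#include <vector>

struct Point {
    double x;
    double y;
};

// Discriminant z = offset + wx*x + wy*y.
struct Fisher2D {
    double wx;
    double wy;
    double offset;
};

// Sample mean of a set of points.
static Point meanOf(const std::vector<Point>& pts) {
    Point m{0.0, 0.0};
    for (const Point& p : pts) {
        m.x += p.x;
        m.y += p.y;
    }
    m.x /= static_cast<double>(pts.size());
    m.y /= static_cast<double>(pts.size());
    return m;
}

// Add sum_i (p_i - m)(p_i - m)^T to the symmetric matrix (sxx, sxy, syy).
static void addScatter(const std::vector<Point>& pts, const Point& m,
                       double& sxx, double& sxy, double& syy) {
    for (const Point& p : pts) {
        const double dx = p.x - m.x;
        const double dy = p.y - m.y;
        sxx += dx * dx;
        sxy += dx * dy;
        syy += dy * dy;
    }
}

Fisher2D trainFisher(const std::vector<Point>& sig, const std::vector<Point>& bkg) {
    assert(!sig.empty() && !bkg.empty());
    const Point ms = meanOf(sig);
    const Point mb = meanOf(bkg);

    // Within-class scatter matrix W = S_sig + S_bkg.
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    addScatter(sig, ms, sxx, sxy, syy);
    addScatter(bkg, mb, sxx, sxy, syy);

    const double det = sxx * syy - sxy * sxy;
    assert(det > 0.0);  // W must be invertible

    // w = W^-1 (ms - mb), using the closed-form inverse of a 2x2 matrix.
    const double dx = ms.x - mb.x;
    const double dy = ms.y - mb.y;
    const double wx = (syy * dx - sxy * dy) / det;
    const double wy = (-sxy * dx + sxx * dy) / det;

    // Assumed convention: z = 0 at the midpoint between the class means.
    const double offset = -0.5 * (wx * (ms.x + mb.x) + wy * (ms.y + mb.y));
    return Fisher2D{wx, wy, offset};
}
```

With this sign convention signal events tend to get z > 0 and background events z < 0. Feeding in the same x and y values stored in the sgl and bkg ntuples and comparing the ratio wx/wy with the ratio of the TMVA coefficients would show whether the two directions agree.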

Hi,

Thanks for the investigation. I'm a bit strapped for time right now, otherwise I would look into it more myself, so would it be possible for you to run a 1D example and report back the results (e.g. two 1D gaussians separated by some margin)?

Could you also report what coefficient values you expect? I.e. what coefficients did your friend arrive at?

Cheers,
Kim
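
The 1D cross-check suggested above could be set up without TMVA along these lines, to get a reference value before comparing against the TMVA output. This is only a sketch: the Gaussian parameters, the seeds, the Fisher1D struct and the function names are placeholders chosen for illustration, and the cut is placed at the midpoint of the two sample means, which is an assumed convention rather than something confirmed for TMVA. In one dimension the Fisher coefficient reduces to the difference of the means divided by the summed within-class scatter.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw n values from a 1D Gaussian. The seed is explicit so that
// a run can be repeated exactly or varied on purpose.
std::vector<double> drawGaussian(std::size_t n, double mean, double sigma, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(mean, sigma);
    std::vector<double> out(n);
    for (double& v : out) {
        v = gauss(rng);
    }
    return out;
}

// Discriminant z = offset + w*x, with z = 0 at x = cut.
struct Fisher1D {
    double w;
    double offset;
    double cut;
};

static double meanOf(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) {
        sum += x;
    }
    return sum / static_cast<double>(v.size());
}

// Sum of squared deviations from the given mean.
static double scatterOf(const std::vector<double>& v, double mean) {
    double sum = 0.0;
    for (double x : v) {
        sum += (x - mean) * (x - mean);
    }
    return sum;
}

Fisher1D trainFisher1D(const std::vector<double>& sig, const std::vector<double>& bkg) {
    assert(!sig.empty() && !bkg.empty());
    const double ms = meanOf(sig);
    const double mb = meanOf(bkg);
    const double within = scatterOf(sig, ms) + scatterOf(bkg, mb);
    assert(within > 0.0);

    const double w = (ms - mb) / within;  // 1D version of W^-1 (ms - mb)
    const double cut = 0.5 * (ms + mb);   // assumed: midpoint of the sample means
    return Fisher1D{w, -w * cut, cut};
}
```

For example, drawing 10000 signal values around 0 and 10000 background values around 4, both with unit width, should place the cut close to 2, and moving the two means closer together gives the overlapping case.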