TMVA - Signal/Background target responses inverted?

Hi,

After a while away from running TMVA, I am back looking at the new DNN MVA in ROOT 6.10.02. I have noticed what appears to be slightly odd behavior: in some of my trainings the target responses (1 or 0) for signal and background are inverted. By this I mean signal (background) is trained to give 0 (1), instead of the expected 1 (0).

In the end I think I have tracked this down to the fact that I use the following logic to fill my training and testing samples:

for ( /* some loop over data entries */ )
{
  if ( target )
  {
    if ( !useForTesting )
    {
      tmvaLoader->AddSignalTrainingEvent( InputDoubles, 1.0 );
    }
    else
    {
      tmvaLoader->AddSignalTestEvent( InputDoubles, 1.0 );
    }
  }
  else
  {
    if ( !useForTesting )
    {
      tmvaLoader->AddBackgroundTrainingEvent( InputDoubles, 1.0 );
    }
    else
    {
      tmvaLoader->AddBackgroundTestEvent( InputDoubles, 1.0 );
    }
  }
}

where ‘target’ is a boolean that indicates whether the data entry is signal or background, ‘useForTesting’ is another boolean that indicates whether the entry should be used for testing rather than training, and InputDoubles is an array holding all the input parameters for the given data entry.

tmvaLoader is an instance of TMVA::DataLoader.
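For context, the setup before the loop looks roughly like this (a simplified sketch; the variable names and loader name are placeholders, not my real inputs):

// Simplified sketch of the setup assumed by the snippet above.
TMVA::DataLoader* tmvaLoader = new TMVA::DataLoader( "dataset" );
tmvaLoader->AddVariable( "var1", 'D' ); // one AddVariable call per input
tmvaLoader->AddVariable( "var2", 'D' );
std::vector&lt;Double_t&gt; InputDoubles( 2 ); // refilled for each data entry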

The issue is, the order in which the above calls are first made is not always the same. It depends on the conditionals, and on whether the first data entry is declared to be signal or not. What I have found is that if the first entry is signal, so ‘AddSignalTrainingEvent’ is called first, then TMVA trains the network to give signal the expected response of 1, and background 0. However, if the first data entry is background, so ‘AddBackgroundTrainingEvent’ is called first, then the logic is for some reason inverted, and signal is trained to give a response of 0…

Note I have used the above logic many times in the past with previous ROOT versions (using the MLP classifier), so this issue is new with this ROOT version (6.10.02).

The use of TMVA::DataLoader is also new to me, so I am not clear whether the issue is related to this or to the use of the DNN classifier.

I have a workaround, which is just to make sure AddSignalTrainingEvent is called first (I skip entries until I get to the first signal training entry), and this seems to do the job. However, I am curious what people think about the above behavior. I doubt it is intentional, so it looks to me like a bug somewhere in TMVA, either in TMVA::DataLoader or perhaps specific to the DNN MVA?
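For concreteness, the workaround is roughly this (sketch only):

bool seenSignalTraining = false;
for ( /* some loop over data entries */ )
{
  // Skip entries until the first signal training entry, so that
  // AddSignalTrainingEvent is guaranteed to be the first Add* call.
  if ( !seenSignalTraining )
  {
    if ( target && !useForTesting ) { seenSignalTraining = true; }
    else                            { continue; }
  }
  // ... fill the training/testing samples exactly as in the snippet above ...
}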

cheers Chris

Hi Chris!

(I responded to you on the TMVA mailing list as well. Adding the answer here for posterity.)

This is a feature of the DataLoader: it creates the class indices dynamically, making the order in which classes are added important. This is to allow more than two classes and custom class names. One can check what index the signal class has by querying the DataSetInfo method GetSignalClassIndex; in your case this would be tmvaLoader->GetDataSetInfo().GetSignalClassIndex().

Another approach would be to add the classes first and ensure the expected order through tmvaLoader->GetDataSetInfo().AddClass("ClassName"). If you use this second approach the names must be "Background" and "Signal", since AddSignalTrainingEvent and friends expect these classes to exist.
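As a concrete sketch (the loader name "dataset" is just a placeholder here):

// Register the classes up front so their indices are fixed, independent
// of the order in which events are added later.
TMVA::DataLoader* tmvaLoader = new TMVA::DataLoader( "dataset" );
tmvaLoader->GetDataSetInfo().AddClass( "Background" ); // gets class index 0
tmvaLoader->GetDataSetInfo().AddClass( "Signal" );     // gets class index 1

// Sanity check: print the index assigned to the signal class.
std::cout << "Signal class index: "
          << tmvaLoader->GetDataSetInfo().GetSignalClassIndex()
          << std::endl;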

Hi Chris,

Can you provide a minimal example ROOT macro to reproduce it?

Is the problem only with the DNN, or have you tried with a BDT, for example, and found that it works?

Cheers,
Omar.

Appending a parallel mail conversation to merge the discussion and continue it here.

Hi Chris!

> Thanks, I wondered if it was a ‘feature’, but I failed to find any good documentation explaining the new loader class. Is there anything I can read explaining how to go about using it? I’ve found the doxygen docs, but that’s not really what I am looking for.

There is supposed to be information about the dataloader in the User’s Guide, but it has unfortunately not been updated with this yet. It is a simple transformation, however: the methods that dealt with loading and preparing input were moved to a separate class; it should work the same as the factory did before.

Digging a little further into this, I see now that this behaviour has been in TMVA since Jun 22, 2009, and I realise that it is possibly a bug in the DNN. Could you check whether the output of the MLP is as you expect? I will look into a proper fix. For now I can only offer the workarounds previously discussed :/.
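For the comparison, booking an MLP next to the DNN on the same factory and loader should be enough; something along these lines (the option strings here are only illustrative):

// Book both methods so the two output distributions can be compared
// directly in the TMVA GUI.
factory->BookMethod( tmvaLoader, TMVA::Types::kMLP, "MLP",
                     "H:!V:NeuronType=tanh:NCycles=500:HiddenLayers=N+5" );
factory->BookMethod( tmvaLoader, TMVA::Types::kDNN, "DNN",
                     "H:!V:Layout=TANH|64,TANH|64,LINEAR" );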

> Specifically, on your suggestion below, it’s not clear to me how I go about ‘ensuring the expected order’ as you describe. Is it just a case of adding the Signal first, then the background, with ‘tmvaLoader->GetDataSetInfo().AddClass("ClassName")’?

I think you want

tmvaLoader->GetDataSetInfo().AddClass( "Background" ); // adds class 0
tmvaLoader->GetDataSetInfo().AddClass( "Signal" );     // adds class 1

to get the expected output signal (background) => 1 (0).

Thanks for reporting this to us!

Cheers,
Kim

I cannot really easily provide a minimal example, as my application is a stand-alone C++ executable built against TMVA, and does quite a bit more than just training. For completeness, it is

https://gitlab.cern.ch/jonrob/ChargedProtoANNPIDTeacher/blob/master/Rec/ChargedProtoANNPIDTeacher/src/teacher/teacher.cpp

I have not tested BDTs; I tend not to use them and favour MLPs instead. However, I have not seen the issue with the MLP, only the DNN.

Once again, thanks for reporting this, Chris. I think we have all we need now to properly look into it!

Here is a minimal reproducer. Notice in the TMVAGui "Classifier output distribution" that the DNN has the Signal class close to 0, whereas it is expected to be clustered around 1. This is because of the order of addition of the Signal and Background trees. If you flip them around, the output is as expected.

TMVAClassification.C (4.8 KB)
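The relevant part is just the registration order; schematically (the tree names are illustrative):

// The first tree registered defines class index 0, the second index 1.
// The DNN output appears to follow the class index rather than the
// Signal/Background semantics, so this order decides whether the signal
// response clusters near 0 or near 1.
dataloader->AddSignalTree    ( signalTree,     1.0 );
dataloader->AddBackgroundTree( backgroundTree, 1.0 );
// Swapping these two calls flips the class indices, and with them the output.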

Hi Chris,
then there appears to be a problem in the DNN output.
I will talk with Simon; maybe he can help us with this issue.

Cheers
Omar.

Hi,

I see this problem was reported in 2017, but I am experiencing a similar problem with a signal/background swap in the PyKeras NN score.

I see that the indices of the background and signal samples are OK from this printout:

DataSetInfo              : [dataset_pymva_PrFakes] : Added class "Background"
DataSetInfo              : [dataset_pymva_PrFakes] : Added class "Signal"
                         : Add Tree skimTree of type Signal with 188484 events
                         : Add Tree skimTree of type Background with 188483 events
                         : Dataset[dataset_pymva_PrFakes] : Class index : 0  name : Background
                         : Dataset[dataset_pymva_PrFakes] : Class index : 1  name : Signal

If I use only 3 variables for the training, the NN score is normal: ~0 for background events and ~1 for signal.
If I add more variables (~10), then the score swaps: ~0 for signal and ~1 for background events.

I train two models: ROOT.TMVA.Types.kPyKeras and ROOT.TMVA.Types.kBDT.
The problem of the swapped score only appears for kPyKeras when I have ~10 variables.
I am attaching an image of the NN and BDT scores.


Could you advise how to address this issue, please?

With best wishes,
Olena