TMultiLayerPerceptron exits with Stop

Hi all,

I have been using TMultiLayerPerceptron for a while and never had any problems. I train over several epochs with a small eta, save the network (with DumpWeights()) and continue training. In the past everything worked fine. But now, when I run the training (using ROOT v5-22-00), I sometimes get this error message:

"Error in TMultiLayerPerceptron::TMultiLayerPerceptron::Train(): Stop.
Epoch: 100 learn=nan test=nan
Training done."

And then it continues with:

"Info in TMultiLayerPerceptron::Train: Using 17866 train and 8932 test entries.
Training the Neural Network
Warning in TCanvas::Constructor: Deleting canvas with same name: NNtraining
Error in TMultiLayerPerceptron::TMultiLayerPerceptron::Train(): Line search fail
Epoch: 100 learn=0.832555 test=0.832555
Training done."

As far as I can follow the code, the first error message is printed when TMultiLayerPerceptron::GetError() returns NaN. How can this happen? Can somebody help me understand it? I am using the kStochastic learning method.

Best regards and thank you for your help,
Alex

PS:
Here is the piece of code I am using:

TMultiLayerPerceptron *mNet;
mNet = new TMultiLayerPerceptron("fdEdxBin[0],fdEdxBin[1],fdEdxBin[2],fdEdxBin[3],fdEdxBin[4],fdEdxBin[5],fdEdxBin[6],fdEdxBin[7],fdEdxBin[8],fdEdxBin[9]:15:7:pid[0],pid[1]!", tIn, fTrain, fTest);

Bool_t bFirstLoop = 0;

for(Int_t iEpoch = 0; iEpoch < nEpochs; iEpoch++){

if(bFirstLoop == 1){
  mNet -> SetLearningMethod(TMultiLayerPerceptron::kStochastic);
  mNet -> TMultiLayerPerceptron::SetEta(0.001);
  mNet -> Train(100,"text update=10, graph");                     

  bFirstLoop = 0;
}
else{
  mNet -> Train(100,"text update=10, graph +");
}

mNet -> DumpWeights(Form("%s/Net_%d", $PWD, iEpoch));

}

Hi,

It happens when some input is NaN.
One variable of one entry in your tree is enough… so you should define your input training and test samples to avoid it.

Hi,

thank you for your answer. I would understand that if it happened right at the start of the training, but it happens after several epochs. In one case the first 99 epochs worked fine, but in epoch 100 the error occurred. If one variable in one entry were NaN, shouldn't the error appear right at the beginning and not after several epochs?

Alex

Indeed… and in addition, it looks like the input is by now protected against such cases.

I cannot tell where the NaN comes from without a (non-)working example… it could arise in several places (none of which is normal or expected). In any case, it points to a problem with the particular network being trained: a weird input distribution (like a discrete variable), insensitivity of the network, etc.

The protections already in place ensure that the best result is extracted. When training does not converge, there is almost always something to change in the formulation of the network.

Try to dump (and inspect) the weights after the training stops… maybe the network converged to some bad weights.

Ok, thank you. I found the error. The problem was that the flag for the first training loop was not set correctly. As a result, the training method was not the one I wanted to use, and the network was apparently not correctly initialized: the synapses connecting the hidden layers to each other and to the output layer were never randomized. With the kStochastic training method everything works fine.