TMVA BDT Training seems to fail

I am doing a multi-class training of a BDT. In ROOT v5 this was quite robust, but in v6 I’m having a number of problems. In particular, this one: the training sometimes seems to fail silently. Unfortunately, there are no error messages I can spot in the log, but I know it is broken from the suspiciously clean values for signal and background efficiency at the bottom, and also from the fact that the output plots in the training file are empty (the events all end up in the high-side overflow bin).

Can someone help me? Many thanks in advance! I’ve attached two files: one is the ROOT file that comes out of the training, and the other is the terminal scrape from running this code.

Options I used:
TMVA Factory: !V:DrawProgressBar=True:!Silent
BDT: MaxDepth=50:MinNodeSize=0.1:nCuts=200:BoostType=Grad:NegWeightTreatment=IgnoreNegWeightsInTraining:DoBoostMonitor=True
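
For reference, the booking looks roughly like the sketch below. This is just an illustrative outline, not my actual macro: it assumes the ROOT 6 TMVA::DataLoader API, and the file, tree, variable, and class names are placeholders; only the option strings above are carried over (plus AnalysisType=multiclass, which the multi-class training needs).

```cpp
// Minimal multi-class booking sketch (ROOT >= 6.08 TMVA API assumed).
// File, tree, variable, and class names are placeholders.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/DataLoader.h"
#include "TMVA/Types.h"

void training()
{
   TFile *outFile = TFile::Open("TMVAMulticlass.root", "RECREATE");

   // Factory options quoted above, plus the multi-class analysis type.
   TMVA::Factory factory("TMVAMulticlass", outFile,
                         "!V:DrawProgressBar=True:!Silent:AnalysisType=multiclass");

   TMVA::DataLoader loader("dataset");
   loader.AddVariable("var1", 'F');                       // placeholder variables
   loader.AddVariable("var2", 'F');

   TFile *inFile = TFile::Open("input.root");             // placeholder input
   loader.AddTree((TTree*)inFile->Get("sigTree"), "Signal");      // class names
   loader.AddTree((TTree*)inFile->Get("bibTree"), "bib15");       // are guesses
   loader.AddTree((TTree*)inFile->Get("bkgTree"), "Background");
   loader.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents:!V");

   // BDT options quoted above.
   factory.BookMethod(&loader, TMVA::Types::kBDT, "BDTG",
      "MaxDepth=50:MinNodeSize=0.1:nCuts=200:BoostType=Grad:"
      "NegWeightTreatment=IgnoreNegWeightsInTraining:DoBoostMonitor=True");

   factory.TrainAllMethods();
   factory.TestAllMethods();
   factory.EvaluateAllMethods();
   outFile->Close();
}
```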

The monitor ntuple in the resulting file is empty. Also, there is no printout of the progress, which seems suspicious.

Cheers,

Gordon.

The ROOT training file

bdt/TMVA printout logfile (33.7 KB)

Hi Gordon,

Would it also be possible to send a macro file to run the training (or a training that reproduces the results)?

The NaNs in the variable importance hint to me that I need to look at what happens during the actual training to get insight into the problem.

Cheers,
Kim

Many thanks for the fast reply! Yes, I can upload that later this evening (West Coast USA time).

I’ve been exploring this more. It looks like the main driver of this problem is the tree depth (MaxDepth) parameter. If I decrease it to 6, the training completes as expected; at 7 it fails silently as described above. The MinNodeSize doesn’t seem to have any bearing on this.

Interestingly, there is anecdotal evidence that the behavior is the same when I train on much larger datasets: large tree depths fail, while small ones are just fine.

Perhaps that gives a better indication of where the bug might be. :slight_smile:
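
One way to narrow this down further (a sketch only, reusing the factory and loader names from the booking sketch above, not something I have actually run) is to book the same BDT at several depths in a single Factory run:

```cpp
// Hypothetical depth scan: book the same BDT with several MaxDepth values
// so one training run shows where the silent failure starts.
for (int depth : {5, 6, 7, 8}) {
   factory.BookMethod(&loader, TMVA::Types::kBDT, Form("BDTG_depth%d", depth),
      Form("MaxDepth=%d:MinNodeSize=0.1:nCuts=200:BoostType=Grad:"
           "NegWeightTreatment=IgnoreNegWeightsInTraining:DoBoostMonitor=True",
           depth));
}
```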

BTW, is there an option I can turn on that will give me more of a printout than what I’ve attached, i.e. a useful verbose mode?

Great, thanks! And great work with the continued investigation, very helpful!

I’m not sure what you mean by “more of a printout”; could you elaborate on that?

Cheers,
Kim

Here is the file. Ah, I can’t upload it: it is 5 MB and the forum has a 3 MB limit. I’ve sent it to your CERN email address. Please let me know if that didn’t work.

To run it, set up ROOT and then do root -l -b -q training.C from the unpacked directory.

By “more of a printout” I simply meant a verbose mode, so that I could see more of what is going on during the training.

Thanks! I received the file :slight_smile:

Thanks! If there is anything I can do to help with the debugging, please let me know. I see this same problem with the small dataset in that example, and also with much larger datasets (where training can take 10 hours on my machine).

I’ve done some more testing. One thing I said earlier turns out to be wrong: this also happens with ROOT v5.34/32. For the smaller datasets I was able to run them in the 32-bit version, and they show exactly the same behavior.

So there is something about the data I’m feeding it in the bib15 files that is causing this silent training failure. I still don’t know what it is, of course, but perhaps you can see something in the training output. Whatever happens, the BDT produces infinity for all three multi-class training outputs.
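
For reference, a quick way to spot the infinities is to scan the test tree in the training output for non-finite responses, along the lines of the sketch below; the tree path and branch names are assumptions about where the multi-class responses live, so they would need adjusting to whatever the file actually contains.

```cpp
// Sketch: count test-tree entries with non-finite multi-class BDT responses.
// The tree path and branch names below are assumptions; adjust to your file.
#include "TFile.h"
#include "TTree.h"
#include "TMath.h"
#include <iostream>

void checkOutputs()
{
   TFile *f = TFile::Open("TMVAMulticlass.root");
   TTree *t = (TTree*)f->Get("dataset/TestTree");   // assumed location
   float r0 = 0, r1 = 0, r2 = 0;
   t->SetBranchAddress("BDTG.Signal",     &r0);     // assumed branch names,
   t->SetBranchAddress("BDTG.bib15",      &r1);     // one per output class
   t->SetBranchAddress("BDTG.Background", &r2);
   Long64_t bad = 0;
   for (Long64_t i = 0; i < t->GetEntries(); ++i) {
      t->GetEntry(i);
      if (!TMath::Finite(r0) || !TMath::Finite(r1) || !TMath::Finite(r2)) ++bad;
   }
   std::cout << bad << " of " << t->GetEntries()
             << " test entries have non-finite BDT outputs" << std::endl;
}
```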

So far I’ve verified what you have reported, using a recent development version of ROOT. I can run the training with 100 trees and depth 50 and get convergence, whilst at 800 trees infinities/NaNs are output.

So, as you say, it seems to be a data/training issue where there is a very large gradient or similar. I’ll look into verbose training output next but can’t say for sure when I’ll have the time.

For the time being I would suggest reducing the number of trees or the depth in your classifier.
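
For example, something along these lines (values purely illustrative; the rest of the option string is yours from above):

```cpp
// Illustrative only: the same booking as before, with fewer, shallower trees.
factory.BookMethod(&loader, TMVA::Types::kBDT, "BDTG_reduced",
   "NTrees=400:MaxDepth=6:MinNodeSize=0.1:nCuts=200:BoostType=Grad:"
   "NegWeightTreatment=IgnoreNegWeightsInTraining:DoBoostMonitor=True");
```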

Thank you so much for looking at this.

This makes sense: it is trying to re-weight events, and perhaps the misclassified events have their weights turned up higher and higher as it works its way through the 800 trees. I wonder if that means some of the class data looks identical? And I wonder what is unique about this data that causes the runaway?
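
To make the runaway picture concrete (this is purely my guess at the mechanism, not something established above): if the per-class score accumulated over the trees keeps growing, then exponentiating it for a multi-class response overflows to infinity in double precision once the score passes about 709. A toy illustration with invented numbers:

```cpp
// Toy illustration only: an accumulated class score that keeps growing
// overflows once exponentiated; exp(x) is infinite for x > ~709 in doubles.
// The per-tree score below is invented, not measured from the training.
#include <cmath>
#include <cstdio>

int main()
{
   const double scorePerTree = 1.0;          // hypothetical average response
   for (int nTrees : {100, 709, 710, 800}) {
      double F = scorePerTree * nTrees;      // accumulated class score
      std::printf("nTrees = %d  exp(F) = %g\n", nTrees, std::exp(F));
   }
   return 0;
}
```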

I will try to reduce the number of trees and see how that affects the performance.

How do you turn on verbose training output?

Armed with your information I did a few tests. It looks like if I limit the number of trees I train on to 700, everything works fine; somewhere between 700 and 800 things fall over. Some quick tests show very little difference in performance between 600 and 700 trees, so I suspect there is even less between 700 and 800. In short, I have a viable workaround. Thank you so much! And if you do discover more about what is causing this, I’d like to know.

In particular, I wonder if there are several events that simply refuse to be classified properly, and if so, which events they are: that way I might be able to spot something (like a weird weight, or some odd set of event parameters). This request, by the way, is based on an assumption about how BDTs work! :slight_smile:

Again, many thanks for your time!

Glad I could be of help :slight_smile:

I will post anything I find to this topic, but don’t count on anything in the near future.

Unfortunately there is no way (AFAIK) to turn on verbose training output for the BDT, but I have access to the source code and can add some temporary debug printouts to get an idea of what is going on.
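
The kind of temporary printout I have in mind is nothing more elaborate than the helper below (the function and where it would be called from are hypothetical; it just flags the first place a response stops being finite):

```cpp
// Hypothetical debug helper to drop next to wherever the per-tree responses
// are accumulated: it reports when an accumulated value stops being finite.
#include <cmath>
#include <cstdio>

inline void CheckResponse(double response, int iTree, const char *where)
{
   if (!std::isfinite(response))
      std::fprintf(stderr, "DEBUG [%s]: non-finite response after tree %d\n",
                   where, iTree);
}
```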

Cheers,
Kim