Usage of Category method

Dear TMVA experts,

I’m trying to run the BDT classification with 12 variables. In a subsample of events, 3 out of the 12 variables are missing and are assigned a default value = -9999 no matters if it is a signal or a background event. Even if the BDT looks working pretty nicely, it seems to me a perfect case where the usage of the category method could improve the classification (or at least I would expect a non-worse behavior).
I tried to implement the classification with the category method but I’m a bit perplex about the results. I attach here the macro I used for the training, the output text file and the distributions of the BDT output classifier and a comparison of the ROC-curves. In particular the two-peak shape of the classifier distribution is quite worrisome to me.
I’m wondering if I’m doing something wrong and in case how to implement correctly the procedure.
Please let me know if I have to provide other information. The trees I’m using are quite huge but if needed I can select a subsample to test the macro.
Thank you very much.
Best regards,

andrea

TMVAClassificationCategory.C (13.1 KB)

TMVACategory_output_0_1_12.txt (49.0 KB)

Hi @Andrea_Alici,

I am inviting @moneta to this topic. I am sure he can help you with this.

Cheers,
Javier.

Hi Andrea,

Your macro seems correct to me. At first sight, it seems correct to me. It is an interesting case, I would need to get access to your trees to understand what is happening. If you could share them, for example on cernbox, I could have a look,

Cheers

Lorenzo

Hi Lorenzo,

thanks a lot. I put here (CERNBox) the trees for signal and background, it would be great if you could have a look at it.
Cheers,

andrea

Hi Lorenzo,

I forgot to tell you that the distributions I posted at the beginning of the thread are obtained with TMVAClassificationCategory(1,2,“BDT”,1).
Thanks a lot again.
Cheers,

andrea

Hi,
I have tried your data, but I think your macro requires more data. Can you please send me a reduced version using only the data needed. Thanks

Lorenzo

Hi Lorenzo,

sure, sorry for that. Attached here a version of the macro that works with the trees I put on the CERNbox.
Thank you very much!
Cheers,

andrea

TMVAClassificationCategory.C (10.7 KB)

Hi Andrea,

Thank you for the update of the macro. I could run it and I could reproduce results similar to yours, although the difference between NDT and BDTCat is less pronounced.
Thinking about it, I think it makes sense to me that the performance of the single BDT are superior, because you are using more events for training trees. In the category case you are training separately two different methods, and in principle, if there are no real physical difference between the two cases, it is expected you will get worst performances.
I think the category can be useful for some cases where there are some physical differences between the two category, for example angular region of a detector where you have in place a different type of detector with different resolution, in this case it is maybe better to not mix all the events together. Although I can expect that in several cases, the algorithm can learn the different itself, if they are well reproduced in the data.

Best regards

Lorenzo

Hi Lorenzo,

thank you very much for your reply!
Indeed, the two categories refer to events which are completely similar from the physical point of view (and also from the point of view of detector’s resolution), the only difference is that due to dead areas and detector efficiency in a subsample of events some variables are not present, but apparently the BDT is able to handle this very satisfactorily. Your explanation is very reasonable, now it is more clear to me which are the cases where the usage of the category method could be useful.

Best regards,

andrea