TMVA - Handling of Anomalous Values

Dear Experts,

I am giving a fixed amount of data to TMVA, of which some values will sometimes be missing. The reason is that I am giving it track information: each event has a variable number of tracks, but I always provide the same number of input variables. Effectively, I would like TMVA to ignore these missing values or to treat them as anomalies. To that effect, I have been setting the missing values to a very large negative number, i.e. -1E33, which is non-physical for my distributions: no variable is of this order of magnitude (it is very far away from the distributions), and indeed many of them take only positive values. I was wondering whether what I am doing is correct, or whether it might affect my training in some way. If so, what other options do I have regarding these missing values?
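
For illustration, the padding scheme described above can be sketched as follows. This is a minimal sketch, not code from the original setup: the names `pad_event`, `MAX_TRACKS` and `VARS_PER_TRACK`, and the choice of two variables per track, are hypothetical; only the sentinel value -1E33 comes from the post.

```python
MISSING = -1e33      # sentinel from the post: far outside any physical range
MAX_TRACKS = 3       # hypothetical fixed number of track slots per event
VARS_PER_TRACK = 2   # hypothetical number of variables per track


def pad_event(tracks, max_tracks=MAX_TRACKS, n_vars=VARS_PER_TRACK, fill=MISSING):
    """Flatten up to max_tracks tracks into one fixed-length list.

    Empty track slots are filled with the sentinel value. Tracks beyond
    max_tracks are dropped. Each track is assumed to hold n_vars values.
    """
    flat = []
    for i in range(max_tracks):
        if i < len(tracks):
            flat.extend(tracks[i])
        else:
            flat.extend([fill] * n_vars)
    return flat


# An event with only two tracks: the third slot is padded.
event = [(12.5, 0.3), (4.1, -1.2)]
padded = pad_event(event)
print(padded)
```

Every event then yields the same number of inputs (here six), whatever its track multiplicity.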

Thanks in advance.

Ifan

Hi,

Setting the missing values to some default value is indeed a technique sometimes employed when the method does not support missing values natively. There are limitations to this: in general, I think the method could start relying on this unphysical behaviour. It is then up to you to ensure this does not happen.

However, in practice, I think the risk of this is reasonably low.

Maybe @moneta can provide some additional input.

Also, unfortunately, TMVA does not support any handling of missing values other than padding the input. Alternative BDT implementations do support this, however (e.g. xgboost and fastBDT), should that be an option.
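
As a rough idea of what native support can look like: xgboost, for example, learns a default direction for missing entries at each tree split. The toy sketch below illustrates that idea only. It is not xgboost's actual implementation (which scores splits with gradient statistics rather than Gini impurity), and the function names and example data are hypothetical.

```python
import math


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_default_direction(values, labels, threshold):
    """For a fixed cut, try sending missing (NaN) entries to each side
    and keep the side that gives the lower impurity."""
    left = [y for v, y in zip(values, labels) if not math.isnan(v) and v < threshold]
    right = [y for v, y in zip(values, labels) if not math.isnan(v) and v >= threshold]
    missing = [y for v, y in zip(values, labels) if math.isnan(v)]
    score_left = weighted_impurity(left + missing, right)
    score_right = weighted_impurity(left, right + missing)
    if score_left <= score_right:
        return "left", score_left
    return "right", score_right


nan = float("nan")
values = [0.2, 0.4, nan, 1.5, 1.8, nan]   # NaN marks a missing value
labels = [0, 0, 0, 1, 1, 0]
direction, score = best_default_direction(values, labels, threshold=1.0)
print(direction, score)
```

The point of the sketch is that missing entries are never compared against the cut value, so no unphysical number enters the input distributions.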

Cheers,
Kim

Hi @kialbert,

Thank you for your reply. Okay, that is interesting to know. How would one notice if the method was tending to rely on these padded unphysical values?

Cheers,

Ifan

Hi,

I’ve been looking around to see where I picked up this idea, and to update myself on the latest best practice, but I’ve come up empty-handed. What I do see is that the technique is used elsewhere, but I’ve found no leads on how to properly handle the statistical side of the problem. (Basically, from what I’ve seen, people reason that it gives reasonable output, so it’s OK.)
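
One possible way to make the "reasonable output" check concrete is to compare the classifier response for events with and without padded inputs. The sketch below is an assumption rather than an established recipe: `split_by_padding`, the toy inputs and the response values are all hypothetical, and in practice the responses would come from the trained method evaluated on a test sample.

```python
from statistics import mean

MISSING = -1e33  # sentinel used for padded inputs


def split_by_padding(inputs, responses, fill=MISSING):
    """Group classifier responses by whether the event had any padded input."""
    padded, complete = [], []
    for x, r in zip(inputs, responses):
        if fill in x:
            padded.append(r)
        else:
            complete.append(r)
    return padded, complete


# Hypothetical toy inputs and classifier responses.
inputs = [[1.0, 2.0], [1.5, MISSING], [0.7, 2.2], [0.9, MISSING]]
responses = [0.8, 0.1, 0.7, 0.2]

padded, complete = split_by_padding(inputs, responses)
print("mean response, padded events:  ", mean(padded))
print("mean response, complete events:", mean(complete))
```

A difference between the two groups is not by itself a problem, since the number of tracks may carry genuine information. It is probably more telling to make the comparison separately for signal and background and to see whether the response tracks the padding more closely than the true class.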

Sorry to not be able to give you any more specifics than that.

Cheers,
Kim

Hi @kialbert,

Thanks for the reply. That’s okay, no worries, I will try and have a look at some literature myself and try to find some explanation.

Cheers,

Ifan