MVA variable correlations

calbet · March 29, 2018, 3:21pm

Dear experts,
I have always heard that we should pay attention to correlations between the input variables in TMVA. But if I use a lot of variables, some of them correlated, and in the end I still get good separation and a nice ROC curve, is there a good reason (other than saving time) not to use correlated variables? I mean, will the result still make sense despite these correlations?
Regards

amadio · April 10, 2018, 6:06am

This question is vague. What kind of variables are you talking about? Could you please give a more specific example? I don’t think that it’s possible to give you a clear answer otherwise. Are you using a neural network with more neurons/layers than necessary to separate signal/background? It might be a problem due to overfitting, for example.

calbet · April 10, 2018, 3:37pm

Dear Amadio,
If I use pt1, pt2 and pt(1+2), we expect pt(1+2) to be correlated with pt1 and pt2. Now I wonder whether it is a problem to use the third variable in the DNN despite its correlation with the other two?
Regards

kialbert · April 10, 2018, 3:50pm

Hi,

As discussed in this answer, in principle not. However, if you introduce many highly correlated variables, you can run into problems with the training set size: many input variables imply a large input space, so your problem can become undersampled.
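
To make both points concrete, here is a minimal toy sketch in plain Python (not TMVA code). It assumes that pt(1+2) means the transverse momentum of the vector sum of the two objects, and it invents the spectra (exponential pT with mean 30 in arbitrary units, uniform phi) purely for illustration. The helper names `pearson`, `make_toy_events` and `mean_events_per_cell` are hypothetical. The first part measures the linear correlations between the three variables; the second shows how quickly the average number of training events per cell drops when each variable is split into 10 bins.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def make_toy_events(n_events, seed=1):
    """Generate toy (pt1, pt2, pt12) lists.

    Assumption: pt12 is the pT of the vector sum of objects 1 and 2.
    The pT and phi distributions are invented for illustration only.
    """
    rng = random.Random(seed)
    pt1, pt2, pt12 = [], [], []
    for _ in range(n_events):
        a = rng.expovariate(1 / 30.0)  # toy pT spectrum, mean 30
        b = rng.expovariate(1 / 30.0)
        phi_a = rng.uniform(-math.pi, math.pi)
        phi_b = rng.uniform(-math.pi, math.pi)
        px = a * math.cos(phi_a) + b * math.cos(phi_b)
        py = a * math.sin(phi_a) + b * math.sin(phi_b)
        pt1.append(a)
        pt2.append(b)
        pt12.append(math.hypot(px, py))
    return pt1, pt2, pt12


def mean_events_per_cell(n_events, n_vars, bins_per_var=10):
    """Average number of events per cell of a regular grid in input space."""
    return n_events / bins_per_var ** n_vars


if __name__ == "__main__":
    pt1, pt2, pt12 = make_toy_events(20000)
    print("corr(pt1, pt2)  =", round(pearson(pt1, pt2), 3))
    print("corr(pt1, pt12) =", round(pearson(pt1, pt12), 3))
    print("corr(pt2, pt12) =", round(pearson(pt2, pt12), 3))

    # Same training sample, more input variables: the grid empties out fast.
    for n_vars in (2, 3, 6, 10):
        print(n_vars, "variables:", mean_events_per_cell(20000, n_vars),
              "events per cell")
```

In this toy, pt1 and pt2 are independent by construction, while pt(1+2) is correlated with both. The grid count is only a rough picture of undersampling: correlated variables populate a smaller part of the input space than the full grid, so the effective sparsity is less severe than the raw number suggests.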

Cheers,
Kim