General questions about TMVA

Einsiedler · January 10, 2018, 11:16pm

Dear TMVA users,

I just started learning about performing multivariate analysis in the context of cosmic-ray research and needless to say, I am a bit lost.

The idea is simple: use the TMVA package to classify events as either cosmic-ray background or some very specific signal.

The cosmic-ray background was simulated using CORSIKA over a wide range of energies. Because I am looking for an ultra-high-energy signal, I need as many events as possible at the highest energies for the CR background. However, since the CR background follows an E^-2.7 spectrum, an extremely large number of events would have to be simulated at the lowest energies. Computing resources being limited, I decided to simulate an E^-1 spectrum instead and weight my histograms by E^-1.7. So far, so good.
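The reweighting described above can be sketched in plain C++ (no ROOT needed). This is only an illustration under stated assumptions: the function names `spectralWeight` and `weightedFractionAbove` are hypothetical, the spectral indices are passed as positive numbers (1.0 simulated, 2.7 target), and the overall normalization of the weight is left arbitrary, since only the relative weights affect the shape of the distributions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Per-event weight that maps a sample simulated with dN/dE ~ E^-simulatedIndex
// onto a target spectrum dN/dE ~ E^-targetIndex.
// For simulatedIndex = 1.0 and targetIndex = 2.7 this is E^-1.7.
double spectralWeight(double energy, double simulatedIndex, double targetIndex) {
    assert(energy > 0.0);  // a power law is only meaningful for positive energies
    return std::pow(energy, simulatedIndex - targetIndex);
}

// Weighted fraction of events with energy above `threshold`.
// Shows how the reweighting suppresses the over-represented high-energy tail.
double weightedFractionAbove(const std::vector<double>& energies, double threshold,
                             double simulatedIndex, double targetIndex) {
    double total = 0.0;
    double above = 0.0;
    for (double e : energies) {
        const double w = spectralWeight(e, simulatedIndex, targetIndex);
        total += w;
        if (e > threshold) above += w;
    }
    return total > 0.0 ? above / total : 0.0;
}
```

With equal simulated and target indices every weight is 1 and the function returns the plain event fraction. With indices 1.0 and 2.7 the same events contribute far less above the threshold, which is the change in bin heights at issue in this thread.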

Each image formed on the telescope cameras by the air showers of the CR background can be characterized by a set of parameters, and these parameters are the variables I am using for the multivariate analysis. The same goes for the signal images; the basic idea is to use TMVA to say whether an image is signal or background, based on these parameters.

Now, here is my concern: changing the spectral index of the CR background does not change the range of values of any of my variables. However, it does have an impact on their distributions (by distribution, I mean the relative heights of the histogram bins for each variable). Does this change in distribution have an impact on the TMVA analysis? Does it depend on the method employed (Fisher, BDT, Likelihood, etc.)?

My intuition is that it shouldn’t really matter, but I am not quite sure about it. My reasoning is that, as I understand it, one can use as many signal events as one likes in the training and testing phases, regardless of the proportion of signal one expects to find hidden in the background in real data. Am I right regarding this last statement?

Thank you for reading, and I apologize in advance if I wasn’t clear enough. Any help is appreciated.

Cheers,
K.

moneta · January 12, 2018, 2:18pm

Hi,

Of course a change in the input distribution will have an impact on the MVA method. For example, different input distributions will result in different sets of weights for your trained neural network.
In the end you might get different score distributions for your signal and Monte Carlo test data, but you can also recalibrate your selection. What is important is that your ROC integral does not become smaller, because in that case you will lose discriminating power.
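To make the ROC integral criterion concrete, here is a minimal standalone C++ sketch of a weighted ROC integral computed directly from classifier scores. It is not TMVA's implementation: the `ScoredEvent` struct and the `weightedRocIntegral` function are hypothetical names introduced for illustration, and the brute-force pairwise loop is chosen for clarity rather than speed.

```cpp
#include <cassert>
#include <vector>

// One test event: classifier output and per-event weight (hypothetical layout).
struct ScoredEvent {
    double score;
    double weight;
};

// Weighted ROC integral: the weight-averaged probability that a signal event
// receives a higher score than a background event. Ties count as one half.
double weightedRocIntegral(const std::vector<ScoredEvent>& signal,
                           const std::vector<ScoredEvent>& background) {
    double pairWeightSum = 0.0;  // sum of w_s * w_b over all pairs
    double orderedSum = 0.0;     // same sum, restricted to correctly ordered pairs
    for (const auto& s : signal) {
        for (const auto& b : background) {
            const double w = s.weight * b.weight;
            pairWeightSum += w;
            if (s.score > b.score) {
                orderedSum += w;
            } else if (s.score == b.score) {
                orderedSum += 0.5 * w;
            }
        }
    }
    assert(pairWeightSum > 0.0);  // both samples need non-zero total weight
    return orderedSum / pairWeightSum;
}
```

Evaluating this once with unit weights and once with the spectral weights on the background shows whether the reweighting changes the separation actually achieved on the physically relevant spectrum.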

As for your last statement: the number of signal events you will actually have in the data is not important.

Best Regards

Lorenzo

Einsiedler · January 12, 2018, 2:53pm

Dear Lorenzo,

thank you very much for your response.

If the distribution matters, then I have to make sure the method knows that my CR background distributions need to be properly weighted.

As I said, since I generated a spectrum with a power law in E^-1 instead of E^-2.7, the proportion of the highest energies in my data set is larger than it should be. Is there a way to tell TMVA to weight my distributions accordingly? In the background tree of my data ROOT file, I have a leaf with the weight of each simulated CR background event. Do I just need to specify this leaf in the line factory->SetBackgroundWeightExpression( "Weight" ); in TMVAClassification.C?

Thank you again for your time!

Cheers,
K.

moneta · January 12, 2018, 2:57pm

Hi,

I presume that in this case you need to set the correct weight for your data. The TMVA methods can handle weighted events, and the way to do it is exactly as you describe, by calling
factory->SetBackgroundWeightExpression( "Weight" );

see page 18 of the TMVA Users Guide (https://github.com/root-project/root/blob/master/documentation/tmva/UsersGuide/TMVAUsersGuide.pdf)

Cheers

Lorenzo

Einsiedler · January 12, 2018, 3:20pm

Thank you for your help, it’s a lot clearer now.

Cheers,
K.