TMVA likelihood classifier Delta separation

Hi Everybody,
I am working with TMVA 4.0.4 to construct signal and background likelihoods from a set of 30 variables. I then need to decide which subset of them gives the best likelihood classifier, i.e. the one with the highest separation power, because some of the variables are correlated and some degrade the result when combined with others.
I can run it and see the ranking of the individual variables by delta separation, and I can find the signal and background likelihood templates in the output file (TMVA.root), but:
1- Some variables have a negative separation. What does that mean?
2- How can I access the signal and background likelihoods as histograms, and how can I find or calculate the delta separation of the signal and background likelihoods from the test or training sample?
Sorry if my questions are stupid :blush:. I would much appreciate any clue, because I really need to solve this and have already spent a lot of time on it without success.
Thank you in advance :smiley:
Reza Ahmadi

Really, nobody has an answer?

Hi Abifar,

The ranking of variable "i" in the likelihood method is determined by comparing the separation power of the full variable set with the separation power obtained by removing variable "i" from the set. Since the likelihood ignores correlations (unless prior decorrelation is applied), it can easily happen that adding another variable actually decreases the separation power of the full set. This gives you a negative separation difference.
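
To make the sign convention concrete, here is a minimal sketch of that bookkeeping. It is not TMVA code: the function name and the idea of passing in precomputed separations are assumptions made for illustration only.

```cpp
#include <cassert>
#include <vector>

// Separation difference for each variable: the separation of the full
// variable set minus the separation obtained with that variable removed.
// sepWithout[i] is the separation measured after dropping variable i.
// A negative entry means the classifier separates better WITHOUT variable i.
// (Hypothetical helper, not part of TMVA.)
std::vector<double> separationDifferences(double sepFull,
                                          const std::vector<double>& sepWithout)
{
    // By definition the separation lies between 0 and 1.
    assert(sepFull >= 0.0 && sepFull <= 1.0);

    std::vector<double> delta;
    delta.reserve(sepWithout.size());
    for (double s : sepWithout) {
        delta.push_back(sepFull - s);
    }
    return delta;
}
```

With made-up numbers, a full-set separation of 0.50 and separations of 0.42 and 0.55 after removing the first or second variable give differences of about +0.08 and -0.05: the second variable hurts the full set.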

All signal and background histograms are available in the TMVA.root output file. The easiest way to look at them is via TMVAGui.C, but you could also just browse through them with the ROOT browser.

Now, let me tell you that likelihood is really NOT a good method if you want to combine many variables (and 30 are very many!). If you want to stay with likelihood, I suggest you first find out the 3 or 4 best variables you have from the initial TMVA variable ranking, and see how that set performs. Then you should try to gradually add variables, and stop when you cannot increase your performance any further.
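
The procedure described above amounts to a greedy forward selection. The sketch below shows one way to organise it, under the assumption that you provide a callback which retrains the likelihood on a given list of variable indices and returns a figure of merit such as the separation. The names `Evaluate` and `forwardSelect` are hypothetical and not part of TMVA.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback: trains/evaluates a classifier on the given variable
// indices and returns a figure of merit (e.g. the separation). Not a TMVA API.
using Evaluate = std::function<double(const std::vector<std::size_t>&)>;

// Greedy forward selection: start from a seed set, repeatedly add the single
// candidate that improves the score most, and stop when no candidate helps.
std::vector<std::size_t> forwardSelect(std::vector<std::size_t> selected,
                                       std::vector<std::size_t> candidates,
                                       const Evaluate& evaluate)
{
    assert(static_cast<bool>(evaluate));   // the callback must be set

    double best = evaluate(selected);
    while (!candidates.empty()) {
        std::size_t bestPos = candidates.size();   // sentinel: "none found"
        double bestScore = best;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            std::vector<std::size_t> trial = selected;
            trial.push_back(candidates[i]);
            const double score = evaluate(trial);
            if (score > bestScore) {
                bestScore = score;
                bestPos = i;
            }
        }
        if (bestPos == candidates.size()) break;   // no improvement: stop
        selected.push_back(candidates[bestPos]);
        candidates.erase(candidates.begin() + static_cast<std::ptrdiff_t>(bestPos));
        best = bestScore;
    }
    return selected;
}
```

Each pass costs one training per remaining candidate, so with 30 variables it helps to seed `selected` with the 3 or 4 top-ranked variables rather than starting from an empty set.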

For the likelihood method it is usually beneficial to separate the data samples into subpopulations when distinct variable properties can be detected (for example, a variable that has different background shapes in barrel and endcap). For this, TMVA has the “Category” method (MethodCategory), which allows you to determine (automatically) independent likelihood classifiers in the various distinct categories (the categories must be defined by the user).

For more information, please have a look at the TMVA users guide: tmva.sourceforge.net/docu/TMVAUsersGuide.pdf

If you want to use all 30 variables, and are really looking for the most powerful classifier, you should probably use the Boosted Decision Tree (at least try it in parallel to the likelihood classifier to see how much you need to improve the likelihood to achieve a competitive result).

Cheers,
Andreas

Hi Andreas,
Thank you for the detailed answer. I have to use only the likelihood method because of restrictions in my analysis. I found the total likelihood histograms under Method_Likelihood as MVA_Likelihood_B and MVA_Likelihood_S (which I suppose are filled from the test sample tree using all variables). Now I need to calculate the delta separation of these two likelihoods. Can you please let me know which method to use, or how I should do this? I read the TMVA users guide very carefully. The separation formula is there, of course, but I do not see any method or code that uses it to get the separation of the total signal and background likelihoods. I have also been through the source code to find the class or method, but without success because of my poor knowledge of C++.
I would appreciate any clue.
Thanks, Reza Ahmadi

Hi Abifar,

I am not sure I fully understand. The separation <S^2> of the likelihood is printed on standard output, so you should have it.

I have attached below a simple C++ function (taken from TMVA/Tools.h) which computes the separation between two histograms.

Hope this helps.

Cheers,
Andreas


    Double_t TMVA::Tools::GetSeparation( TH1* S, TH1* B ) const
    {
       // compute "separation" defined as
       // <S^2> = (1/2) Int_-oo..+oo { (S(x) - B(x))^2 / (S(x) + B(x)) dx }
       Double_t separation = 0;

       // sanity checks
       // signal and background histograms must have same number of bins and
       // same limits
       if ((S->GetNbinsX() != B->GetNbinsX()) || (S->GetNbinsX() <= 0)) {
          std::cout << " signal and background"
                    << " histograms have different number of bins: "
                    << S->GetNbinsX() << " : " << B->GetNbinsX() << std::endl;
          exit(1);
       }

       Int_t    nstep  = S->GetNbinsX();
       // bin width, assuming equidistant binning
       Double_t intBin = (S->GetXaxis()->GetXmax() - S->GetXaxis()->GetXmin())/nstep;
       // integrals of the two histograms, used to normalise them to unit area
       Double_t nS     = S->GetSumOfWeights()*intBin;
       Double_t nB     = B->GetSumOfWeights()*intBin;
       if (nS > 0 && nB > 0) {
          for (Int_t bin=0; bin<nstep; bin++) {
             // ROOT numbers regular bins 1..N (bin 0 is the underflow bin)
             Double_t s = S->GetBinContent( bin+1 )/Double_t(nS);
             Double_t b = B->GetBinContent( bin+1 )/Double_t(nB);
             // separation
             if (s + b > 0) separation += 0.5*(s - b)*(s - b)/(s + b);
          }
          separation *= intBin;
       }
       else {
          std::cout << " histograms with zero entries: "
                    << nS << " : " << nB << " cannot compute separation"
                    << std::endl;
          separation = 0;
       }

       return separation;
    }
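
For checking the numbers on toy inputs without ROOT, the same integral can be written on plain vectors. This is only a sketch under stated assumptions: the function name `separation` and the vector interface are made up for illustration, both inputs are assumed to share the same equal-width binning over [xmin, xmax], and underflow/overflow bins are not included.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Separation of two binned distributions with identical, equal-width binning.
// Each input is normalised to unit area first, so only the shapes matter.
// (Hypothetical ROOT-free helper, not part of TMVA.)
double separation(const std::vector<double>& sig,
                  const std::vector<double>& bkg,
                  double xmin, double xmax)
{
    if (sig.empty() || sig.size() != bkg.size() || !(xmax > xmin)) {
        throw std::invalid_argument("histograms must share a non-empty binning");
    }
    const double width = (xmax - xmin) / static_cast<double>(sig.size());

    double sumS = 0.0, sumB = 0.0;
    for (std::size_t i = 0; i < sig.size(); ++i) {
        sumS += sig[i];
        sumB += bkg[i];
    }
    if (sumS <= 0.0 || sumB <= 0.0) return 0.0;   // empty histogram: nothing to compare

    double sep = 0.0;
    for (std::size_t i = 0; i < sig.size(); ++i) {
        const double s = sig[i] / (sumS * width);   // signal density in bin i
        const double b = bkg[i] / (sumB * width);   // background density in bin i
        if (s + b > 0.0) sep += 0.5 * (s - b) * (s - b) / (s + b);
    }
    sep *= width;
    assert(sep >= 0.0);   // every term in the sum is non-negative
    return sep;
}
```

Two limiting cases are useful as a sanity check: identical shapes give a separation of 0, and shapes with no overlapping bins give 1, independent of how many entries each histogram holds.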

Hi Andreas,
Really appreciated :smiley:
Thank you, Reza Ahmadi