I am a beginner with TMVA, so this is probably a naive question. I am reading the manual section on BDTs and the meaning of the nCuts parameter is not clear to me.
Here is what I understand:
Samples are sets of events; each event contains a set of variables and each variable a defined value.
In theory, BDT training scans over all the variables and their values to find the variable and splitting/cut value that produce the best separation. In practice, “the cut values are optimized by scanning over the variable range with a granularity that is set via the option nCuts. The default value of nCuts=20 proved to be a good compromise between computing time and step size”.
Let’s say the variable is the number of particles in the final state, N_i. For 100 events the range of values is 1<N_i<100; nCuts=20 splits this range into 20 subranges and makes 20 scans.
Is that statement true? And if so, which one of the 5 values in the chosen subrange is the splitting value?
If the statement is not true, then what is the correct meaning of nCuts?
Thanks in advance for the help.
Hi Jose and welcome to TMVA
The nCuts parameter controls the granularity of the histogramming done as part of the fast training protocol. In this protocol there is only one pass over the data (per node per variable).
In the case of nCuts=20, two histograms with 21 bins are created and filled with the events, one for signal and one for background. The range is between the minimum and the maximum of the variable and the bin width is uniform.
The separation is then calculated for each split and the best option is chosen.
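To put this in terms of your example: the histograms will have bins roughly 5 units wide, so the first bin covers about 0 to 5, the second about 5 to 10, and so on. The candidate cut values are the 20 boundaries between the 21 bins, not individual values inside a bin. We then loop through all 20 cuts, select the one with the best separation gain, and perform the cut on the corresponding variable.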
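For illustration, here is a minimal Python sketch of the scan described above. It is not the TMVA source: the function names (gini, best_cut) are made up for this example, events are unweighted, and the Gini index is assumed as the separation criterion. It only shows the idea of filling two histograms with nCuts+1 uniform bins between the sample min and max, then testing each of the nCuts internal bin boundaries as a cut.

```python
def gini(s, b):
    """Gini index p*(1-p) for a node holding s signal and b background events."""
    n = s + b
    if n <= 0:
        return 0.0
    p = s / n
    return p * (1.0 - p)


def best_cut(signal, background, n_cuts=20):
    """Return (cut_value, separation_gain) for one variable at one node.

    Hypothetical helper: a sketch of the histogram-based scan, not TMVA code.
    """
    values = list(signal) + list(background)
    xmin, xmax = min(values), max(values)
    n_bins = n_cuts + 1                      # 21 bins for nCuts=20
    width = (xmax - xmin) / n_bins           # uniform bin width
    if width == 0:
        return None, 0.0                     # variable is constant, nothing to cut

    def fill(sample):
        hist = [0] * n_bins
        for x in sample:
            # the maximum value lands in the last bin
            i = min(int((x - xmin) / width), n_bins - 1)
            hist[i] += 1
        return hist

    hist_s, hist_b = fill(signal), fill(background)   # one pass over the data
    tot_s, tot_b = sum(hist_s), sum(hist_b)
    parent = (tot_s + tot_b) * gini(tot_s, tot_b)

    best_value, best_gain = None, 0.0
    left_s = left_b = 0
    for i in range(n_cuts):                  # loop over the internal bin boundaries
        left_s += hist_s[i]
        left_b += hist_b[i]
        right_s, right_b = tot_s - left_s, tot_b - left_b
        gain = (parent
                - (left_s + left_b) * gini(left_s, left_b)
                - (right_s + right_b) * gini(right_s, right_b))
        if gain > best_gain:
            best_value, best_gain = xmin + (i + 1) * width, gain
    return best_value, best_gain


# Toy version of the example: N_i from 1 to 100, signal low, background high
cut, gain = best_cut(list(range(1, 51)), list(range(51, 101)), n_cuts=20)
print(cut, gain)
```

Note that the chosen cut is always one of the bin boundaries, so with a coarse nCuts it can miss the ideal split point by up to one bin width.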
Hope that makes it more clear!
Hi Kim, all,
I am resurrecting this post because I have a question related to the nCuts parameter and the histogram building during the training phase of BDTs in TMVA.
I understand what nCuts does. My questions are:
How are the xmin and xmax of the range selected? Is it done with the FirstBinAboveZero and LastBinAboveZero methods? If, for instance, a distribution has 1 event in the far tail with some empty bins in between, is that event still included, or is there some overflow approach once the fraction of events per bin drops below a threshold?
Are these distributions then normalized or provided to BDT as is?
To put the questions into context: I am trying to visualize how the BDT ‘sees’ the variable distributions when deciding on cuts. I want to make histograms of the variables in a format similar to the one the BDT uses. As far as I understand, the TMVAGUI S/B histograms for input variables are not in the same format as the histograms used for node split decisions in BDT training; for instance, the binning there does not change with nCuts, if I am not wrong.
The actual goal is to look at the distributions with and without negative-weight events, to get some information on how ignoring them changes the shape as the BDT sees it. I want to study different ways of handling such events and find out which one presents the BDT with a distribution closest to the actual one. nCuts gives the binning information, but the range and normalization are not clear to me.
Thanks in advance!
If I remember correctly, the min and max for each input variable are recorded during pre-processing (taking transformations into account); these are then used to define the cuts.
There is (to my knowledge) no mechanism to account for extreme outliers. Instead, a straight uniform binning between the sample min and max is used (see here).
The resulting distribution is kept unnormalised (i.e. it is a count).
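To reproduce such a histogram outside TMVA, something along the following lines could serve as a starting point. This is only a sketch under the assumptions stated in this thread (uniform bins between sample min and max, nCuts+1 bins, weighted counts left unnormalised); node_histogram and its drop_negative flag are hypothetical names, and taking the range from all events even when negative weights are dropped is a choice made here so the two histograms stay comparable, not something confirmed from the TMVA code.

```python
def node_histogram(values, weights, n_cuts=20, drop_negative=False):
    """Weighted, unnormalised histogram with n_cuts + 1 uniform bins.

    Hypothetical helper, not a TMVA function. Returns (edges, bin_contents).
    """
    xmin, xmax = min(values), max(values)    # no outlier protection
    n_bins = n_cuts + 1
    width = (xmax - xmin) / n_bins
    hist = [0.0] * n_bins
    for x, w in zip(values, weights):
        if drop_negative and w < 0:
            continue                         # ignore negative-weight events
        i = min(int((x - xmin) / width), n_bins - 1) if width > 0 else 0
        hist[i] += w                         # plain sum of weights, no normalisation
    edges = [xmin + i * width for i in range(n_bins + 1)]
    return edges, hist


# One far outlier at 100 stretches the range, so the bulk collapses into one bin
values = [0.0, 1.0, 2.0, 3.0, 100.0]
weights = [1.0, 1.0, -0.5, 1.0, 1.0]
edges, with_neg = node_histogram(values, weights, n_cuts=4)
_, without_neg = node_histogram(values, weights, n_cuts=4, drop_negative=True)
print(edges)
print(with_neg)
print(without_neg)
```

The toy input shows both effects discussed here: a single event in the tail sets the upper edge of the range, and dropping negative weights changes the bin contents but not the binning.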
I am sorry to say, I don’t remember off the top of my head exactly how the TMVAGUI S/B works.
Hope this helps, otherwise let me know.
Thank you for your answer. It helps indeed! Could you also explain a bit which transformations you mean by “taking transformations into account”?
TMVA supports decorrelation and normalisation pre-processing using “Transformations=N,G” etc in the data-loader.
I just meant that the cuts are done on these transformed variables, not in the original space. (This is the proper behaviour; I just wanted to point it out because, if one does not know about it, the cut values can be difficult to interpret in terms of the raw variables!)
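As a small illustration of why this matters when reading cut values, here is a sketch that assumes the normalisation ("N") is a simple linear mapping of each variable onto [-1, 1] using the sample min and max. The helpers normalise and to_raw are hypothetical names written for this example, and the decorrelation and Gaussianisation transformations are not covered because they are not simple per-variable rescalings.

```python
def normalise(x, xmin, xmax):
    """Map a raw value linearly onto [-1, 1] (assumed form of the N transformation)."""
    return 2.0 * (x - xmin) / (xmax - xmin) - 1.0


def to_raw(x_norm, xmin, xmax):
    """Invert normalise(): map a value in [-1, 1] back to the raw variable."""
    return 0.5 * (x_norm + 1.0) * (xmax - xmin) + xmin


# A cut at 0.0 in the normalised space, for a raw variable ranging from 1 to 100
xmin, xmax = 1.0, 100.0
cut_in_transformed_space = 0.0
print(to_raw(cut_in_transformed_space, xmin, xmax))
```

So a cut value reported for a transformed variable has to be mapped back with the same min and max before it can be compared to the raw distribution.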