Hello everybody,
I am using BDTs, sometimes on large data sets. I am therefore interested in
a) understanding how BDTs work internally, and
b) reducing the time needed for training and evaluation.
To get a better picture of what is going on, I tried to work through TrainNodeFast (the function that dominates in the profiler). Some questions came up:
- What is mapVariable good for? The code comment claims it "map[s] the subset of variables used in randomised trees to the original variable number (used in the Event())", but the variable is only ever written to, never read.
- I would like to understand line 1256 (and following) of DecisionTree.cxx:

  if (mxVar >= 0) {
     if (DoRegression()) {
        node->SetSeparationIndex(fRegType->GetSeparationIndex(nTotS+nTotB,
                                 target[0][nBins[mxVar]-1],
                                 target2[0][nBins[mxVar]-1]));
        node->SetResponse(target[0][nBins[mxVar]-1]/(nTotS+nTotB));
        if ( (target2[0][nBins[mxVar]-1]/(nTotS+nTotB)
              - target[0][nBins[mxVar]-1]/(nTotS+nTotB)*target[0][nBins[mxVar]-1]/(nTotS+nTotB))
             < std::numeric_limits<double>::epsilon() ) {
           node->SetRMS(0);
        } else {
           node->SetRMS(TMath::Sqrt(target2[0][nBins[mxVar]-1]/(nTotS+nTotB)
                        - target[0][nBins[mxVar]-1]/(nTotS+nTotB)*target[0][nBins[mxVar]-1]/(nTotS+nTotB)));
        }
     }
     // ...
Why does this read target[0], i.e. the target array of the first variable, but at bin nBins[mxVar]-1 of variable mxVar? What happens if nBins[mxVar] > nBins[0]? Could someone explain what is being done here?
Next point: the decision tree stores its elements as Node*, although it only ever contains DecisionTreeNode* elements. As a result the tree needs a lot of dynamic_casts to get from Node* down to DecisionTreeNode* (which is sloooow). Two possible solutions come to mind:
a) Replace dynamic_cast with static_cast. This works and really improves speed, so I have included it in my patches.
b) Make the tree's node type a template parameter. That would probably be cleaner, but means more code to rewrite.
Do you think it is fine to just use a static_cast?
Some questions about the code style:
- In MethodBDT.cxx there are lots of const_casts, each one casting away the constness of a "const TMVA::Event*". Why is it const in the first place?
- GetSeparationGain and GetSeparationIndex take "const Double_t&" parameters. Why not plain Double_t?
- Floating-point types in general: there seems to be a mix of Float_t and Double_t functions/variables. Is there some intention behind this? As an example, look at TMVA::DecisionTree::CheckEvent: it returns a Double_t, but its possible return values all come from functions returning Float_t or Int_t. So why doesn't it return a Float_t (without loss of precision)?
Two more questions: are there automated tests that I can run? If so, where do I find them and how do I run them? I don't want to introduce bugs.
More questions: is there a plan to make use of threads? For example, can one evaluate the training and the testing tree at the same time? What would be required: any substantial changes in TMVA, or can I just go ahead and try to run them via std::async?
The code changes I have made are here:
github.com/root-mirror/root/pull/100
I have also completely modernized the "monster function" TrainNodeFast, but have not pushed that yet: it is a more invasive change, I would first like to understand what exactly is going on with target[0], and I want to run more tests.