BDT - Understanding the implementation

Hello everybody,

I am using BDTs - sometimes for large data sets. Thus I am interested in
a) understanding how BDTs work internally
b) reducing the time needed for training and evaluation

In addition I want a better understanding of what is going on, so I tried to work through TrainNodeFast (the function that shows up in the profiler).

  • What is mapVariable good for? A code comment claims it should “map the subset of variables used in randomised trees to the original variable number (used in the Event())”, however this variable is never read, only written to.
  • I would like to understand line 1256 (and following) of DecisionTree.cxx:

   if (mxVar >= 0) {
      if (DoRegression()) {
         node->SetSeparationIndex(fRegType->GetSeparationIndex(nTotS+nTotB,
                                                               target[0][nBins[mxVar]-1],
                                                               target2[0][nBins[mxVar]-1]));
         node->SetResponse(target[0][nBins[mxVar]-1]/(nTotS+nTotB));
         if ( (target2[0][nBins[mxVar]-1]/(nTotS+nTotB)
               - target[0][nBins[mxVar]-1]/(nTotS+nTotB)*target[0][nBins[mxVar]-1]/(nTotS+nTotB))
              < std::numeric_limits<double>::epsilon() ) {
            node->SetRMS(0);
         } else {
            node->SetRMS(TMath::Sqrt(target2[0][nBins[mxVar]-1]/(nTotS+nTotB)
                                     - target[0][nBins[mxVar]-1]/(nTotS+nTotB)*target[0][nBins[mxVar]-1]/(nTotS+nTotB)));
         }
      }
Why is this target[0], i.e. the target value for the first variable, but the bin index nBins[mxVar]-1 of variable mxVar? What if nBins[mxVar] > nBins[0]? Could someone explain what is done here?

Next point is the decision tree. It stores Node* elements even though it only ever contains DecisionTreeNode* elements. Therefore every traversal requires a dynamic_cast to go from Node* down to DecisionTreeNode* (which is sloooow). Two possible solutions come to my mind:
a) replace dynamic_cast with static_cast. This works and really improves speed, so I have included it in my patches.
b) make the node type of the tree a template parameter. Probably clearer, but more code to rewrite. Do you think it is fine to just use a static_cast?

Some questions about the code style:

  • In MethodBDT.cxx there are lots of const_casts, each one casting away the constness of a “const TMVA::Event*”. Why is this const in the first place?
  • GetSeparationGain and GetSeparationIndex take “const Double_t&” parameters. Why not use plain Double_t?
  • General floating point types: there seems to be a mix of Float_t and Double_t functions/variables. Is there some intention behind this? As an example, look at TMVA::DecisionTree::CheckEvent. It returns a Double_t. Possible return values come from functions returning Float_t or Int_t. So why doesn’t it return a Float_t (without loss of precision)?

Two questions about testing: are there automated tests that I can run, and if so, where do I find them and how do I run them? I don’t want to introduce bugs.

More questions: is there a plan to make use of threads? For example, can one evaluate the training and the testing tree at the same time? What would be required - any substantial changes in TMVA, or can I just go ahead and try to run them via std::async?

The code changes I have made are here:
github.com/root-mirror/root/pull/100
I have also completely modernized the “monster function” TrainNodeFast, but not pushed it yet: it is a more invasive change, I’d like to understand what exactly is going on with target[0], and I’d like to run more tests.

Hi,

thanks a lot for your effort to dig through my old code !!

… as you already wrote yourself, the ‘mapVariable’ is an array that is never used. It might have been used previously and I must have forgotten to remove it.

[quote]Why is this target[0], i.e. the target value for the first variable, but nBins-1 of variable mxVar? What if nBins[mxVar] > nBins[0]? Could someone explain what is done here?[/quote]

Indeed, that is nonsense - a bug I introduced when I added better support for integer-valued variables (the only case where nBins[ivar] can differ between variables). Guess no one ever used ‘regression’ or ‘gradboost’ with explicitly specified integer-valued variables that also happened to be the ‘first’ variable - or they got garbage (or just bad performance which didn’t get noticed as strange).

What this line does is calculate the ‘Response’ of the current node, which is used in ‘GradBoost’ (or generally regression) only. “Response” means what the decision tree would answer to a query, i.e. the ‘mean value’ of the node.

i.e. target[0][nBins[mxVar]-1] (which should correctly be target[0][nBins[0]-1]) is the total number of events in the node (the last bin of the cumulative distribution), hence it is also ‘irrelevant’ which variable is used to calculate it, and I arbitrarily used variable 0.

[quote]Next point is the decision tree. It has Node* as element whereas it contains only DecisionTreeNode* elements. Therefore the tree requires a lot of dynamic_casts to cast them down from Node* to DecisionTreeNode* (which is sloooow). Two possible solutions come to my mind:
a) replace dynamic_cast with static_cast. This works and really improves speed, so I have included it in my patches.
b) make the Nodes* of the tree a template type. Would probably be more clear but more code to rewrite. Do you think it is fine to just use a static_cast?[/quote]

I simply didn’t know that ‘dynamic_cast’ is so time consuming … of course, a ‘static_cast’ should be totally fine.

[quote]Some questions about the code style:

• In MethodBDT.cxx there are lots of const_casts, each one casting away the constness of a “const TMVA::Event*”. Why is this const in the first place?[/quote]

Well because in the rest of the framework ‘someone’ else thought it’s better to have them ‘const’

[quote]• GetSeparationGain, GetSeparationIndex take “const Double_t&” parameters. Why not using plain Double_t?[/quote]

for arguments that should not get changed in a function call, I learned that it’s good practice to specify them as ‘const’ …, and as GetSeparation etc. are just ‘calculators’ … they are set ‘const’.
(I don’t claim to be good or ‘consistent’ in the usage of const … but sometimes I try… might be ‘wrong’, but here I don’t see why)

[quote]• General floating point types: there seems to be a mix of Float_t and Double_t functions/variables. Is there some intention behind this?[/quote]

obviously, storing measured variables (i.e. the event variables) as ‘double’ doesn’t make sense, but all calculations we do in “double”. I actually at some point in the beginning once moved everything to ‘float’ (if I remember correctly, not only in the BDTs, but everywhere) and ran into trouble with precision. There might be a place here and there where ‘float’ would be sufficient, but not everywhere, and that was too risky for me at the time.

[quote]As an example, look at TMVA::DecisionTree::CheckEvent. It returns a Double_t. Possible return values come from functions returning Float_t or Int_t. So why doesn’t it return a Float_t (without loss of precision)?[/quote]

well, the return type is not determined by the BDT but by the rest of the framework. Sure, one could study whether everything that gets calculated afterwards would be sufficient in float, but again… I don’t want to risk someone later coming along with a few more events, asking for even more ‘background rejection’, and getting into trouble with the ROC calculations somewhere. Honestly, I lack the overview to judge if that’s fine or not for every possible use case. It might be, but I simply don’t know for sure and hence don’t start experimenting.

We don’t have many ‘automated tests’. There is stressTMVA.cxx ($ROOTSYS/test/stressTMVA.cxx), the standard test that runs with the ROOT test suite and makes some basic checks. … There is also a ‘UnitTest’ suite (something like the ROOT ‘stressTMVA’ stress test), which I haven’t used for quite a while. Maybe it’s easier to first have a look at stressTMVA.cxx, or contact me via email directly and I could perhaps send you the ‘full’ unit tests and hope they still work :wink:

So I see from the comments on your ‘github’ fork that you achieved overall some sizable speed up, which is fantastic. I almost feel like saying … go ahead with pushing these changes once you are ready!

I didn’t see on github where you changed the precision which you wrote caused some numerical differences. Can you tell me where?

Concerning “threads” … no, there’s no such plan, as it’s certainly more work than I can afford as a hobby in my evenings; for anything beyond that, I’d need a job that pays me to do TMVA :frowning:

Cheers,

Helge

Ah, ok! I am going to fix this also in my rewrite.

[quote]I simply didn’t know that ‘dynamic_cast’ are so time consuming … of course, a ‘static_cast’ should be
totally fine.[/quote]

In 99% of cases it is fine, but here they are done millions of times - every movement in the tree caused a dynamic_cast. I would never have suspected performance issues with dynamic_cast (but measure before you assume anything about performance!). Before I changed any code, perf pointed me to “exp” and “dynamic_cast”.

[quote]• GetSeparationGain, GetSeparationIndex take “const Double_t&” parameters. Why not using plain Double_t?

for arguments that should not get changed in a function call, I learned that it’s good practice to specify them as ‘const’ …, and as GetSeparation etc. are just ‘calculators’ … they are set ‘const’.[/quote]

The point was more about the &, not the const. If you have a “large” object, passing it as const& is good advice. But we have doubles, not large objects, so it is perfectly fine to pass a double by value; there is no need to introduce an indirection through the reference (on 64 bit, sizeof(Double_t) = sizeof(any reference) = 8, so you are not even saving bytes on the stack).
Also it is much easier for the optimizer to work with value types instead of references (at least that’s what Chandler Carruth claims for clang).

[quote]So I see from the comments on your ‘github’ fork that you achieved overall some sizable speed up which is fantastic. I almost feel like saying … go ahead with pushing these changes once you are ready![/quote]

The static_cast and unordered_map changes really save time in all cases.
The exp(a-b) => exp(a)/exp(b) replacement saves time for multiclass trainings where you have many classes. Consider 15 classes: that would be 15*15=225 minus the 15 a=b cases = 210 exp calculations, which have been replaced with 15 exp calls and 210 divisions. For a small number of classes the speed gain is smaller.

Numerically, exp(a-b) != exp(a)/exp(b) in general. That’s why I was warning about it. In my use case both variants gave exactly the same results. I wasn’t sure which values a and b can take, so I wasn’t sure whether the division is allowed here. Consider an extreme case: a=800, b=700. exp(800)/exp(700) = +inf (exp(800) overflows a double) whereas exp(800-700) = exp(100) ≈ 2.7e43.

Today I ran the first working version of the rewrite with success (i.e. identical results), but I might have found another bug (a memory leak). If fUseFisherCuts is true (it has probably always been false in my tests), we run into this code:

   std::vector<TMatrixDSym*>* covMatrices;
   covMatrices = gTools().CalcCovarianceMatrices( eventSample, 2 ); // currently for 2 classes only
   if (!covMatrices){
      Log() << kWARNING << " in TrainNodeFast, the covariance Matrices needed for the Fisher-Cuts returned error --> revert to just normal cuts for this node" << Endl;
      fisherOK = kFALSE;
   } else {
      TMatrixD *ss = new TMatrixD(*(covMatrices->at(0)));
      TMatrixD *bb = new TMatrixD(*(covMatrices->at(1)));
      const TMatrixD *s = gTools().GetCorrelationMatrix(ss);
      const TMatrixD *b = gTools().GetCorrelationMatrix(bb);
None of the pointers get deleted (or the deletes hide from my eyes…).

Also, why are there so many pointers to vectors?! It makes it hard for me to understand what is going on.
Do you know if the ROOT TMatrix types support move semantics? We should get rid of the pointers.

Status of the rewrite: give me a few more days to finish it (I’ll probably need the weekend). It already compiles and produces identical results, but I think the rewritten TrainNodeFast function is still too long.
Also I will try the stressTMVA test.

Cheers
Wolf

Hi,

[quote]I would have never suspected performance issues with dynamic_cast (but measure before you assume anything about performance!). Before changing any code, perf pointed me to “exp” and “dynamic_cast”.[/quote]

good point, I should really learn how to use a profiling tool …

ahh… o.k. … when this code was written, we all still had 32 bit machines :slight_smile: … and I actually always thought the less copying, the better, for performance reasons. That’s all. That a compiler might be better at optimizing code when passing by value rather than by reference is new to me. And particularly at the time of writing this code, I had little idea about these issues (I still do, in fact). So yes, if it’s faster with the references omitted, there’s no reason not to do so.

Again, unordered_map didn’t exist at the time of writing the code (actually, C++11 has only VERY recently been allowed in ROOT) and it is for sure better here. And the possible reduction of exp() calculations is also very well spotted! I would say the numerical differences should really be irrelevant at this point (as they are not about losing precision), and the ‘extreme case’ where you get infinity vs. e43 should just be protected against. Should you run into such ‘values’, the boosting is for sure in an ‘ill-posed’ setting somehow.

Well, the “UseFisherCuts” is an attempt to boost the performance of the BDTs by using something smarter than just ‘rectangular cuts’ at each step. I don’t know if it was ever successfully used by anyone, and … well, it is possible that there is an unspotted memory leak …

Mea culpa :slight_smile:

… do you think so? I don’t see that many when I quickly scan the code. If I use them, it’s typically because I think it’s better than copying the whole vector…

I don’t know… I have to admit, I read about this “move semantics” once and thought it’s yet another complication that doesn’t really let me do anything different from what I can do with the more familiar pointers anyway. So … guess I’m neither the best programmer, nor the kind that likes the ever newest features of a language :slight_smile: If however it really gives an advantage and is supported, then go ahead!

Great! Please keep me updated. (Guess I won’t make any changes to the code myself - i.e. fixing/removing the unused but filled mapVariable array - as it doesn’t seem urgent, and that will make your eventual commit easier.)

cheers,

Helge

I guess that’s because I prefer the modern parts of C++. Rediscovered C++ when C++11 was published :smiley:

I dislike owning “naked” pointers because you need to handle the deletion manually. In ROOT it is even more complicated, because you never really know whether you need to handle ownership (for example, it depends on TH1::AddDirectory)!

Recently Sutter and Stroustrup published the C++ “Core Guidelines”, where they suggest in rule R.11: “Avoid calling new and delete explicitly” (and use owner<T> in cases where you have owning pointers and need to be compatible with old code).

So I was thinking of code like TMVA::Tools::CalcCovarianceMatrices, where we find:

   std::vector<TVectorD*>* vec = new std::vector<TVectorD*>(matNum);
   std::vector<TMatrixD*>* mat2 = new std::vector<TMatrixD*>(matNum);
   (...)
   std::vector<TMatrixDSym*>* mat = new std::vector<TMatrixDSym*>(matNum);
   (...)
   delete mat2;
   delete vec;
   return mat;

While this code works, I don’t see any reason for the pointers at all. vec and mat2 could be plain vectors. For mat (and thus the return type) I would also remove the pointer (C++11 makes sure the vector is moved, or the compiler may even do return value optimization). Getting rid of pointers simplifies the code. But one question remains: who deletes the TMatrixDSym* elements in the mat vector? The caller? If you avoid naked owning pointers, you don’t have to think about this at all.

Anyway, I think my answer is going in the wrong direction, because I don’t want to rant about old C++ code, and I cannot modernize all of the ROOT/TMVA code either. Instead let’s focus only on the relevant functions (i.e. the ones that show up in the profiler). These are the functions to look at (perf output):

[code]+ 27,23%  training  libTMVA.so  [.] TMVA::DecisionTree::TrainNodeFast(std::vector<TMVA::Event const*, std::allocator<TMVA::Event const*> > const&, TMVA::DecisionTreeNode*)
+ 17,41%  training  libTMVA.so  [.] TMVA::Event::GetValue(unsigned int) const
+ 14,26%  training  libTMVA.so  [.] std::__detail::_Map_base<TMVA::Event const*, std::pair<TMVA::Event const* const, std::pair<double, double> >, std::allocator<std::pair<TMVA::Event const* const, std::pair<double, double> > …
+ 12,83%  training  libTMVA.so  [.] TMVA::Event::GetWeight() const
+ 11,63%  training  libTMVA.so  [.] TMVA::DecisionTree::BuildTree(std::vector<TMVA::Event const*, std::allocator<TMVA::Event const*> > const&, TMVA::DecisionTreeNode*)
+  3,70%  training  libTMVA.so  [.] void std::__introsort_loop<__gnu_cxx::__normal_iterator<std::pair<double, double>*, std::vector<std::pair<double, double>, std::allocator<std::pair<double, double> > > >, long>(__gnu_cxx::_…
+  2,04%  training  libTMVA.so  [.] TMVA::DecisionTreeNode::GoesRight(TMVA::Event const&) const[/code]