Code for variable ranking in BDT regressions (TMVA)

aharel · August 29, 2012, 8:55pm

Dear TMVA maintainers,

I’ve done another batch of studies of multivariate regression using TMVA.
Again the best tool for my needs was the BDT with gradient boosting
(BoostType=Grad). This time most other tools crashed at one stage or another
(MLP, BDT with AdaBoost, PDE-RS, PDE-Foam). And again, I was hindered by the
lack of variable ranking for regression forests.

So I re-hacked the hack I did last time, with a previous ROOT version,
to rank the variables. It’s not very deep mathematically, and has a somewhat
unusual interface, as it runs from a reader. But it’s reasonable, with a clean,
straightforward implementation, and is much better than having no
variable rankings at all. It would feel silly to have to hack TMVA a third
time for the same functionality…

I hope you’ll consider merging this hack into the main TMVA branch.
I’ve put the relevant files in:

www-d0.fnal.gov/~aharel/ah_tmva_hack_2012.tgz

cheers,
Amnon Harel,
University of Rochester

hvoss · November 1, 2012, 12:31pm

Hi Amnon,

Sorry for the very late reply, but we still have problems using the ROOT Talk forum for TMVA. Typically all TMVA
related questions go to the TMVA mailing list on SourceForge.

Concerning your suggestion: I’m sorry, we will not implement a ranking based on using the reader etc…
The TMVA code already has way too many “hacks”, which will make it impossible to maintain in the end.
The “usefulness” of the ranking has to be questioned anyway… maybe in the end it’s much better to use
the correlation of the variables with the target (independent of the regression algorithm)…
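
A minimal sketch of such an algorithm-independent ranking, assuming the plain Pearson correlation coefficient is the correlation meant here. The function names and the toy data are hypothetical and serve only as illustration; none of this is TMVA code.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_by_target_correlation(columns, target):
    """Sort input variables by |correlation with the target|, largest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (an assumption for illustration): the target depends strongly
# on "a", weakly on "b", and not at all on "noise".
rng = random.Random(1)
n_events = 2000
a = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
b = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
target = [2.0 * ai + 0.5 * bi + 0.1 * rng.gauss(0.0, 1.0) for ai, bi in zip(a, b)]

ranking = rank_by_target_correlation({"a": a, "b": b, "noise": noise}, target)
for name, score in ranking:
    print(f"{name:>6s}  |corr| = {score:.3f}")
```

Such a ranking only sees linear, one-variable-at-a-time dependence, so it can differ from what a trained forest actually uses.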

cheers,

Helge

aharel · November 1, 2012, 1:48pm

Dear Helge,

I assume the code can easily be moved to the training code. I am not familiar with TMVA’s training code, though, so I’ll leave that to the experts. As for the usefulness of the ranking, I agree it lacks theoretical grounding. It also seems a bit unstable under small changes in the inputs. But the information it provides is quite different from that captured by a simple correlation factor (or even a fancier statistic like mutual information). And it clearly answers a natural question to ask about a BDT!
In practice, I found it useful in understanding the possible input variables when there were several possible ways of preprocessing them. AFAIK, because the tree depth is limited, it can matter in principle whether two related variables (out of many) are input as A, B or as A, B/A or as A, B-A, etc.
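
A toy illustration of that last point, as a sketch under stated assumptions: the “tree” is a single axis-aligned split (depth 1) fitted by least squares, and the target is a step in B/A. The helper `best_stump_mse` and the data are hypothetical and unrelated to TMVA’s implementation.

```python
import random


def best_stump_mse(features, target):
    """Lowest mean squared error reachable with one axis-aligned split.

    `features` is a list of columns; each leaf predicts the mean target
    of the events that fall into it.
    """
    n = len(target)
    total = sum(target)
    total_sq = sum(t * t for t in target)
    best_sse = total_sq - total * total / n  # baseline: no split at all
    for column in features:
        order = sorted(range(n), key=lambda i: column[i])
        left_sum = 0.0
        left_sq = 0.0
        for k in range(n - 1):
            t = target[order[k]]
            left_sum += t
            left_sq += t * t
            if column[order[k]] == column[order[k + 1]]:
                continue  # cannot place a threshold between equal values
            n_left = k + 1
            n_right = n - n_left
            right_sum = total - left_sum
            right_sq = total_sq - left_sq
            sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
            best_sse = min(best_sse, sse)
    return best_sse / n


rng = random.Random(7)
n_events = 1000
A = [rng.uniform(1.0, 2.0) for _ in range(n_events)]
B = [rng.uniform(1.0, 2.0) for _ in range(n_events)]
ratio = [b / a for a, b in zip(A, B)]
# Assumed toy target: depends only on whether B/A exceeds 1.
target = [1.0 if b > a else 0.0 for a, b in zip(A, B)]

mse_raw = best_stump_mse([A, B], target)        # inputs A, B
mse_ratio = best_stump_mse([A, ratio], target)  # inputs A, B/A
print(f"one split on (A, B):   MSE = {mse_raw:.4f}")
print(f"one split on (A, B/A): MSE = {mse_ratio:.4f}")
```

With the ratio as an input, one split can separate the two target values, whereas no single cut on A or B can. Deeper trees narrow this gap but need more splits to approximate the diagonal boundary.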

cheers,
Amnon