Convert XGBoost BDT regression model to TMVA

Stepan_Zakharov · May 4, 2020, 1:44pm

Hi, all!
I’m looking for a way to convert a regression model trained with XGBoost into the TMVA format.
I first tried this code [1], but hit a strange bug. The regression output should be centered around 1 with an RMS of 0.1. However, the mean is biased by 0.5 when I apply the converted regression model to my dataset.
I’m using ROOT 6.14.
[1] XGB2TMVA conversion code

Thanks for any help!
Stepan.
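
One possible source of a constant 0.5 shift, offered only as an assumption since nothing in this thread confirms it: XGBoost adds a global `base_score` to the summed leaf values, and in releases before 2.0 that parameter defaults to 0.5. A converter that sums only the leaves would then be off by exactly that amount. The leaf values in this toy sketch are invented for illustration.

```python
# Toy illustration: an XGBoost regression prediction is base_score plus the
# sum of one leaf value per tree. The leaf values here are invented.
DEFAULT_BASE_SCORE = 0.5  # XGBoost default before release 2.0


def predict(leaf_values, base_score=DEFAULT_BASE_SCORE):
    """Return base_score plus the sum of the selected leaf of each tree."""
    return base_score + sum(leaf_values)


leaves = [0.25, 0.125, 0.125]          # hypothetical leaves hit by one event
with_offset = predict(leaves)           # base_score included
without_offset = predict(leaves, 0.0)   # leaves only, base_score dropped
print(with_offset, without_offset, with_offset - without_offset)
```

If this is the cause, comparing the converted model against `model.predict` on a handful of events should show the same constant difference for every event.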

moneta · May 5, 2020, 9:02am

Hi Stepan,

I don’t know about this conversion code; maybe @swunsch knows more. As indicated, it is probably a work-in-progress version that has not been fully tested yet.
However, ROOT has a way to save a trained XGBoost BDT in ROOT format and then read it back for evaluation with a new fast inference engine, without going through the XML file.
See the tutorials
https://root.cern.ch/doc/master/tmva101__Training_8py.html

and
https://root.cern.ch/doc/master/tmva102__Testing_8py.html

Lorenzo

Stepan_Zakharov · May 5, 2020, 7:51pm

Hi Lorenzo!
Thank you for the quick response!
I’ve installed ROOT 6.20 and checked this solution. Unfortunately, this didn’t have an effect. Here is the output of my program:

Traceback (most recent call last):
  File "MjjRegTrain.py", line 106, in <module>
    main()
  File "MjjRegTrain.py", line 103, in main
    train(params, train_dataset, train_dataset, '2017')
  File "MjjRegTrain.py", line 62, in train
    ROOT.TMVA.Experimental.SaveXGBoost(reg_model, "myBDT", "tmva101.root")
TypeError: none of the 3 overloaded methods succeeded. Full details:
TMVA::Experimental::SaveXGBoost::TMVA::Experimental::SaveXGBoost() =>
takes at most 0 arguments (3 given)
TMVA::Experimental::SaveXGBoost::TMVA::Experimental::SaveXGBoost(const TMVA::Experimental::SaveXGBoost&) =>
takes at most 1 arguments (3 given)
TMVA::Experimental::SaveXGBoost::TMVA::Experimental::SaveXGBoost(TMVA::Experimental::SaveXGBoost&&) =>
takes at most 1 arguments (3 given)

It looks like SaveXGBoost() has not yet been implemented in the master branch.
Best,
Stepan.

swunsch · May 7, 2020, 7:13am

Hi!

The feature is in 6.20, but you have to use PyROOT experimental (which becomes the default in 6.22, release coming soonish). Then you can run the tutorials that Lorenzo linked above! The requirement is xgboost>=0.81, so that you can dump the model as a JSON file.
Sorry for the inconvenience. The feature is still experimental, so it’s unfortunately not yet as accessible as it should be.

Best
Stefan
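
For orientation, the JSON dump mentioned above is one nested dictionary per tree. The sketch below walks such a tree in plain Python. The sample tree is hand-written in the layout produced by `Booster.get_dump(dump_format="json")`, and `evaluate_tree` is a hypothetical helper, not part of ROOT or XGBoost. Missing values are not handled.

```python
import json

# Hand-written sample in the layout of Booster.get_dump(dump_format="json").
SAMPLE_TREE = json.loads("""
{"nodeid": 0, "depth": 0, "split": "f0", "split_condition": 1.5,
 "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "leaf": -0.25},
   {"nodeid": 2, "depth": 1, "split": "f1", "split_condition": 0.0,
    "yes": 3, "no": 4, "missing": 3,
    "children": [{"nodeid": 3, "leaf": 0.5}, {"nodeid": 4, "leaf": 0.75}]}
 ]}
""")


def evaluate_tree(node, x):
    """Follow the splits for feature vector x and return the leaf value."""
    while "leaf" not in node:
        index = int(node["split"].replace("f", ""))   # default names f0, f1, ...
        branch = "yes" if x[index] < node["split_condition"] else "no"
        children = {child["nodeid"]: child for child in node["children"]}
        node = children[node[branch]]
    return node["leaf"]


print(evaluate_tree(SAMPLE_TREE, [1.0, 0.0]))   # f0 < 1.5 -> leaf of node 1
print(evaluate_tree(SAMPLE_TREE, [2.0, -1.0]))  # f0 >= 1.5, f1 < 0.0 -> node 3
print(evaluate_tree(SAMPLE_TREE, [2.0, 3.0]))   # f0 >= 1.5, f1 >= 0.0 -> node 4
```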

Stepan_Zakharov · May 11, 2020, 4:33pm

Dear Stefan,
I reinstalled ROOT with PyROOT experimental and then tried to run the code Lorenzo shared with me. It looks like there is no module tmva100_DataPreparation:

Traceback (most recent call last):
  File "tmva_train.py", line 5, in <module>
    from tmva100_DataPreparation import variables
  File "/home/stepan/root-6-20/lib/python2.7/ROOT/_facade.py", line 80, in _importhook
    return _orig_ihook(name, *args, **kwds)
ImportError: No module named tmva100_DataPreparation

While running my code I got the following error:

Traceback (most recent call last):
  File "MjjRegTrain.py", line 106, in <module>
    main()
  File "MjjRegTrain.py", line 103, in main
    train(params, train_dataset, train_dataset, '2017')
  File "MjjRegTrain.py", line 62, in train
    ROOT.TMVA.Experimental.SaveXGBoost(reg_model, "BDTG", "tmva101.root")
  File "/home/stepan/root-6-20/lib/python2.7/ROOT/pythonization/_tree_inference.py", line 87, in SaveXGBoost
    fill_arrays(tree, 0, len_inputs * i_tree, len_thresholds * i_tree)
  File "/home/stepan/root-6-20/lib/python2.7/ROOT/pythonization/_tree_inference.py", line 70, in fill_arrays
    input = int(node["split"].replace("f", ""))
ValueError: invalid literal for int() with base 10: 'ttH_MET'

where ttH_MET is the first of my input variables. I’m using xgboost 0.82, so dumping the model to JSON works fine.

Best,
Stepan

swunsch · May 11, 2020, 6:09pm

Hi!

Regarding the failure here

ImportError: No module named tmva100_DataPreparation
I suppose you haven’t copied over the tmva100_DataPreparation.py next to the tmva_train.py file? The training script just imports the variable names from the first tutorial.

Now about your second issue: Is it possible that you named your inputs in xgboost? We expect the features to be named f<some number> as they are by default. Probably you have done something like xgboost.DMatrix(x, label=y, feature_names=labels)? Unfortunately, that would not be supported with the experimental version.

Best
Stefan
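
The parsing step shown in the traceback, `int(node["split"].replace("f", ""))`, only works for the default names. A minimal sketch of the failure and of one possible workaround follows. `split_index` is a hypothetical helper, not ROOT code, and the second feature name is a placeholder. In practice the simplest fix may be to train without `feature_names` so that XGBoost keeps `f0`, `f1`, and so on.

```python
def split_index(split, feature_names=None):
    """Map a split label from the JSON dump to a column index.

    Default labels look like "f0", "f1", ...; custom labels are looked up in
    feature_names, which must list the training columns in order.
    """
    if feature_names is not None:
        return feature_names.index(split)
    return int(split.replace("f", ""))


names = ["ttH_MET", "other_var"]  # second name is a placeholder

print(split_index("f7"))                 # default naming
print(split_index("ttH_MET", names))     # custom naming with a lookup table

try:
    split_index("ttH_MET")               # custom name, no lookup table
except ValueError as error:
    print("fails as in the traceback:", error)
```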

Stepan_Zakharov · May 12, 2020, 3:58pm

Hi, Stefan!
Thank you! I was able to save the XGB model in TMVA format. My next question is about the input format for bdt.Compute(x). When I pass a DataFrame object, I get an error like this:

TypeError: Template method resolution failed:
none of the 2 overloaded methods succeeded. Full details:
vector<float> TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >::Compute(const vector<float>& x) =>
TypeError: could not convert argument 1
TMVA::Experimental::RTensor<float,vector<float> > TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >::Compute(const TMVA::Experimental::RTensor<float,vector<float> >& x) =>
TypeError: could not convert argument 1
Failed to instantiate "Compute(DataFrame)"

Am I correct that my input dataset should be converted into a different format, such as RTensor<float, vector<float>>?

Best,
Stepan.

swunsch · May 12, 2020, 5:24pm

Hi,

So the Compute function is like predict in other frameworks, such as sklearn or, I think, also xgboost. You can pass your data as a vector<float>, an RTensor<float> or a numpy.array. However, if you want to inject the model prediction into an RDataFrame, you have to add it in a Define node. Actually, we have a tutorial for this here.

Further, if you want to do this from Python, there is currently an issue: the Python bindings break when you try to make TMVA::Experimental::Compute(model) work. We just found this out in another forum thread, see here, where you will also find the solution.

Best
Stefan

Stepan_Zakharov · May 13, 2020, 5:53pm

Dear Stefan,

Unfortunately, I didn’t manage to read the .root file with the BDT.
I used the idea from here and adapted it for the BDT case. Before that, I checked this example. The first two cases worked fine, so in general I’m able to read files with the model.
Here is the code I used for my model and dataset (all variables are renamed as f_n, where n runs from 0 to 21).

ROOT.gInterpreter.ProcessLine('''
TMVA::Experimental::RBDT<> bdt("BDTG", "/home/stepan/Desktop/CMS/XGB2TMVA_conv/tmva101.root");
computeModel = TMVA::Experimental::Compute<22, float>(bdt);
''')
df = ROOT.RDataFrame('bbggSelectionTree', '/home/stepan/Desktop/CMS/XGB2TMVA_conv/tmp.root')
df = df.Define('y', ROOT.computeModel, namelist)

While running, it throws TypeError:

  File "tmva_mjj_reg.py", line 49, in mjj_regresson_TMVA
    df = df.Define('y', ROOT.computeModel, namelist)
TypeError: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    TypeError: takes at most 2 arguments (3 given)
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>&,NoneType)"
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>*,NoneType)"
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>,NoneType)"

I tried digging through the examples in the TMVA::Experimental code, but was unable to find anything useful.
Any ideas?

Best,
Stepan.

swunsch · May 14, 2020, 7:27am

Hi!

The output of PyROOT tells us that the namelist is resolved as NoneType, see here:
std::string, <the functor>, NoneType

Is it a valid object?

Best
Stefan

clementhelsens · March 28, 2021, 7:07pm

@swunsch , if I may ask, are the python bindings now able to call Compute as expected?
I see something like this here
https://fossies.org/linux/root/tmva/tmva/test/rbdt_xgboost.py
with root 6.22/08

Thanks,
Clement

swunsch · March 29, 2021, 6:58am

Hi!

There was no major change in functionality. So the cases you linked (these are tests of RBDT against XGBoost) work as expected.

Could you specify what exactly you mean?

Best
Stefan

clementhelsens · March 29, 2021, 7:10am

Thanks for the reply @swunsch. I’m trying to find the easiest way to use Compute on an RBDT with RDF in Python, as done in C++ here:
https://root.cern/doc/master/tmva103__Application_8C.html

When trying with root 6.22/06
bdt = ROOT.TMVA.Experimental.RBDT["", "default"]("myBDT", "/eos/experiment/fcc/ee/analyses/case-studies/flavour/Bc2TauNu/xgb_bdt.root")
I get something like

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/sw.hsf.org/spackages/linux-centos7-broadwell/gcc-8.3.0/root-6.22.06-hjtu4d3lq6ltvm7nba47duznmj2f37ci/lib/cppyy/_cpython_cppyy.py", line 86, in __getitem__
    return self.__call__(*(args[0]))
  File "/cvmfs/sw.hsf.org/spackages/linux-centos7-broadwell/gcc-8.3.0/root-6.22.06-hjtu4d3lq6ltvm7nba47duznmj2f37ci/lib/cppyy/_cpython_cppyy.py", line 67, in __call__
    pyclass = _backend.MakeCppTemplateClass(*newargs)
TypeError: 'TMVA::Experimental::RBDT<,default>' is not a known C++ class

so I am wondering if one should continue to use the trick through gInterpreter

swunsch · March 29, 2021, 7:28am

Ah alright! Unfortunately, the simplest solution for Python is a two-line helper function. You can find an example here.

Best,
Stefan

clementhelsens · March 29, 2021, 8:44am

thanks @swunsch.
Here is what I’m trying:

ROOT.gInterpreter.ProcessLine('''
TMVA::Experimental::RBDT<> bdt("Bc2TauNu_BDT", "/eos/experiment/fcc/ee/analyses/case-studies/flavour/Bc2TauNu/xgb_bdt.root");
computeModel = TMVA::Experimental::Compute<10, float>(bdt);
''')

and the RDF call:

.Define("MVA", ROOT.computeModel, {"EVT_thrutshemis_e_min", "EVT_thrutshemis_e_max", "EVT_Echarged_min", "EVT_Echarged_max", "EVT_Eneutral_min", "EVT_Eneutral_max", "EVT_Ncharged_min", "EVT_Ncharged_max", "EVT_Nneutral_min", "EVT_Nneutral_max"})

but I see a similar problem as reported earlier:

  File "examples/FCCee/flavour/Bc2TauNu/analysis_DV.py", line 101, in run
    .Define("MVA", ROOT.computeModel, {"EVT_thrutshemis_e_min", "EVT_thrutshemis_e_max", "EVT_Echarged_min", "EVT_Echarged_max", "EVT_Eneutral_min", "EVT_Eneutral_max", "EVT_Ncharged_min", "EVT_Ncharged_max", "EVT_Nneutral_min", "EVT_Nneutral_max"})
TypeError: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > expression) =>
    TypeError: takes at most 2 arguments (3 given)
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>&,set)"
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>*,set)"
  Failed to instantiate "Define(std::string,TMVA::Experimental::Internal::ComputeHelper<integer_sequence<unsigned long,0,1,2,3,4,5,6,7,8,9>,float,TMVA::Experimental::RBDT<TMVA::Experimental::BranchlessJittedForest<float> >&>,set)"

Any idea?
thanks
Clement

swunsch · March 30, 2021, 7:26am

Hi!

I think the issue is that you put in a set as the last argument (in Python the {'foo', 'bar', ...}) rather than an array or list (use ['foo', 'bar', ...] or ('foo', 'bar', ...)). Try again like this:

.Define("MVA", ROOT.computeModel, ("EVT_...", ...))

Best
Stefan
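
Both `Define` failures in this thread came from the last argument: once it was `None`, once a `set`. A small guard such as the hypothetical `as_column_list` below, which is not a ROOT function, would catch either case before the call reaches the template machinery. It also preserves the column order, which a `set` does not guarantee and which has to match the order used in training.

```python
def as_column_list(columns, expected=None):
    """Return columns as a list of str, rejecting None, sets and bare strings."""
    if columns is None:
        raise TypeError("column list is None")
    if isinstance(columns, (set, frozenset)):
        raise TypeError("column list is a set; use a list or tuple to keep order")
    if isinstance(columns, str):
        raise TypeError("column list is a single string; wrap it in a list")
    result = [str(name) for name in columns]
    if expected is not None and len(result) != expected:
        raise ValueError("expected %d columns, got %d" % (expected, len(result)))
    return result


columns = as_column_list(("EVT_Echarged_min", "EVT_Echarged_max"), expected=2)
print(columns)
# Intended use (requires ROOT, not executed here):
#   df = df.Define("MVA", ROOT.computeModel, columns)
```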

clementhelsens · March 30, 2021, 12:09pm

of course… thanks for spotting this python inconsistency @swunsch. Works as expected!
Thanks for the feedback
Clement

clementhelsens · May 12, 2021, 6:54pm

Hello @swunsch ,

I get a massive seg fault when switching to ROOT v6.24.
The way we save the BDT has changed: we now need to pass num_inputs, as below

ROOT.TMVA.Experimental.SaveXGBoost(bdt, "Bc2TauNu_BDT2", f"{out}/xgb_bdt_stage2.root", num_inputs=len(vars_list))

but for the model evaluation, are there any changes in the syntax?
Thanks,
Clement

EDIT2

in plain python I can run this

import ROOT
ROOT.gInterpreter.ProcessLine('''
TMVA::Experimental::RBDT<> bdt("Bc2TauNu_BDT2", "/eos/experiment/fcc/ee/analyses/case-studies/flavour/Bc2TauNu/xgb_bdt_stage2.root");
computeModel = TMVA::Experimental::Compute<20, float>(bdt);
''')
toto=ROOT.computeModel(0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.)
print(toto.at(0))
5.371956035560288e-08

so it seems to come from the dataframe evaluation

               .Define("MVAVec", ROOT.computeModel, ("EVT_CandMass",
                                                     "EVT_CandRho1Mass",
                                                     "EVT_CandRho2Mass",
                                                     "EVT_CandN",
                                                     "EVT_CandVtxFD",
                                                     "EVT_CandVtxChi2",
                                                     "EVT_CandPx",
                                                     "EVT_CandPy",
                                                     "EVT_CandPz",
                                                     "EVT_CandP",
                                                     "EVT_CandD0",
                                                     "EVT_CandZ0",
                                                     "EVT_CandAngleThrust",
                                                     "EVT_DVd0_min",
                                                     "EVT_DVd0_max",
                                                     "EVT_DVd0_ave",
                                                     "EVT_DVz0_min",
                                                     "EVT_DVz0_max",
                                                     "EVT_DVz0_ave",
                                                     "EVT_Nominal_B_E"))

maybe @eguiraud has an idea?

clementhelsens · May 13, 2021, 8:08am

It seems I still had a variable defined as double rather than float
sorry for the noise
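
Since the fix here was to make every input a `float`, one way to avoid this kind of mismatch is to define float copies of the inputs before applying the model. The sketch below only builds the expression strings. `float_cast_defines` and the `_f32` suffix are hypothetical choices, and the commented lines show how the result might be passed to `RDataFrame.Define`.

```python
def float_cast_defines(columns, suffix="_f32"):
    """Return (new_name, expression) pairs that cast each column to float."""
    return [(name + suffix, "static_cast<float>(%s)" % name) for name in columns]


defines = float_cast_defines(["EVT_CandMass", "EVT_CandN"])
for new_name, expression in defines:
    print(new_name, "=", expression)

# Intended use (requires ROOT, not executed here):
#   for new_name, expression in defines:
#       df = df.Define(new_name, expression)
#   df = df.Define("MVAVec", ROOT.computeModel, [n for n, _ in defines])
```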

eguiraud · May 17, 2021, 8:16am

Hi @clementhelsens ,
good that you found the solution! I would argue that it’s still not nice to crash without a human-readable error message because of a type mismatch. Feel free to open a GitHub issue about the problem (ideally with a self-contained reproducer).

Cheers,
Enrico