RDataFrame and TMVA with KFolding

rooter_03 · July 23, 2021, 8:51am

Hi,

I am trying to use:

import ROOT

#---------------------------------------
def add_bdt(df, xmlpath):
    ROOT.gInterpreter.ProcessLine('''TMVA::Experimental::RReader model("{}");'''.format(xmlpath))
    nvars = ROOT.model.GetVariableNames().size()
    ROOT.gInterpreter.ProcessLine('''auto computeModel = TMVA::Experimental::Compute<{}, float>(model);'''.format(nvars))

    l_expr = ROOT.model.GetVariableNames()
    l_varn = ROOT.std.vector['std::string']()
    for i_expr, expr in enumerate(l_expr):
        varname = 'v_{}'.format(i_expr)
        l_varn.push_back(varname)

        df=df.Define(varname, '(float)({})'.format(expr) )
        
    df = df.Define('mva', ROOT.computeModel, l_varn)

    return df
#---------------------------------------
xmlpath='mva_0.xml'
filepath='file.root'

df=ROOT.RDataFrame('tree', filepath)
df=add_bdt(df, xmlpath)

c = ROOT.TCanvas('c', '', 600, 600)
h = df.Histo1D('mva')
h.Draw()
c.SaveAs('plot.png')

in order to calculate the mva classifier column for the tree named tree in the file file.root, using the weights from mva_0.xml.

Now, the problem is that each entry in the tree has a branch called fold, which is an integer between 0 and 4. At the same time I have 5 XML files (mva_0.xml, mva_1.xml, mva_2.xml, mva_3.xml, mva_4.xml), and each entry of the tree needs to be assigned an MVA score from the XML file that corresponds to its fold value.

Question: How do I modify the code above such that the computeModel switches between different model objects depending on the fold value?

Cheers.


_ROOT Version:6.22/06
_Platform:Centos7
_Compiler:gcc8-opt

Or just use what is in:

/cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/bin/root

rooter_03 · July 25, 2021, 8:03am

Hi,

Anyway, I did it in a less elegant way, but it seems to work. However, please look at what I did in the code above. I created a function that takes the XML file and the data frame and just adds a BDT score. All the other nasty stuff (declaring variables, taking care of types, etc.) is hidden. As analyzers, we do not need to know any of that. It can be nicely hidden in such a way that we only interact with a small function. Our lives are already hard dealing with Physics, and we do not want to have to deal with code as well.

ROOT code should focus on solving problems. What is the problem here? We have an MVA score in the XML file and we want to put it in the tree/dataframe. We only need one small function for that, and ideally we would not have to write it ourselves.

Cheers.

eguiraud · July 27, 2021, 7:43am

Hi @rooter_03 ,
I think we need @moneta 's help with your first post, I don’t know enough about TMVA and reading weights from XMLs.

Thank you for following up with a second post. If this is a common problem, we could certainly provide a helper function. @moneta, what do you think?

Cheers,
Enrico

moneta · August 3, 2021, 2:28pm

Hi,
If I have understood you correctly, you would like a better integration of the TMVA model prediction with RDataFrame, to avoid writing functions like the one above.
We are working on this and we are aiming to have something for the next release.
Thank you for your feedback

Lorenzo

rooter_03 · August 4, 2021, 9:05am

Hi,

Thanks for your reply. I think something like this:

df.addMVA('mva_1.xml', 'mva_1')
df.addMVA('mva_2.xml', 'mva_2')

would nicely wrap all into one function. This way we could compare multiple classifiers or use them together.

For k-folding maybe:

d_fold = {
    0: 'mva_0.xml',
    1: 'mva_1.xml',
    2: 'mva_2.xml',
}

df.addMVAFolds(d_fold, 'fold', 'mva')

where d_fold contains the correspondence between the fold and the XML file, fold is a column that should exist in the data frame (which allows picking the score from the right XML) and mva is the new column with the score.
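Until something like that exists in ROOT, the same interface can be approximated with a free function. The following is only a rough, untested sketch: add_mva_folds, fold_dispatch_cpp, fold_dispatch_expr and the kfold_mva namespace are hypothetical names, not ROOT API. It assumes that all XML files share the same input variables, that each model has a single output (so the first element of what RReader::Compute returns is the score), and that the event loop runs single-threaded, since whether RReader::Compute is safe under ROOT.EnableImplicitMT() has not been checked here.

```python
def fold_dispatch_cpp(d_fold, nvars, ns):
    """Build C++ source (hypothetical helper): one RReader per fold and an
    eval() function that picks the reader matching the fold value."""
    args = ', '.join('float x{}'.format(i) for i in range(nvars))
    vals = ', '.join('x{}'.format(i) for i in range(nvars))
    loads = '\n'.join(
        '    m.emplace({}, std::make_unique<TMVA::Experimental::RReader>("{}"));'.format(fold, path)
        for fold, path in sorted(d_fold.items()))
    return '''
#include "TMVA/RReader.hxx"
#include <map>
#include <memory>
#include <vector>
namespace {ns} {{
std::map<int, std::unique_ptr<TMVA::Experimental::RReader>> make_readers() {{
    std::map<int, std::unique_ptr<TMVA::Experimental::RReader>> m;
{loads}
    return m;
}}
auto readers = make_readers();
// Assumption: single-output model, so element 0 is the score.
float eval(int fold, {args}) {{
    std::vector<float> x{{{vals}}};
    return readers.at(fold)->Compute(x)[0];
}}
}}
'''.format(ns=ns, loads=loads, args=args, vals=vals)


def fold_dispatch_expr(ns, fold_col, in_cols):
    """Build the RDataFrame Define expression that calls eval() per entry."""
    return '{}::eval((int)({}), {})'.format(ns, fold_col, ', '.join(in_cols))


def add_mva_folds(df, d_fold, fold_col, out_col, ns='kfold_mva'):
    """Add out_col to df, scoring each entry with the XML of its fold."""
    import ROOT  # imported here so the string helpers above work without ROOT

    # Read the input expressions from one XML (assumed identical in all folds).
    reader = ROOT.TMVA.Experimental.RReader(d_fold[sorted(d_fold)[0]])
    l_expr = [str(expr) for expr in reader.GetVariableNames()]

    # Declare the readers and the dispatch function once in the interpreter.
    ROOT.gInterpreter.Declare(fold_dispatch_cpp(d_fold, len(l_expr), ns))

    # Same trick as in add_bdt above: one float column per input expression.
    in_cols = []
    for i_expr, expr in enumerate(l_expr):
        col = '{}_v{}'.format(out_col, i_expr)
        df = df.Define(col, '(float)({})'.format(expr))
        in_cols.append(col)

    return df.Define(out_col, fold_dispatch_expr(ns, fold_col, in_cols))
```

With this, the call would read df = add_mva_folds(df, d_fold, 'fold', 'mva'). An entry whose fold value has no XML in d_fold makes readers.at(fold) throw, which seems preferable to silently using the wrong model. The namespace has to be different for every call in the same session, because the C++ code can only be declared once.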

I am trying to write my code this way, and I guess many people are writing the same code that I write, for each analysis. This is wasteful, because we spend time writing hundreds of lines of the same bad or at most barely acceptable code (given that many of the coders are students who are still learning) instead of having that code written once, and written well, as part of ROOT.

Code like this would remove hundreds of lines from many analysis code bases and would make us faster and less prone to bugs. This is only my view, and it would be good to have people who actually do data analysis giving further feedback.

Cheers.

system · August 18, 2021, 9:05am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.