Permutation feature importance for TMVA model

Dear ROOT experts,

I want to implement the permutation feature importance estimation algorithm for the BDT model created with the TMVA package. I see three ways of doing it, but all have some issues.

The first would be to transform the TMVA BDT model into a sklearn-compatible model and simply use the function I linked to earlier. Unfortunately, I haven’t found any information on how one can do such a transformation (only the other way around, sklearn to TMVA). Is it possible to do so?

The second would be to use the ability of RBDT to run inference on an RDataFrame (as seen in this tutorial). Specifically:

  1. Create an input RDataFrame.
  2. Convert it into a numpy array.
  3. Shuffle one column (i.e. one feature) using numpy functions (see the sketch after this list).
  4. Convert it back to an RDataFrame.
  5. Calculate the model response.
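For step 3, the shuffle itself is straightforward in numpy. A minimal sketch, assuming the features are stacked as columns of a 2D array and using a hypothetical feature index:

import numpy as np

x = np.random.normal(size=(1000, 4))  # toy stand-in for the converted RDataFrame

i_feature = 2  # hypothetical index of the feature to permute
x_shuffled = x.copy()
# Permute the values of one feature across events, leaving the other columns intact
x_shuffled[:, i_feature] = np.random.permutation(x_shuffled[:, i_feature])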

However, I haven’t found any documentation on how to use the RBDT interface with an old-school TMVA Gradient Boosted Decision Trees model (I’m using ROOT 6.16). Is there a way to do this?

The third is the most straightforward one: do it like in (2), but write the shuffled RDataFrame to a file and use the classic way of calculating the model response in a loop over the events in a tree (like in this tutorial). However, this approach seems rather roundabout and too dependent on the I/O speed of the local storage.
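For reference, the write-back step would presumably look like the following sketch in newer ROOT versions (I believe ROOT.RDF.MakeNumpyDataFrame requires at least v6.22, so it’s not available in 6.16; the column contents here are placeholders):

import ROOT
import numpy as np

# Placeholder columns; in practice these come from AsNumpy plus the shuffle
columns = {
    'var1': np.random.normal(size=1000),
    'var2': np.random.normal(size=1000),
}

df = ROOT.RDF.MakeNumpyDataFrame(columns)  # numpy arrays -> RDataFrame
df.Snapshot('tree', 'shuffled.root')       # write the shuffled dataset to disk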

Could you suggest a better way to implement this algorithm?

Thanks in advance,
Aleksandr

Hi!

Do I understand correctly that permutation feature importance shuffles a single column (aka feature or branch)? This is a highly uncommon operation for data processing in HEP, and therefore it’s not straightforward. For sure, the simplest solution is doing the operations fully in Python (and in memory, if that’s possible for you).

You can perform this fairly easily with some more experimental ROOT features. The following script shows how you can apply the BDT model trained in TMVA on data in Numpy arrays (note that the script is fully runnable, but requires at least ROOT v6.22):

import ROOT
import numpy as np

# Train a BDT with TMVA
output = ROOT.TFile('TMVA.root', 'RECREATE')
factory = ROOT.TMVA.Factory('tmva003', output, '!V:!DrawProgressBar:AnalysisType=Classification')

filename = 'http://root.cern.ch/files/tmva_class_example.root'
data = ROOT.TFile.Open(filename)
signal = data.Get('TreeS')
background = data.Get('TreeB')

dataloader = ROOT.TMVA.DataLoader('tmva003_BDT')
variables = ['var1', 'var2', 'var3', 'var4']
for var in variables:
    dataloader.AddVariable(var)

dataloader.AddSignalTree(signal, 1.0)
dataloader.AddBackgroundTree(background, 1.0)
dataloader.PrepareTrainingAndTestTree('', '')

factory.BookMethod(dataloader, ROOT.TMVA.Types.kBDT, 'BDT', '!V:!H:NTrees=300:MaxDepth=2')
factory.TrainAllMethods()

# Load the model
# NOTE: You are entering experimental terrain!
model = ROOT.TMVA.Experimental.RReader('tmva003_BDT/weights/tmva003_BDT.weights.xml')

# Load the data to Numpy arrays
npy_sig = ROOT.RDataFrame('TreeS', filename).AsNumpy(variables)
npy_bkg = ROOT.RDataFrame('TreeB', filename).AsNumpy(variables)

x_sig = np.vstack([npy_sig[var] for var in variables]).T
x_bkg = np.vstack([npy_bkg[var] for var in variables]).T
x = np.vstack([x_sig, x_bkg])
y_true = np.hstack([np.ones(x_sig.shape[0]), np.zeros(x_bkg.shape[0])])

# Apply the model on the data
x_rtensor = ROOT.TMVA.Experimental.AsRTensor(x)
y_pred_rtensor = model.Compute(x_rtensor)
y_pred = np.asarray(y_pred_rtensor)

# Compute the ROC
# NOTE: Here you should iterate with the shuffled inputs
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_true, y_pred)
print(f'AUC: {auc(fpr, tpr):.2f}')
# AUC: 0.93
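To close the loop, the iteration over shuffled inputs could look like this minimal sketch, reusing x, y_true, variables and the model from the script above (taking the drop in AUC as the importance measure, which is one common convention):

# Permutation feature importance: AUC drop when a single feature is shuffled
baseline_auc = auc(fpr, tpr)
for i, var in enumerate(variables):
    x_shuffled = x.copy()
    x_shuffled[:, i] = np.random.permutation(x_shuffled[:, i])
    y_shuffled = np.asarray(model.Compute(ROOT.TMVA.Experimental.AsRTensor(x_shuffled)))
    fpr_s, tpr_s, _ = roc_curve(y_true, y_shuffled)
    print(f'Importance of {var}: {baseline_auc - auc(fpr_s, tpr_s):.3f}')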

Otherwise, you can do the same in a C++ program, which would offer excellent runtime performance. But this choice depends on your specific problem!

Best
Stefan
