Hello ROOT enthusiasts :)
I encountered an issue when converting a classifier from XGBoost python package to .root format:
I trained and optimised the classifier using XGBoost. I need to apply it to a lot of data in .root format and therefore decided to use ROOT.TMVA.Experimental.SaveXGBoost to convert the classifier to ROOT TMVA format and then load it with ROOT.TMVA.Experimental.RBDT. I can then apply the classifier using the Compute method of the RBDT.
However, the response of the converted TMVA classifier deviates from the one obtained with XGBoost directly.
I traced the issue down to the early-stopping functionality that I use when training the XGBoost classifier; the L1 regularisation parameter (alpha) seems to further amplify the deviations.
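In case it helps the discussion, a quick check along the lines below (a sketch only, using the best_iteration attribute and the iteration_range argument of predict_proba from the XGBoost sklearn API; clf and X_test refer to the example further down) should show whether the deviation comes purely from the number of trees used at prediction time, since with early stopping XGBoost predicts with the best iteration only, while I assume the converted model contains all boosted rounds:
# sanity check (sketch, relies on clf and X_test from the example below):
# with early stopping, predict_proba uses the trees up to clf.best_iteration by default,
# while iteration_range can force the full ensemble to be used
n_trees = clf.get_booster().num_boosted_rounds()
p_best = clf.predict_proba(X_test.to_numpy())[:, 1]                                 # best iteration only
p_full = clf.predict_proba(X_test.to_numpy(), iteration_range=(0, n_trees))[:, 1]   # all trees
print(max(abs(p_full - p_best)))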
Below is a minimal working example that reproduces this behaviour.
I am using ROOT 6.32.02 and XGBoost 2.1.3.
It would be nice to know whether there is another way to use the (XGBoost) classifier in ROOT, or how to avoid this bug.
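One fallback I could imagine (just a sketch; the file name "data.root", the tree name "tree", and the branch names are placeholders) would be to read the .root data into numpy with RDataFrame.AsNumpy and evaluate the original XGBoost model directly, although a native RBDT solution would be preferable for large datasets:
# sketch of a possible fallback, not using RBDT at all:
# read the branches into numpy with RDataFrame and evaluate the original XGBoost model
import numpy as np
import ROOT
rdf = ROOT.RDataFrame("tree", "data.root")              # placeholder tree/file names
cols = [f"feature_{i}" for i in range(5)]               # placeholder branch names
arrays = rdf.AsNumpy(columns=cols)
X = np.stack([arrays[c] for c in cols], axis=1).astype("float32")
scores = clf.predict_proba(X)[:, 1]                     # clf: the trained XGBClassifier from below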
Cheers,
Lukas
Here is the code, since I can't attach it as a file (new account):
import numpy as np
import pandas as pd
import ROOT
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
# random state set for reproducibility
random_state = 42
n_signal = 5000
n_background = 5000
n_features = 5
xgb_params = {
    "n_estimators": 150,
    "max_depth": 2,
    "learning_rate": 0.5,
    "lambda": 30,
    "early_stopping_rounds": 30,  # This parameter causes issues when converting the classifier to .root format
    "alpha": 1,  # This parameter can increase the deviations
    "subsample": 0.7,
}
# ----------------------------------------------------------------------------------------------------------------------
# Generate toy data
# ----------------------------------------------------------------------------------------------------------------------
np.random.seed(random_state)  # seed numpy so the toy data is reproducible as well
signal = np.random.normal(loc=1.0, scale=1.0, size=(n_signal, n_features))
background = np.random.normal(loc=-1.0, scale=1.0, size=(n_background, n_features))
signal_labels = np.ones(n_signal, dtype=np.int32)
background_labels = np.zeros(n_background, dtype=np.int32)
columns = [f"feature_{i}" for i in range(n_features)]
df_signal = pd.DataFrame(signal, columns=columns)
df_signal["label"] = signal_labels
df_background = pd.DataFrame(background, columns=columns)
df_background["label"] = background_labels
df = pd.concat([df_signal, df_background], ignore_index=True)
# ----------------------------------------------------------------------------------------------------------------------
# Train classifier
# ----------------------------------------------------------------------------------------------------------------------
# Train test split
X_train, X_test, y_train, y_test = train_test_split(df[columns], df["label"], test_size=0.3, stratify=df["label"], random_state=random_state)
clf = XGBClassifier(
    objective="binary:logistic",
    random_state=random_state,
    **xgb_params,  # pass the parameters defined above
)
# fit to training data
clf.fit(
    X_train.to_numpy(), y_train,
    eval_set=[(X_train.to_numpy(), y_train), (X_test.to_numpy(), y_test)],
    verbose=False,
)
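# note: when early stopping triggers, the sklearn wrapper stores clf.best_iteration and
# predict()/predict_proba() use only the trees up to that iteration by default,
# which may differ from the full set of boosted rounds stored in the model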
# ----------------------------------------------------------------------------------------------------------------------
# Convert classifier to ROOT
# ----------------------------------------------------------------------------------------------------------------------
ROOT.TMVA.Experimental.SaveXGBoost(
    clf,
    "classifier",
    "classifier.root",
    num_inputs=n_features,
)
clf_root = ROOT.TMVA.Experimental.RBDT("classifier", "classifier.root")
# Apply classifier directly using XGB interface and using ROOT.TMVA.Experimental
data = np.array(X_test, dtype="float32")  # ensure datatype is float32
y_pred_xgb = clf.predict_proba(data)[:,1].flatten()
y_pred_root = clf_root.Compute(data).flatten()
print("----------------------------------------")
print(f"Max. absolute deviation: {max(abs(y_pred_root-y_pred_xgb)):.2f}")
print(f"Max. relative deviation: {max(abs(y_pred_root-y_pred_xgb)/y_pred_xgb)*100:.2f}%")
print("----------------------------------------")