ROOT.TMVA.Experimental.SaveXGBoost Fails

I have a trained XGBClassifier from XGBoost (v1.7.4) in Python. I then load it in (and that works fine):

model = xgb.XGBClassifier()
model.load_model('/path/to/train/model.json')

I then try to convert to a ROOT friendly format with:

ROOT.TMVA.Experimental.SaveXGBoost(model, "multiplicityBDT", "XGBClassifier_v1.0.6.root", num_inputs=19)

However, I get an error:

ValueError: invalid literal for int() with base 10: 'minCNCTrackDist'

Here "minCNCTrackDist" is the name of one of the features used to train the XGBClassifier. I checked this feature and it has the same datatype as all the other features (np.float64). Could anyone please help resolve this?

Edit:
ROOT Version: 6.24/02
Python Version: 3.8.10
OS: Using WSL2 (Windows 10)
XGBoost Version: 1.7.4

Solution: Train the XGB model using a numpy array as input rather than a pd.DataFrame (you can just call .to_numpy() on your DataFrame). It seems SaveXGBoost can't handle labelled columns at the moment.
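For anyone hitting the same error: when the model is trained on a DataFrame, the booster records the column names (e.g. 'minCNCTrackDist') as feature names, which SaveXGBoost then tries to parse as integer indices. A minimal sketch of the workaround, using a hypothetical stand-in DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training DataFrame
df = pd.DataFrame({
    "minCNCTrackDist": [0.1, 0.2, 0.3],
    "nTracks": [3.0, 5.0, 4.0],
})

# .to_numpy() returns a plain float array with no column labels, so the
# booster never records string feature names like 'minCNCTrackDist'
X = df.to_numpy()
print(X.dtype, X.shape)
```

Passing `X` (instead of `df`) to `model.fit` is what avoids the `int()` parsing error.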

Hi,
Thank you for posting the workaround for the problem. However, I can investigate whether we can get SaveXGBoost working when the XGB model has been trained using a pandas DataFrame.
Do you have some code example and a model file showing this?
Thanks,

Lorenzo

Here is the code I used (I have redacted some parts which are not doing much apart from cleaning up data).

with up.open("./data/train_data.root:Events") as f:
    train = f.arrays(training_cols + reading_cols, library="pd")

# Do some other cleaning up of these data...I don't include it here but it is nothing
# particularly important

preprocessor.fit(train[training_cols])
train_processed = pd.DataFrame(preprocessor.transform(train[training_cols]), columns=training_cols)

# Compute weights balancing both classes
num_all = len(train_processed)
num_sig = train_processed.IsSignal.value_counts()[1]
num_bkg = train_processed.IsSignal.value_counts()[0]
w = np.hstack([np.ones(num_sig) * num_all / num_sig, np.ones(num_bkg) * num_all / num_bkg])
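Note that the `np.hstack` above assumes all signal rows come before all background rows. A per-row alternative (sketched here with hypothetical toy labels standing in for `train_processed.IsSignal`) keeps each weight aligned with its row regardless of ordering:

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for train_processed.IsSignal
labels = pd.Series([1, 0, 1, 1, 0])

num_all = len(labels)
num_sig = (labels == 1).sum()
num_bkg = (labels == 0).sum()

# Assign each row its class weight directly, so no sorting assumption is needed
w = np.where(labels == 1, num_all / num_sig, num_all / num_bkg)
```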

model = XGBClassifier(
    max_depth=4,
    reg_alpha=2, # L1 regularization
    reg_lambda=4, # L2 regularization
    gamma=1, # Minimum loss reduction to initiate new leaf partition
    objective='binary:logistic', 
    learning_rate=0.2,
    n_estimators=40,
    base_score=0.5,
    eval_metric=['auc', 'logloss']
)

X_train = train_processed[training_cols].to_numpy() # Convert to numpy 
X_valid = val_processed[training_cols].to_numpy() # Convert to numpy
model.fit(X_train, train_processed.IsSignal.to_numpy(), sample_weight=w)

# I do some testing e.g. accuracy, precision ROC-AUC score...not included here

model.save_model('models/XGBClassifier_v3.3_v1.0.8.json')
ROOT.TMVA.Experimental.SaveXGBoost(model, "multiplicityBDT", "XGBClassifier_v1.0.6.root", num_inputs=len(training_cols))
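For the offline comparison against the ROOT macro further down (which writes rbdt_predictions.txt), the Python-side signal probabilities can be dumped one per line. Here `preds` is a hypothetical stand-in for `model.predict_proba(X_train)[:, 1]`:

```python
import numpy as np

# Hypothetical stand-in for model.predict_proba(X_train)[:, 1]
preds = np.array([0.12, 0.87, 0.45])

# One prediction per line, same layout as the macro's rbdt_predictions.txt
np.savetxt("python_predictions.txt", preds, fmt="%.8f")

loaded = np.loadtxt("python_predictions.txt")
```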

I should now note that I cannot even get "import ROOT" to work at the moment. I thought my issue might have been due to outdated modules; it seems I have now just broken my Python environment.

EDIT: I thought it would also be useful to include the ROOT macro:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>

#include "TMVA/RBDT.hxx"
#include "TMVA/RTensor.hxx"

void XGBoostasTMVA() {
    const int NFEATURES = 19;

    // Read pre-processed training data from file
    std::ifstream in("X_train_df.csv");
    std::string line;
    std::vector<std::vector<double>> M;
    int i = 0;
    while ( getline( in, line ) ) {
        if(i == 0) {
            i++;
            continue;
        }
        std::stringstream ss( line );
        std::vector<double> row;
        std::string data;
        while (getline( ss, data, ',' )) {
            row.push_back(stod( data ));
        }
        if (row.size() > 0) M.push_back(row);
    }
    
    std::cout << "Input data read, reading in BDT model..." << std::endl;

    // Load the XGBClassifier from file
    TMVA::Experimental::RBDT<> bdt("myBDT", "model_v3.3-1.0.5.root");
    std::cout << "Loaded model successfully!" << std::endl;

    // Move event data into a flattened vector of floats (necessary for RTensor reshape).
    // A std::vector is used instead of a variable-length array, which is not standard C++.
    std::vector<float> eventData(NFEATURES * M.size());
    for (std::size_t i = 0; i < M.size(); i++) {
        const std::vector<double> &thisRow = M[i];
        for (int j = 0; j < NFEATURES; j++) {
            eventData[i * NFEATURES + j] = static_cast<float>(thisRow[j]);
        }
    }

    // Generate predictions
    TMVA::Experimental::RTensor<float> x(eventData.data(), {M.size(), NFEATURES});
    auto predictions = bdt.Compute(x);

    std::cout << "Generated " << predictions.GetSize() << " predictions\n";
    std::cout << "For reference M had " << M.size() << " events\n";

    // Write predictions to a file for offline comparison with Python
    const std::string outfileName = "rbdt_predictions.txt";
    std::ofstream outFile(outfileName);
    if (outFile.fail()) {
        std::cerr << "Error opening file " << outfileName << std::endl;
        return; // void macro, so no EXIT_FAILURE return code
    }

    for (std::size_t i = 0; i < predictions.GetSize(); i++) {
        outFile << predictions.GetData()[i] << "\n";
    }
}
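Once both prediction files exist, a quick agreement check can be done offline. This sketch generates toy files in place of the real ones so it is self-contained; in practice python_predictions.txt would come from the Python model and rbdt_predictions.txt from the macro above:

```python
import numpy as np

# Toy stand-ins: in practice these files come from the Python model
# and the ROOT macro, respectively
np.savetxt("python_predictions.txt", [0.12, 0.87, 0.45])
np.savetxt("rbdt_predictions.txt", [0.1200001, 0.8699999, 0.45])

py = np.loadtxt("python_predictions.txt")
cpp = np.loadtxt("rbdt_predictions.txt")
max_diff = np.max(np.abs(py - cpp))
print(f"max |python - rbdt| = {max_diff:.2e}")
```

Small float-level differences are expected since RBDT evaluates in single precision.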