ValueError: invalid literal for int() with base 10: 'minCNCTrackDist'
where "minCNCTrackDist" was the name of one of the features used to train the XGBClassifier. I checked this feature and it has the same datatype as all the other features (np.float64). Could anyone please help resolve this?
Solution: Train the XGB model using a NumPy input rather than a pd.DataFrame (you can just take your DataFrame and call .to_numpy() on it). It seems SaveXGBoost can't handle labelled data at the moment.
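For reference, a minimal sketch of the workaround (the column names and data here are hypothetical stand-ins): calling .to_numpy() strips the column labels, leaving a plain float64 array, which is what you then pass to XGBClassifier.fit instead of the labelled DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame; "minCNCTrackDist" stands in for one of the real columns
df = pd.DataFrame({"minCNCTrackDist": [0.1, 0.2, 0.3],
                   "otherFeature":    [1.0, 2.0, 3.0]})

# Plain float64 array with the column labels dropped
X = df.to_numpy()
print(X.dtype, X.shape)  # float64 (3, 2)
# model.fit(X, y) then sees an unlabelled array, so SaveXGBoost has no names to choke on
```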
Hi,
Thank you for posting the workaround for the problem. That said, I can investigate whether we can get SaveXGBoost working when the XGB model has been trained on a pandas DataFrame.
Do you have some code example and a model file showing this?
Thanks,
Here is the code I used (I have redacted some parts which are not doing much apart from cleaning up data).
import uproot as up
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import ROOT

with up.open("./data/train_data.root:Events") as f:
    train = f.arrays(training_cols + reading_cols, library="pd")
# Do some other cleaning up of these data... I don't include it here but it is nothing
# particularly important. The preprocessor is also defined in that redacted part.
preprocessor.fit(train[training_cols])
train_processed = pd.DataFrame(preprocessor.transform(train[training_cols]), columns=training_cols)
# Compute weights balancing both classes
num_all = len(train_processed)
num_sig = train_processed.IsSignal.value_counts()[1]
num_bkg = train_processed.IsSignal.value_counts()[0]
w = np.hstack([np.ones(num_sig) * num_all / num_sig, np.ones(num_bkg) * num_all / num_bkg])  # assumes signal rows precede background rows
model = XGBClassifier(
    max_depth=4,
    reg_alpha=2,        # L1 regularization
    reg_lambda=4,       # L2 regularization
    gamma=1,            # Minimum loss reduction to initiate a new leaf partition
    objective='binary:logistic',
    learning_rate=0.2,
    n_estimators=40,
    base_score=0.5,
    eval_metric=['auc', 'logloss']
)
X_train = train_processed[training_cols].to_numpy() # Convert to numpy
X_valid = val_processed[training_cols].to_numpy() # Convert to numpy
model.fit(X_train, train_processed.IsSignal.to_numpy(), sample_weight=w)
# I do some testing e.g. accuracy, precision ROC-AUC score...not included here
model.save_model('models/XGBClassifier_v3.3_v1.0.8.json')
ROOT.TMVA.Experimental.SaveXGBoost(model, "multiplicityBDT", "XGBClassifier_v1.0.6.root", num_inputs=len(training_cols))
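One caveat on the weight construction above: np.hstack only lines the weights up with the rows if all signal events come before all background events. A small order-independent alternative (a sketch on toy labels, not the original data) picks each row's weight from its own label instead:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_processed.IsSignal, with signal and background interleaved
is_signal = pd.Series([1, 0, 1, 1, 0])

num_all = len(is_signal)
num_sig = int((is_signal == 1).sum())
num_bkg = int((is_signal == 0).sum())

# Each row's weight follows its own label, so row order no longer matters
w = np.where(is_signal == 1, num_all / num_sig, num_all / num_bkg)
```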
I should also note that I cannot even get "import ROOT" to work at the moment. I thought my issue might have been down to outdated modules, etc., but it seems I have simply broken my Python environment.
EDIT: I thought it would also be useful to include the ROOT macro:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>

#include "TMVA/RBDT.hxx"     // TMVA::Experimental::RBDT
#include "TMVA/RTensor.hxx"  // TMVA::Experimental::RTensor
void XGBoostasTMVA() {
  const int NFEATURES = 19;
  // Read pre-processed training data from file
  std::ifstream in("X_train_df.csv");
  std::string line;
  std::vector<std::vector<double>> M;
  int i = 0;
  while (std::getline(in, line)) {
    if (i == 0) {  // skip the CSV header row
      i++;
      continue;
    }
    std::stringstream ss(line);
    std::vector<double> row;
    std::string data;
    while (std::getline(ss, data, ',')) {
      row.push_back(std::stod(data));
    }
    if (!row.empty()) M.push_back(row);
  }
  std::cout << "Input data read, reading in BDT model..." << std::endl;
  // Load the XGBClassifier from file
  TMVA::Experimental::RBDT<> bdt("myBDT", "model_v3.3-1.0.5.root");
  std::cout << "Loaded model successfully!" << std::endl;
  // Move event data into a flattened array of floats (necessary for the RTensor shape);
  // a std::vector keeps this standard C++ rather than a variable-length array
  std::vector<float> eventData(NFEATURES * M.size());
  for (std::size_t k = 0; k < M.size(); k++) {
    const std::vector<double>& thisRow = M[k];
    for (int j = 0; j < NFEATURES; j++) {
      eventData[k * NFEATURES + j] = static_cast<float>(thisRow[j]);
    }
  }
  // Generate predictions
  TMVA::Experimental::RTensor<float> x(eventData.data(), {M.size(), NFEATURES});
  auto predictions = bdt.Compute(x);
  std::cout << "Generated " << predictions.GetSize() << " predictions\n";
  std::cout << "For reference M had " << M.size() << " events\n";
  // Write predictions to a file for offline comparison with Python
  const std::string outfileName = "rbdt_predictions.txt";
  std::ofstream outFile(outfileName);
  if (outFile.fail()) {
    std::cerr << "Error opening file " << outfileName << std::endl;
    return;  // the macro returns void, so it cannot return EXIT_FAILURE
  }
  for (auto& pred : predictions) {
    outFile << pred << std::endl;
  }
}
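To close the loop, the macro's rbdt_predictions.txt can be diffed against the Python-side scores. A sketch of that check (the file name and predict_proba call in the comments are assumptions from the posts above; the toy arrays below stand in for the real outputs so the snippet is self-contained):

```python
import numpy as np

# In practice:
#   rbdt = np.loadtxt("rbdt_predictions.txt")        # written by the ROOT macro
#   py   = model.predict_proba(X_train)[:, 1]        # Python-side signal scores
rbdt = np.array([0.12, 0.87, 0.45])
py = np.array([0.12, 0.87, 0.45])

max_diff = np.max(np.abs(rbdt - py))
agree = np.allclose(rbdt, py, atol=1e-6)
print(f"max |diff| = {max_diff:.2e}, agree = {agree}")
```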