Accessing training and test trees after call to `PrepareTrainingAndTestTree`

xmif0001 · July 8, 2021, 11:13am

Hi there! I’m a summer student at CERN currently working on BDT benchmarking for TMVA. However, this is my first week and I’m still very new to TMVA and ROOT, so I’m unfamiliar with certain things.

I think the best way to illustrate my problem is to start with a toy example, so here we go:

    // Dataset source along with variables
    const std::string filepath = "http://root.cern.ch/files/tmva_class_example.root";
    const std::vector<std::string> variables = {"var1", "var2", "var3", "var4"};

    auto data = TFile::Open(filepath.c_str());
    auto signal = (TTree*) (data->Get("TreeS"));
    auto background = (TTree*) (data->Get("TreeB"));

    // Add variables and register the trees with the dataloader
    auto dataloader = new TMVA::DataLoader("tmva003_BDT");
    for (const auto &var : variables) {
        dataloader->AddVariable(var);
    }

    dataloader->AddSignalTree(signal, 1.0);
    dataloader->AddBackgroundTree(background, 1.0);
    dataloader->PrepareTrainingAndTestTree("", ""); // Key step: divide into training and test trees

The example above is closely based on the ROOT tutorial tutorials/tmva/tmva003_RReader.C.

At this point, I would like to access the TTree instances corresponding to the signal and background training and test sets. I am assuming these are four separate instances once the call to `dataloader->PrepareTrainingAndTestTree("", "")` has applied the cuts to the signal and background TTrees, or am I wrong?

I need the resulting split signal and background datasets because if I’m going to benchmark against some other BDT package, I want to ensure that the datasets are split into the same training and test subsets.

Looking at the source files and following the chain of calls, I’ve arrived at the following partial solution (well, I think it’s in the right direction…). For example, to extract the test dataset after the cut:

    dataloader->GetDefaultDataSetInfo().GetDataSet()->GetTree(TMVA::Types::kTesting)

However, I’m not sure how I can then separate into background and signal trees.

Any help would be greatly appreciated! Also feel free to highlight any misconceptions I might have! Thanks

couet · July 8, 2021, 11:19am

I think @moneta can give you some hints.

xmif0001 · July 8, 2021, 11:27am

Indeed, he gave me the hint to look into DataSet which gives me access to the Event vector, but again I do not necessarily understand how I can extract the split datasets. I’ll continue looking into it, thanks.

moneta · July 8, 2021, 1:20pm

Hi,
You can call `TMVA::DataSet::SetCurrentType` with `TMVA::Types::kTraining` or `TMVA::Types::kTesting` to select the training or the test data.
This gives you the two split datasets, but in each of them the signal and background labels are still mixed.
To get the right label, you need to loop event by event and call `DataSetInfo::IsSignal`.

Lorenzo
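
For readers who want to see the event-by-event split spelled out, here is a minimal sketch, not taken from the thread. `ToyEvent`, `LabelSplit` and `splitByLabel` are hypothetical names, and `ToyEvent` is only a stand-in so the code compiles without ROOT. The commented call at the bottom shows how it might be wired to TMVA. It assumes that `TMVA::DataSet::GetEventCollection` returns a `std::vector<TMVA::Event*>` for the requested subset and that `DataSetInfo::IsSignal` accepts an event pointer, so please check both against your ROOT version.

```cpp
#include <vector>

// Hypothetical stand-in for TMVA::Event, so the sketch compiles without ROOT.
struct ToyEvent {
    std::vector<float> values; // input variables, e.g. var1..var4
    bool signal;               // true for signal, false for background
};

// Result of splitting one subset (training or test) by class label.
template <typename EventPtr>
struct LabelSplit {
    std::vector<EventPtr> signal;
    std::vector<EventPtr> background;
};

// Walk a collection event by event and sort each event pointer into the
// signal or background list according to the predicate `isSignal`.
template <typename EventPtr, typename IsSignal>
LabelSplit<EventPtr> splitByLabel(const std::vector<EventPtr>& events,
                                  IsSignal isSignal) {
    LabelSplit<EventPtr> out;
    for (EventPtr ev : events) {
        (isSignal(ev) ? out.signal : out.background).push_back(ev);
    }
    return out;
}

// With ROOT available, the intended use would look roughly like this
// (an assumption, not compiled as part of this sketch):
//
//   const TMVA::DataSetInfo& dsi = dataloader->GetDefaultDataSetInfo();
//   TMVA::DataSet* ds = dsi.GetDataSet();
//   auto train = splitByLabel(
//       ds->GetEventCollection(TMVA::Types::kTraining),
//       [&](const TMVA::Event* ev) { return dsi.IsSignal(ev); });
//   auto test = splitByLabel(
//       ds->GetEventCollection(TMVA::Types::kTesting),
//       [&](const TMVA::Event* ev) { return dsi.IsSignal(ev); });
```

Running the same function once on the training subset and once on the test subset gives the four groups asked about in the question (signal and background, for both training and test). The lists hold pointers to the existing events rather than copies, so they are only valid while the underlying dataset is alive.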

xmif0001 · July 9, 2021, 6:58am

Thanks for the clarification.