Assign and reference indices or names to TTrees or TChains used with TMVA::DataLoader::AddSignalTree() or TMVA::DataLoader::AddBackgroundTree()

jwruss · February 16, 2022, 1:21am

Hello,

After writing output from a TMVA::Factory to a binary file, I can look at “className” under “TrainTree” and “TestTree” to determine which events in training and testing were used as signal or background. What I would like to know is whether there is a way to write, under these TTrees, either an integer index or a character string indicating which TTree or TChain an event originates from when it is added to a TMVA::DataLoader object through AddSignalTree() or AddBackgroundTree().

I think this can be readily done in the multiclass case using the “className” argument under the function AddTree(), but I don’t think that function can be used with binary classification?

I was thinking I might be able to do it using TMVA::DataLoader::AddSpectator(), but I’m not so sure.

Having assigned a unique weight to each TTree or TChain when calling AddSignalTree() and AddBackgroundTree(), I might be able to refer to values of “weight” under “TrainTree” and “TestTree”. But those weights would have to be recalculated to be referenced this way, and recalculating them requires access to the initial TTrees and TChains. I would like to be able to refer to just TrainTree and TestTree and know where an event originates from, without having to refer back to the initial signal and background trees that were added to the DataLoader.

moneta · February 16, 2022, 2:54pm

Hi,
The information you want is not stored in the output, apart perhaps from the weight, which could be used if all events from the same tree have the same weight and that weight differs from the weights of all the other trees.
I think the best option is to use a spectator variable whose value is different for every tree; otherwise you will have to manage this information yourself outside of TMVA.

Cheers

Lorenzo

jwruss · February 16, 2022, 6:20pm

Thanks, Lorenzo.

With the spectator variable, is there a way to create such a variable for each signal and background tree/chain when the tree/chain doesn’t explicitly have a leaf with a label variable defined? I know that spectator variables can be defined in terms of other variables coming from the tree/chain, but is there a way to reference a tree/chain property like a hash address for the tree? Or reference the order in which trees/chains are added into a DataLoader through AddSignalTree or AddBackgroundTree?

Best,
John

moneta · February 16, 2022, 10:35pm

Hi John,

I don’t think this is possible. Could you not add an extra variable directly to the tree?

Lorenzo

jwruss · February 16, 2022, 10:41pm

Not easily, unfortunately. Hence my asking in the forum if there was a workaround.

moneta · February 17, 2022, 8:22am

If you use RDataFrame it should be easy to add the variable to the tree. Here is an example using the hsimple.root file generated by running the tutorial hsimple.C:

ROOT::RDataFrame d("ntuple", "hsimple.root");
d.Define("treeID",[](){return 1;}).Snapshot("treeName","fileOut.root",{"treeID","px","py","pz"});

Cheers

Lorenzo

jwruss · February 17, 2022, 10:15pm

Having not used ROOT::RDataFrame, could I ask for some elaboration on your lines of code and how to adapt them for my usage? I will break down my syntax questions into parts and subparts below.

Part 1:

ROOT::RDataFrame d("ntuple", "hsimple.root");

a) Looking into the examples under the DataFrame tutorials page, I see that this line sets up a DataFrame object referencing the ntuple object inside the hsimple.root binary file. If considering a TChain of files named like “hsimple0.root”, “hsimple1.root”, …, I could modify the above to

ROOT::RDataFrame d("ntuple", "hsimple*.root");

Correct?

Part 2:

 d.Define("treeID",[](){return 1;})

a) What is Define doing here? Is it creating an integer of value 1 named “treeID”? If so, how can I modify this to define a const char/string object?

Part 3:

d.Define("treeID",[](){return 1;}).Snapshot("treeName","fileOut.root",{"treeID","px","py","pz"});

a) I’m assuming what Snapshot is doing here is rewriting a TTree named “treeName” inside of “fileOut.root” to contain only the variables “treeID”, “px”, “py”, “pz”. If this is the case, how do I modify it to point to a TChain constructed from TFiles named “fileOut0.root”, “fileOut1.root”, …? Do I just replace “fileOut.root” in the Snapshot with “fileOut*.root”?

b) If “fileOut.root” can be replaced with “fileOut*.root” to address output to files in TChain, then can “fileOut*.root” instead be replaced by the initial “hsimple*.root” to say that the files in the initial TChain rewrite themselves?

c) If an RDataFrame object can be used to rewrite files in a TChain, then if I want to just add a new const char leaf to each treeName in each file, then do I replace {“treeID” , “px”, “py”, “pz”} with {“new const char object name”, “sequence of all object names already inside of treeName”}?

d) If replacing {“treeID”, “px”, “py”, “pz”} with {“new const char object name”, “sequence of all object names already inside of treeName”} is valid, then will it work if the “sequence of all object names already inside of treeName” contains user-defined structures or branches? Do I just have to refer to the name of the structure or branch, or would I have to outline the structure or branch content as well?

jwruss · February 19, 2022, 2:24am

Hi again Lorenzo,

Because of a compute node failure, it is proving difficult to regenerate the files referenced by the TChains that I add to my DataLoader objects with AddSignalTree() or AddBackgroundTree(), and so to add a “classID” TLeaf variable to them. For my purposes I am going to refer to the “weight” variable under “TrainTree” and “TestTree” after all. The events in each TChain are assigned the same weight, different from the other TChains.

That all being said, I would appreciate any clarification you might provide to the questions I have asked in my previous post.

Best,
John
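The weight-based lookup described in this post could be sketched as below. It is only an illustration under stated assumptions: the helper `sourceFromWeight` and the labels in the table are hypothetical, the “weight” values are assumed to be stored in single precision (hence the tolerance rather than an exact comparison), and the table is assumed to hold the weights as they actually appear in TrainTree and TestTree, since training options such as NormMode may rescale them.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return the label of the table entry whose weight
// matches `weight` within a relative tolerance, or "unknown" if none does.
// `table` pairs each per-tree weight with a user-chosen label.
std::string sourceFromWeight(float weight,
                             const std::vector<std::pair<double, std::string>>& table,
                             double relTol = 1e-5)
{
    for (const auto& entry : table) {
        const double ref = entry.first;
        // Scale the tolerance by the larger magnitude of the two values.
        const double scale = std::max(std::fabs(ref), std::fabs(static_cast<double>(weight)));
        if (std::fabs(static_cast<double>(weight) - ref) <= relTol * scale) {
            return entry.second;
        }
    }
    return "unknown";
}
```

With a table such as `{{0.25, "signalChainA"}, {1.7, "signalChainB"}}`, each value read from the “weight” branch would be passed to `sourceFromWeight` to recover the label. The approach only works while every tree or chain keeps a distinct weight.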

eguiraud · February 28, 2022, 11:54am

Hi @jwruss ,

Part 1: That’s correct. The full RDF user guide, which also goes through the different ways you can construct an RDF object, is in the ROOT::RDataFrame class reference.

Part 2: See the user guide. Define creates a new logical column in the dataset, which is evaluated on the fly as needed and which you can then use as if it were a dataset column. In this case it creates a new column with values always equal to 1, but you can put arbitrarily complex expressions in there. You can return a std::string instead of an integer without issues. Define-ing columns of pointer type is not supported.

Part 3: Snapshot’s documentation is in the ROOT reference. Snapshot can only produce a single tree; there is no option to split the output into multiple files.

Snapshot cannot be used to rewrite existing data. It can write out user-defined types and complex objects, as long as ROOT knows how to write those types. This typically requires providing dictionaries (see the ROOT page “I/O of custom classes”).

I hope this helps!
Enrico