I am trying to build a TMVA application in RDataFrame. I have a model trained with both vector and scalar variables. I am using the experimental features shown in https://root.cern/doc/master/tmva003__RReader_8C.html for my use case:
RReader model("data/xml/TMVAClassification_BDTG.weights.xml");
auto computeModel = Compute< 11 , float >(model);
auto variables = model.GetVariableNames();
auto df3 = df2.Define("mvaBDTG", computeModel, variables);
I got errors such as:
Error in <TTreeReaderValueBase::GetBranchDataType()>: Must use TTreeReaderArray to read branch Electron_miniPFRelIso_chg: it contains an array or a collection.
Error in <TTreeReaderValueBase::CreateProxy()>: The branch Electron_miniPFRelIso_chg contains data of type {UNDETERMINED TYPE}, which does not have a dictionary.
I figured that the experimental feature can only handle scalar input variables, so I made all input variables scalar (for vector variables, I take the first element of the array) and got another error:
RDataFrame::Run: event loop was interrupted
terminate called after throwing an instance of 'std::runtime_error'
what(): Size of input vector is not equal to number of variables.
Aborted (core dumped)
At this stage I am not sure how to debug this… Could you confirm whether the experimental TMVA application can only read scalar variables? Is it possible to take vectors as input in a TMVA application under RDataFrame? If not, what is the right way to do it?
Thanks, and looking forward to hearing from you.
Cheers,
Siewyan
Let’s see why the model has a different number of variables than you expect. You can look at the variables with
auto variables = model.GetVariableNames();
Is the output what you expect? It could well be that the TMVA XML is not parsed correctly if there are vectors/objects involved (it’s experimental). Next, you could have a look at the XML itself; it’s not too hard to figure out the fields that hold the expected variables. Another way to debug would be to use the old TMVA::Reader and see whether that works.
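To illustrate what the "Size of input vector is not equal to number of variables" message is complaining about, here is a minimal plain C++ sketch. MockModel, its Compute method and inputCountMatches are hypothetical stand-ins, not the TMVA implementation; the only assumption is that the number of values passed in must match the number of variable names the model holds.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a model that knows its input variable names.
struct MockModel {
    std::vector<std::string> variables;

    // Mimics the kind of check behind the error quoted above:
    // the number of inputs must equal the number of model variables.
    float Compute(const std::vector<float>& inputs) const {
        if (inputs.size() != variables.size())
            throw std::runtime_error("Size of input vector is not equal to number of variables.");
        float sum = 0.f; // placeholder "score", not a real BDT response
        for (float v : inputs) sum += v;
        return sum;
    }
};

// Returns true if a call with nInputs values would pass the size check.
bool inputCountMatches(const MockModel& model, std::size_t nInputs) {
    return model.variables.size() == nInputs;
}
```

With the real model, the analogous check would be to compare the size of the vector returned by GetVariableNames() against the number used as the first template argument of Compute (11 in the post above).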
Thanks for the answer. I have checked the XML content, which gives me an idea of what input the RReader is expecting. It consists of scalar and vector variables. If I pass them with
auto variables = model.GetVariableNames();
auto df3 = df2.Define("mvaBDTG", Compute<11, float>(model), variables);
I get an error about an unknown column:
terminate called after throwing an instance of 'std::runtime_error'
what(): Unknown column: Electron_miniPFRelIso_neu
Aborted (core dumped)
The branch Electron_miniPFRelIso_neu is present in XML and it is derived from the expression
To have the equivalent in the Reader application, I have defined the column accordingly. Somehow the Reader is not able to pick it up… Do you know what causes this? Thanks!
Alright, I see. Yes, the TMVA::Experimental::RReader doesn’t have expressions implemented. Indeed, in an RDataFrame workflow I would keep this out of the Reader and put this logic into RDataFrame for the sake of simplicity. But I also see that this clashes with the existing TMVA interfaces. To be discussed in the future.
Ah I see, so this is expected behavior of Experimental::RReader. Is there any workaround (with expressions implemented on variables) to evaluate the BDT score in RDataFrame in a thread-safe way? Could you provide an example of how to do it? Thanks!
Sorry, I missed the question at the end of your previous post.
If you want to stick to the TMVA::Experimental::RReader interface, you should define your expression as its own column in the training to get rid of the expression. However, since the BDT implementation in TMVA is not thread safe, the Reader will use a global lock to make it thread safe. That’s probably not what you want.
Alternatively, you can instantiate a classic TMVA::Reader once per thread and assign each one to a slot, which makes the evaluation naturally thread safe. In RDataFrame we have the DefineSlot interface (see here) to make this possible.
Sorry for the delay. I was assessing my options on how to approach this. I am eager to try out the experimental feature as it offers an elegant way to perform BDT score evaluation. If I understood correctly, you are suggesting that I define those features during training to avoid expressions as inputs, while on the application side I should use DefineSlot to preserve thread safety?
If this is correct, is there an example of how to perform training with RDataFrame?
You cannot directly perform training with RDataFrame. You still have to do the training with TMVA and define the desired quantities as branches of the TTree. But you can use DefineSlot to use TMVA::Reader in RDataFrame and run on multiple threads.
Hi,
Thanks again for the clarification. I am now working on using DefineSlot in my application. However, the DefineSlot documentation is not very enlightening to me…
I have the setup below, for example the lambda function defined as:
Here is a small example of how you can integrate the TMVA::Reader (mocked by the Reader class there) in a multi-threaded RDataFrame workflow:
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <iostream>
#include <vector>

// Mock standing in for TMVA::Reader
struct Reader {
    float GetMvaValue(float x, float y) { return x * y; }
};
void test() {
    // Enable MT and get the pool size
    ROOT::EnableImplicitMT();
    const auto poolSize = ROOT::GetThreadPoolSize();
    // Create the TMVA::Readers, one per slot
    std::vector<Reader> readers(poolSize);
    // Create a callable evaluating the readers per slot
    auto eval = [&readers](unsigned int slot, float x, float y) { return readers[slot].GetMvaValue(x, y); };
    // Create a RDF with 10 rows and two columns
    ROOT::RDataFrame df(10);
    auto df2 = df.Define("x", "(float)rdfentry_").Define("y", "(float)rdfentry_");
    // Make the evaluation
    auto df3 = df2.DefineSlot("mva", eval, {"x", "y"});
    // Print the result (this triggers the event loop)
    auto mva = df3.Take<float>("mva");
    for (auto& x : mva) std::cout << x << std::endl;
}
Note that you want to instantiate the readers in a vector beforehand and not in the lambda itself; otherwise performance will be horrible. You can create them upfront and put the objects (or pointers to them) in a vector, which you then capture by reference in the lambda (see [&readers]).
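To make the cost difference concrete, here is a small plain C++ sketch without ROOT. CountingReader, evalConstructInside and evalPerSlot are hypothetical names for illustration only, and the event loop is simulated serially; the sketch just counts how many readers get constructed under each pattern.

```cpp
#include <vector>

// Hypothetical mock of a reader that is expensive to construct
// (for example, one that parses a weights file in its constructor).
struct CountingReader {
    static int constructions; // number of readers built so far
    CountingReader() { ++constructions; }
    float GetMvaValue(float x, float y) const { return x * y; }
};
int CountingReader::constructions = 0;

// Slow pattern: a new reader is constructed inside the callable for every event.
int evalConstructInside(unsigned nEvents) {
    CountingReader::constructions = 0;
    auto eval = [](float x, float y) {
        CountingReader reader; // built again on every call
        return reader.GetMvaValue(x, y);
    };
    for (unsigned i = 0; i < nEvents; ++i)
        (void)eval(float(i), float(i));
    return CountingReader::constructions;
}

// Recommended pattern: one reader per slot, built upfront and captured by reference.
// Assumes nSlots > 0.
int evalPerSlot(unsigned nSlots, unsigned nEvents) {
    CountingReader::constructions = 0;
    std::vector<CountingReader> readers(nSlots);
    auto eval = [&readers](unsigned slot, float x, float y) {
        return readers[slot].GetMvaValue(x, y);
    };
    for (unsigned i = 0; i < nEvents; ++i)
        (void)eval(i % nSlots, float(i), float(i)); // slot index simulated here
    return CountingReader::constructions;
}
```

In this mock, the first pattern builds one reader per event while the second builds one per slot, regardless of how many events are processed.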