TChain and TFormula correct usage. Enable to re-use the TChain afterward

RENATO_QUAGLIANI · February 2, 2019, 8:43am

Dear ROOT experts,
I googled the usage of TFormula with TChain a bit and may have found a solution. I just wanted to share it and ask whether there are drawbacks in the approach I am following.

Here is the body of the method I wrote to build the content of a struct (namely IsoBinReport):

struct IsoBinReport{
   double sumOfWeightsPas;
   double sumOfEntriesPas;
   double sumOfEntries = 0; // incremented for every entry in the loop below
   std::vector< std::pair< double, double > > varFill_weightFill;
};
IsoBinReport GetIsoBinReport( TChain* _eventType , TString _VarName  ,double min, double max, const TCut  & _Cut , const TString & _Weight , double frac  , bool debug ){
  IsoBinReport Report; 
  Report.sumOfWeightsPas = 0; 
  Report.sumOfEntriesPas = 0;
  Report.varFill_weightFill = {};
  Long64_t _nEntries = _eventType->GetEntries() ; 
  if( frac > 1 && frac < _nEntries){ _nEntries = frac ; }
  if( frac > 0 && frac < 1 ){ _nEntries = floor( _nEntries * frac); }

  TString _CutExpression    = _Cut  ;
  TString _WeightExpression = _Weight ;
  //get vector<TString> given the Expression and TChain 
  auto branches = GetBranchesFromExpression( _CutExpression, _eventType )   + 
 GetBranchesFromExpression( _WeightExpression, _eventType ) + GetBranchesFromExpression(_VarName, _eventType ); 
  _eventType->SetBranchStatus("*",0);
  for( auto & b : branches){ 
    _eventType->SetBranchStatus(b,1);    
  }

 // HERE
  _eventType->LoadTree(0); // < load the TFile linking to entry 0 
  TTreeFormula *cutFormula    = new TTreeFormula("CUT",_CutExpression,_eventType );
  _eventType->SetNotify( cutFormula); //Is this always required ? 
  TTreeFormula *weightFormula = new TTreeFormula("WEIGHT",_WeightExpression,_eventType );  
  _eventType->SetNotify( weightFormula);  //Is this always required ?
  TTreeFormula *toPlot        = new TTreeFormula("VAR", _VarName , _eventType );
  _eventType->SetNotify( toPlot);  //is this always required ?

  boost::progress_display show_progress_evtloop( _nEntries );    
  Report.varFill_weightFill.reserve(_nEntries);
  int underflow = 0;
  int overflow  = 0;
  for( Long64_t entry = 0 ; entry < _nEntries; ++entry){
       ++show_progress_evtloop;       
       _eventType->LoadTree(entry); //Is this required ?
       _eventType->GetEntry(entry); 
       cutFormula->UpdateFormulaLeaves();  //Is this required ?
       toPlot->UpdateFormulaLeaves(); //Is this required ?
       weightFormula->UpdateFormulaLeaves(); //Is this required ?  		 
       bool   _cutISPas   = (bool) cutFormula->EvalInstance(0);
       double _val        =         toPlot->EvalInstance(0); 
       double _weight     =     weightFormula->EvalInstance(0); 
       Report.sumOfEntries ++;        
       if( _cutISPas == false ){ continue;}
       if( _val < min){
       	  underflow++;
       }
       if ( _val > max){
       	 overflow++;
       }
       Report.sumOfEntriesPas ++; 
       Report.varFill_weightFill.push_back( make_pair(_val, _weight) ) ;      
       Report.sumOfWeightsPas += _weight;
  }
   //re-enable all branches
  _eventType->SetBranchStatus("*",1);
  _eventType->LoadTree(0); //I must do this to enable the re-usage of the TChain afterwards, otherwise i get a nasty segfault from TBuffer
  delete cutFormula;
  delete weightFormula;
  delete toPlot;
  return Report;
};

I can polish the method a bit more, but the questions I have are all inlined in the code.
Does anyone know which functions are the correct ones to use to get the behaviour I implemented?
I basically have a cut expression, a weight expression, and a variable expression, which are used afterwards to fill a histogram.
I parse each of the three as a TTreeFormula and evaluate it at each entry of the TChain so that I can fill the histogram.
I know this is roughly the same as TDataFrame, TTreeReader, Draw, or whatever, but I would like to keep this code as it is, since I want to run over a vector of cuts, a set of weights, and a list of variables while looping over the tree only once. This is the 1-1-1 case, which I managed to get working. My question is: are there drawbacks in the way I use the various methods, namely LoadTree, SetNotify, UpdateFormulaLeaves, and, at the exit, LoadTree(0)?

The code I wrote is working, but this doesn’t imply that I am doing things correctly, or that there are no drawbacks in doing it this way. Hence my question: do you think this is the correct way of doing it?

Thanks in advance,
Renato

ROOT Version: 6.14/02
Platform: macOS
Compiler: Not Provided

Danilo · February 4, 2019, 9:20pm

Hi Renato,

the tree loading and the formula leaves update should not be necessary. Did you try without them and run into a problem?

Most importantly, what is missing in RDataFrame for you to carry out your task? One of its most important features is doing many things in a single event loop.

Cheers,
Danilo

RENATO_QUAGLIANI · February 5, 2019, 7:26am

Hi Danilo,
There is nothing missing. I (as well as all my coworkers) should upgrade ROOT to get RDataFrame. I found it difficult to book, say, N histograms with M cuts and R weights to test in a systematic way. The routine I wrote now dumps a vector of pairs of doubles (value and weight), which I then sort and slice into equal pieces to get an array of values that I pass on to build an isobinned histogram, which I can then use to make efficiency ratios. For example, here I book N consecutive cuts, and I want to check the efficiency of each selection, one after another, across one observable. The machinery needed for RDataFrame is surely all there; what I am missing is experience with it, especially for producing weighted histograms, and I would have to force coworkers to make extensive use of lambdas.
Renato
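
The sort-and-slice step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: IsoBinEdges is a hypothetical helper name, the slicing is done on the number of entries (the weights are carried along but ignored when choosing the edges), and the input is the vector of (value, weight) pairs from the report.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns nBins + 1 edges such that each bin holds
// roughly the same number of entries. Weights are not used for the slicing.
std::vector<double> IsoBinEdges(std::vector<std::pair<double, double>> valueWeight,
                                std::size_t nBins)
{
    std::vector<double> edges;
    if (valueWeight.empty() || nBins == 0) return edges;

    // Sort by value (std::pair compares .first, then .second).
    std::sort(valueWeight.begin(), valueWeight.end());

    const std::size_t n = valueWeight.size();
    edges.reserve(nBins + 1);
    for (std::size_t i = 0; i < nBins; ++i) {
        // Lower edge of bin i: the value sitting at position i * n / nBins.
        edges.push_back(valueWeight[i * n / nBins].first);
    }
    // Upper edge of the last bin: the largest value seen.
    edges.push_back(valueWeight.back().first);
    return edges;
}
```

The input is taken by value so that the caller's vector keeps its original fill order. If the edges are handed to a histogram whose upper bin edge is exclusive, the last edge would need to be nudged slightly above the maximum so that the largest value does not land in the overflow.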

Danilo · February 5, 2019, 9:38am

Hi Renato,

I see: thanks for your answer. We are here to help in case you need support.
One thing that can help your colleagues is the possibility to use strings instead of lambdas and rely on RDF to do the rest. The change from TTreeFormulas and the like should be minimal, or at least bearable, since you can use any C++ in such strings. See for example:
https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html#af415d0a369aaa449492563f47a13fd37

Cheers,
D

RENATO_QUAGLIANI · February 6, 2019, 12:00am

Hi Danilo,
I ended up writing an interface for users; I attach the .hpp and .cpp files.
Do you see any straightforward way I can use to “fill” the vectors, i.e. the vector<double>, vector<double>, vector<bool> holding the value, weight, and selection status?

Note that I fully rely on filling them in the same order; then I return copies of them to do the isobinning.
RXDataPlayer.hpp (9.0 KB)

RXDataPlayer.cpp (9.6 KB)

A side note on processing our heaviest ntuple:

A chain of 20 Draw calls takes 30 minutes; with this approach it takes 2 minutes.
The last boost would be given by RDataFrame.

The main idea is to have

map<selection, vector<bool> > 
map<observable, vector<double>>
map<weight, vector<double> >

then retrieve the set of three vectors and do a zip loop with the Boost libraries to fill histograms (1D, 2D), use copies of them to slice things into isobins, etc.
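
A rough sketch of such a zip-style fill over three parallel vectors, assuming they were filled in the same entry order. FillWeighted is a hypothetical name, a plain index loop stands in for the Boost zip iterator, and a std::vector<double> of fixed-width bins stands in for a real histogram class.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: fills a fixed-width weighted histogram from three
// parallel vectors (selection status, observable value, weight).
std::vector<double> FillWeighted(const std::vector<bool>& pass,
                                 const std::vector<double>& value,
                                 const std::vector<double>& weight,
                                 std::size_t nBins, double lo, double hi)
{
    if (pass.size() != value.size() || pass.size() != weight.size())
        throw std::invalid_argument("vectors must have the same length");
    if (nBins == 0 || !(hi > lo))
        throw std::invalid_argument("need nBins > 0 and hi > lo");

    std::vector<double> bins(nBins, 0.0);
    for (std::size_t i = 0; i < pass.size(); ++i) {
        if (!pass[i]) continue;                          // entry failed the selection
        if (value[i] < lo || value[i] >= hi) continue;   // drop under/overflow
        const std::size_t b =
            static_cast<std::size_t>((value[i] - lo) / (hi - lo) * nBins);
        // Clamp guards against rounding pushing a value just below hi out of range.
        bins[b < nBins ? b : nBins - 1] += weight[i];
    }
    return bins;
}
```

Because the three vectors are only read, the same selection vector can be combined with any observable and any weight vector without looping over the tree again.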

pcanal · February 6, 2019, 7:38pm

       _eventType->LoadTree(entry); //Is this required ?
       _eventType->GetEntry(entry); 
       cutFormula->UpdateFormulaLeaves();  //Is this required ?
       toPlot->UpdateFormulaLeaves(); //Is this required ?

The GetEntry (or, more exactly, the branch-specific GetEntry) will be done by TTreeFormula. Doing a TTree::GetEntry here is a pessimization, as it reads more data than needed. Since there is no (need for a) GetEntry, the LoadTree is necessary to indicate which entry you want to read.

The UpdateFormulaLeaves is necessary, but only when the TChain switches from one file to another. You can either monitor TChain::GetTreeNumber or attach a callback object (TChain::SetNotify) that will be called whenever the TChain opens a new file.
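
A minimal sketch of the first option (monitoring the tree number), under a few assumptions: ScanChain and SelectedEntry are hypothetical names, and Chain and Formula are template placeholders so that the snippet compiles without ROOT. With ROOT they would be TChain and TTreeFormula, which provide LoadTree, GetTreeNumber, UpdateFormulaLeaves, and EvalInstance. The expressions are assumed to be scalar (one instance per entry).

```cpp
#include <vector>

// Hypothetical result type: one (value, weight) record per selected entry.
struct SelectedEntry {
    double value;
    double weight;
};

// Sketch only: Chain is expected to behave like TChain and Formula like
// TTreeFormula; they are template parameters so this compiles without ROOT.
template <class Chain, class Formula>
std::vector<SelectedEntry> ScanChain(Chain& chain, Formula& cut, Formula& var,
                                     Formula& weight, long long nEntries)
{
    std::vector<SelectedEntry> selected;
    int currentTree = -1;  // tree number the formulas were last updated for

    for (long long entry = 0; entry < nEntries; ++entry) {
        // LoadTree positions the chain on this entry; no GetEntry is needed,
        // since the formulas read only the branches they use.
        if (chain.LoadTree(entry) < 0) break;

        // Refresh the leaf pointers only when the chain moved to another tree.
        if (chain.GetTreeNumber() != currentTree) {
            currentTree = chain.GetTreeNumber();
            cut.UpdateFormulaLeaves();
            var.UpdateFormulaLeaves();
            weight.UpdateFormulaLeaves();
        }

        if (cut.EvalInstance(0) == 0.0) continue;  // entry fails the cut
        selected.push_back({var.EvalInstance(0), weight.EvalInstance(0)});
    }
    return selected;
}
```

Called as ScanChain(*_eventType, *cutFormula, *toPlot, *weightFormula, _nEntries), this would replace the per-entry GetEntry and UpdateFormulaLeaves calls of the original loop. As far as I know, SetNotify stores a single object, so three consecutive SetNotify calls leave only the last formula registered; checking the tree number sidesteps that.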

Danilo · February 11, 2019, 8:00pm

Hi Renato,

I was away for the last few days and lost track of this thread, apologies.
So I understand that:

- Histogramming is now much faster than before (2 minutes vs. 30 minutes).
- You are after a mechanism that allows you to “extract” three vectors (two vector<double> and one vector<bool>) from the processing of the data.

Is this correct?

Cheers,
Danilo
