SetTitle slows down RDataFrame performance

Hello,

this is probably a feature, but I ran into something rather natural for data analysis that slows down RDataFrame performance. I attach a test program based on an existing tutorial.

The timing performance is good and behaves as expected with the default program:

Histo1D photon_eta   0.174297094345  [sec]
Histo1D photon_pt   0.00482892990112  [sec]
Histo1D photon_E   0.00204801559448  [sec]
Histo1D photon_ptcone30   0.0020010471344  [sec]
Draw photon_eta   26.5902540684  [sec]
Draw photon_pt   2.00271606445e-05  [sec]
Draw photon_E   5.00679016113e-06  [sec]
Draw photon_ptcone30   3.09944152832e-06  [sec]

However, if I turn on the line

varhistos[myvar].GetXaxis().SetTitle(myvar); 

just after defining the histogram model, it seems that RDataFrame loops over the tree once for each histogram:

Histo1D photon_eta   9.44745612144  [sec]
Histo1D photon_pt   9.23581314087  [sec]
Histo1D photon_E   11.1579310894  [sec]
Histo1D photon_ptcone30   13.5817921162  [sec]
Draw photon_eta   0.0102381706238  [sec]
Draw photon_pt   1.09672546387e-05  [sec]
Draw photon_E   2.86102294922e-06  [sec]
Draw photon_ptcone30   3.09944152832e-06  [sec]

Naively, I find it surprising that setting a title in the histo model has such an effect.

Maybe a good warning for others.

Attachment: test4.py (919 Bytes)

Regards,

Tancredi


ROOT Version: /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.22.02/x86_64-centos7-gcc48-opt/bin/root
Platform: lxplus6


Hi,
RDF produces results lazily: the event loop runs the first time you access one of them. If you access each result as soon as you register its computation with RDF, it has to run one event loop per result rather than a single loop that produces all results at the same time.
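To make the mechanism concrete, here is a toy model in plain Python. It is not ROOT's implementation; `ToyFrame`, `ToyResult`, `book_sum` and `get` are hypothetical names invented for this sketch. It only counts how many passes over the data each access pattern triggers.

```python
class ToyResult:
    """Handle to a value that is computed on first access."""

    def __init__(self, frame, column):
        self.frame = frame
        self.column = column
        self.value = None  # not computed yet

    def get(self):
        # First access triggers a loop that fills every pending result.
        if self.value is None:
            self.frame.run()
        return self.value


class ToyFrame:
    """Toy stand-in for a lazy data frame (hypothetical, not ROOT code)."""

    def __init__(self, data):
        self.data = data
        self.booked = []   # all results registered so far
        self.n_loops = 0   # number of passes over the data

    def book_sum(self, column):
        result = ToyResult(self, column)
        self.booked.append(result)
        return result

    def run(self):
        pending = [r for r in self.booked if r.value is None]
        for r in pending:
            r.value = 0
        for row in self.data:          # one pass over the data
            for r in pending:
                r.value += row[r.column]
        self.n_loops += 1


data = [{"a": i, "b": 2 * i} for i in range(5)]

# Pattern from the question: touch each result right after booking it.
eager = ToyFrame(data)
for col in ("a", "b"):
    eager.book_sum(col).get()

# Recommended pattern: book everything first, then access.
lazy = ToyFrame(data)
results = [lazy.book_sum(col) for col in ("a", "b")]
values = [r.get() for r in results]

print("loops (access while booking):", eager.n_loops)  # 2
print("loops (book first, then access):", lazy.n_loops)  # 1
```

In the first pattern only one result is pending each time the loop runs, so the data is traversed once per result. In the second, the first access finds all results pending and fills them in a single pass.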

As a general rule of thumb, produce all the RDF results you need first, and then use them/access them/call methods on them.

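EDIT: note that in this case you can construct the model with the correct title in the first place (a ROOT histogram title string of the form "title;x-axis title;y-axis title" also sets the axis titles), so nothing needs to be called on the result before the event loop.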

Cheers,
Enrico