Iteration over a tree in pyroot - performance issue

Hello,

If I iterate over a tree like this (see the attached tree.py):

[code]alist = []
for event in tree:
    alist += [event.x]
print len(alist)[/code]

then my script takes (as measured with time python tree.py):

[code]choice = True
1000000

real 0m25.285s
user 0m25.402s
sys 0m1.300s[/code]

but if I "iterate" over the same tree like this:

tree.Draw("x","")

then it takes just a few seconds in total:

[code]choice = False
Info in TCanvas::MakeDefCanvas: created default TCanvas with name c1

real 0m3.550s
user 0m2.646s
sys 0m0.170s[/code]

The timing procedure is not ideal (it measures the whole script, not only the iteration), but it clearly shows the difference. In my real script, I observe that the first approach is 10-100 times slower than the second one (when I also apply some cuts etc.). I can imagine that the pythonized iteration over a tree takes some extra time, but is it really so significant (a factor of ~100 or even more)?

I really like the elegance of the first approach (compared to the TTree.Draw() method), but its performance makes it hard to use in my real case. Is there any way to avoid writing C++ code, or building cut strings to pass to the Draw() command?

Many thanks in advance for any help/suggestion/hint etc.

Cheers,
Jiri
tree_test.root (90.4 KB)
tree.py (978 Bytes)


Jiri,

although I really need to find the time one day to put some more work into the speedup, it is mostly a matter of getting rid of temporaries (and even then, a factor of 10-100 compared to a run in C++ is not all that surprising; you can't get rid of that unless you use something like PyPy). Code updates could be:

[code]from array import array

x = array('d', [0])
tree.SetBranchAddress("x", x)

alist, i = [], 0
while tree.GetEntry(i):
    i += 1
    alist.append(x[0])[/code]

For reference, 'x' isn't cached on the event, so the branch is looked up every time; SetBranchAddress() gets rid of that. Doing "+= []" creates a temporary list and forces an iteration, both of which are unnecessary; using append() gets rid of that. The looping structure over the tree is in Python itself; I should move that to C++. The above code is a factor of 4 faster.
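For anyone who wants to try the snippet as a standalone script, a minimal self-contained sketch (assuming the attached tree_test.root contains a TTree named "tree" with a double branch "x"; adjust the names to your file) could be:

[code]# Minimal sketch of the SetBranchAddress approach above.
# Assumes tree_test.root holds a TTree called "tree" with a double branch "x".
from array import array
import ROOT

f = ROOT.TFile.Open("tree_test.root")
tree = f.Get("tree")

x = array('d', [0.])
tree.SetBranchAddress("x", x)   # bind the branch to a fixed buffer once

alist, i = [], 0
while tree.GetEntry(i):         # GetEntry returns 0 past the last entry
    i += 1
    alist.append(x[0])          # read from the bound buffer, no attribute lookup
print len(alist)[/code]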

Cheers,
Wim

Hi Wim,

I see. Thanks for the tips. It helped me.

I've also managed to speed up the analysis of my real tree (> 600 branches) a little bit (a factor > 2) by "activating" only the branches which I really need (see root.cern.ch/root/html/TTree.html#TTree:Draw%1), something like:

[code]tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("mybranch1", 1)[/code]

It is still far behind the speed of a C++ run, but it may be "good enough" in many cases.
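For concreteness, a small sketch of how this combines with the default iteration (the branch name "mybranch1" is just a placeholder, as above):

[code]# Sketch: disable everything, then re-enable only what the loop reads.
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("mybranch1", 1)

values = []
for event in tree:              # only the enabled branch is read from disk
    values.append(event.mybranch1)
print len(values)[/code]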

Thank you again for your help.

Cheers,
Jiri

Hi,

I was going to post a request for just such a speedup, only to notice that this had already been discussed in this thread. Can I add my encouragement for your proposal of a more efficient TTree accessor that saves the branch address at the beginning of the loop? Apart from speed, the PyROOT way of using an accessor is so much nicer than the C++ way of having to set it all up by hand (or via code-bloating automatic generation in MakeSelector).

Another small, related improvement would be to give an accessor error (e.g. AttributeError) if the branch has been disabled with SetBranchStatus. If one uses SetBranchStatus to enable just those branches one needs, it is easy to introduce a subtle bug by using a variable but forgetting to enable it. It would be good to trap these bugs.

As your example indicates, it is possible to implement these improvements in Python, rather than at the C++ level. With attribute properties, it is even possible to maintain the accessor calling method. In case it is of interest, I attach a PySelectorBase class (see SetupTTreeClass at the end) that implements this (along with an example subclass). With this, my jobs run 50-75% faster. That should give an idea of the speedup (at a minimum) we could all get if this were improved in PyROOT.
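To make the idea concrete, here is a rough, hypothetical sketch (not the actual PySelectorBase.py code) of how a class property can expose a cached branch buffer so that event.x-style access keeps working without a per-call branch lookup; all names, and the double-only buffers, are illustrative assumptions:

# Rough sketch of a property-backed accessor; names are illustrative only.
from array import array
import ROOT

class CachedTree(object):
    """Expose selected branches of a TTree as plain attributes (sketch only)."""
    def __init__(self, tree, branch_names):
        self._tree = tree
        self._buffers = {}
        for name in branch_names:
            buf = array('d', [0.])                 # assumes double branches
            tree.SetBranchAddress(name, buf)
            self._buffers[name] = buf
            # properties live on the class, so a real implementation
            # (like SetupTTreeClass) would build a dedicated class per tree
            setattr(CachedTree, name, property(lambda self, b=buf: b[0]))

    def __iter__(self):
        i = 0
        while self._tree.GetEntry(i):
            i += 1
            yield self

# usage sketch:
# ct = CachedTree(ROOT.TFile.Open("tree_test.root").Get("tree"), ["x"])
# for event in ct:
#     event.x        # no [0] needed; the value is read from the bound buffer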

Regards,
Tim.
pyselector.py (1.39 KB)
PySelectorBase.py (8.79 KB)

Tim,

thanks … I've been preoccupied with PyPy/cppyy lately, taking the time that I'd otherwise spend on PyROOT (there, I'm after orders of magnitude of speed improvement :slight_smile: ).

I'm traveling for the next two weeks. I hope to get around to this after the travel.

Thanks,
Wim

[quote=“jprochaz”]
I've also managed to speed up the analysis of my real tree (> 600 branches) a little bit (a factor > 2) by "activating" only the branches which I really need (see root.cern.ch/root/html/TTree.html#TTree:Draw%1), something like:

tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("mybranch1", 1)[/quote]I've found that using branch.GetEntry(i) is even faster than that - just a bit. E.g.:

def Init(self, tree):
    self.tree = tree                     # keep a reference for Process()
    self.branches = []
    for var in ["x", "y", "z"]:
        self.branches.append(tree.GetBranch(var))

def Process(self, entry):
    for branch in self.branches:
        if branch.GetEntry(entry) <= 0: return 0
    xhist.Fill(self.tree.x)              # only the requested branches were read
    ...

In my case (a large ntuple where I am only reading a few of the variables), I got a 10% speedup of my program compared to using SetBranchStatus, which was already more than twice as fast as reading all branches.

Tim.

Hi Tim,

Sorry to come back to this old thread.

I think I found a method, based on yours, that improves the speed by a factor of 3-4.

We can assign the buffer as an attribute of self (or any temporary structure) and then access it explicitly with [0] if it's a single value. (That's the inconvenient part, but at least for accessing a vector an index is needed anyway.)

def Init(self, tree):
    # requires: import array
    self._branch = {}
    l_interested = ["x", "y", "z"]
    for _name in l_interested:
        br = tree.GetBranch(_name)
        n = 1
        if hasattr(tree, _name):
            val = getattr(tree, _name)
            cls_n = val.__class__.__name__
            # GetBranch() doesn't work well with vectors, so size the buffer
            # from the PyROOT buffer object if there is one
            if cls_n.find("Py") != -1 and cls_n.find("Buffer") != -1:
                n = len(val)
        if br:
            self._branch[_name] = br
            type_char = br.GetListOfLeaves().At(0).GetTypeName()
            tp = self._type_convert(type_char)

            setattr(self, _name, array.array(tp, [0 for i in xrange(n)]))
            tree.SetBranchAddress(_name, getattr(self, _name))

def Process(self, entry):
    for branch in self._branch.values():
        if branch.GetEntry(entry) <= 0: return 0
    xhist.Fill(self.x[0])

With the same code and only the reading method changed, the SetBranchAddress version takes about 15-20 seconds, while your method takes about 65-70 seconds.

You can find how type_char is converted to the Python array type code tp here:

https://github.com/rootpy/rootpy/blob/master/rootpy/tree/treetypes.py

The second argument for constructing BaseScalar.new() is the tp.

cheers,
Rongkun

Hi Rongkun,

Thanks for the nice improvement. If I understand your code (and remember mine!), I may have been doing something similar in my PySelectorBase.py class. Did you do your timing comparison with PySelectorBase.py, or with the branch.GetEntry() example above (that was just one of the optimisations)?

I guess your method is even faster because it sets the cached array directly as a class attribute, whereas PySelectorBase.py uses a setattr accessor function for the cached value. That's why I'd be especially interested to understand if the impressive speedup you obtained was due to this difference, or due to the improvement with respect to the default PyROOT accessor (which both our methods replace).

The main advantage of my code is that it doesn't require adding [0] to scalar variables. It would be nice to avoid that, mainly for compatibility with vanilla PyROOT. However, if the performance penalty is as large as you found, then I think it's worth it!

If we can maintain compatibility, I wonder if we can get these optimisations included into PyROOT, so they are available by default?

Thanks,
Tim.

Hello Adye,

Short reply:
I didn’t try your code since I’m more interested in the method you provided in the last reply. But I just read that it’s doing something similar.

I've also tried using setattr for all the branches at the start of each loop iteration, something like:
self.val = self.b_val[0]
And the time just increased 2-3 times for me (yesterday the numbers went from 15 seconds to 38 seconds), because the copying does take time. But it's still good that this copies only once per entry; if we use entry.val, I suspect a copy happens every time it's called.

Actually, I don't know what vanilla PyROOT is. I guess you can use __getitem__ directly instead of the [0] indexing, but yeah, with more typing… or add a wrapper like def r(val, i=0): return val.__getitem__(i) to save some typing. Also, in your PySelectorBase.py, I can see you also used this in MakeAttributeValue_CacheAddress? Does that part also work under vanilla PyROOT?

===================================
The following text is written in a somewhat convoluted way because I edited and summarized it at different times, sorry.

Here are the more detailed tests I did:
What I'm doing is using GetBranch() and GetEntry() for all of them like you do, but with SetBranchAddress() mixed in during Init().

In total I did four tests.

  1. Slowest:
     <SetBranchStatus() for the branches used>
     for entry in tree:
         entry.val

  2. Medium time consumption (your method):
     <GetBranch() for the branches used>
     for entry in xrange(tree.GetEntries()):
         for br in branches:
             br.GetEntry(entry)
         # I suspect this line costs time no matter whether we use SetBranchStatus
         # or get only the branches used, because it involves copying
         tree.val

  3. My method (a runnable sketch follows this list):
     <the code above that automatically sets up the array.array buffers>
     for entry in xrange(tree.GetEntries()):
         for br in branches:
             br.GetEntry(entry)
         # this is a bit inconvenient:
         self.val[0]
         # in case val is a vector (which is what one is expected to do anyway):
         self.val[1]

  4. Based on method 3, but after br.GetEntry(entry) I use setattr for the values inside the branches, to avoid calling [0] later. That is 2-3 times slower than method 3.
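For reference, a runnable sketch of method 3, with placeholder file, tree, and branch names:

# Sketch of method 3: GetBranch() + SetBranchAddress(), per-branch GetEntry().
# File, tree, and branch names below are placeholders.
from array import array
import ROOT

f = ROOT.TFile.Open("tree_test.root")
tree = f.Get("tree")

names = ["x"]                          # only the branches actually needed
branches, buffers = [], {}
for name in names:
    buf = array('d', [0.])             # assumes double branches
    tree.SetBranchAddress(name, buf)
    branches.append(tree.GetBranch(name))
    buffers[name] = buf

for entry in xrange(tree.GetEntries()):
    for br in branches:
        br.GetEntry(entry)             # read only the wanted branches
    x = buffers["x"][0]                # the scalar value sits at index 0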

The following is a time comparison between methods 2 and 3:

I'm running on 566346 events, reading 7 branches. The modules [apply_cut] and [fill_hist] are run once per event.
In the [apply_cut] module, I'm reading 1 or 2 of the variables (let's say 99% of the time it's 2).
In the [fill_hist] module, I'm reading 5 of the variables.
In the loop itself, I'm also reading 2 variables, so the rest of the time is spent on looping.

According to my time analysis, if I use your method (GetBranch) and still get the value via tree.val, filling the histograms takes ~50% of the time. The rest is mostly spent in the loop (retrieving variables, writing them to memory with setattr()). You can see this below:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 95.9731440544, 100.00% wrt to total time
Module [register_hist] spent time: 0.547441959381, 0.57% wrt to total time
Module [apply_cut] spent time: 16.7197921276, 17.42% wrt to total time
Module [fill_hist] spent time: 48.0654530525, 50.08% wrt to total time
Module [write] spent time: 0.037024974823, 0.04% wrt to total time
######## finish time analysis for [HistogramProcessor]

I'm curious about this behavior, since tree.x seems to call some wrapper and do a copy again and again, even for a variable that was already read… so the more you fill, the more time you spend. You can see that fill_hist takes ~3 times as long as apply_cut, while 5 branches / 2 branches = 2.5 ~ 3.

But if I use SetBranchAddress, I guess the variables are copied out only when the branch is read (GetEntry), so I guess that's why it saves more time. Time analysis follows:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 27.0103001595, 100.00% wrt to total time
Module [register_hist] spent time: 0.278868913651, 1.03% wrt to total time
Module [apply_cut] spent time: 1.43743085861, 5.32% wrt to total time
Module [fill_hist] spent time: 7.2391602993, 26.80% wrt to total time
Module [write] spent time: 0.0350120067596, 0.13% wrt to total time
######## finish time analysis for [HistogramProcessor]

Both versions of the code use the same histogram bookkeeping utility; only the branch reading was changed.

I did a second test here:

Now I'm filling 100 times more often, with each fill reading only 1 weight branch. It's a bit surprising that method 1 makes almost no difference compared to method 2, and that sort of confirms my guess that calling tree.val is the most time-consuming part, because it involves copying.

method 1 - SetBranchStatus:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 626.642473221, 100.00% wrt to total time
Module [write] spent time: 0.0350461006165, 0.01% wrt to total time
Module [apply_cut] spent time: 12.2305030823, 1.95% wrt to total time
Module [fill_hist] spent time: 588.592946291, 93.93% wrt to total time
######## finish time analysis for [HistogramProcessor]

method 2 - GetBranch:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 622.827071905, 100.00% wrt to total time
Module [register_hist] spent time: 0.320149183273, 0.05% wrt to total time
Module [apply_cut] spent time: 12.6592998505, 2.03% wrt to total time
Module [fill_hist] spent time: 583.468718529, 93.68% wrt to total time
Module [write] spent time: 0.0403349399567, 0.01% wrt to total time
######## finish time analysis for [HistogramProcessor]

method 3 - GetBranch & SetBranchAddress:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 162.829643965, 100.00% wrt to total time
Module [register_hist] spent time: 0.310490846634, 0.19% wrt to total time
Module [apply_cut] spent time: 1.28536581993, 0.79% wrt to total time
Module [fill_hist] spent time: 142.716985703, 87.65% wrt to total time
Module [write] spent time: 0.038684129715, 0.02% wrt to total time
######## finish time analysis for [HistogramProcessor]

Second round of tests:
Again I'm filling 100 times more often, but each fill now reads 2 branches.
The expectation for this test is that the time basically scales with the number of branches read (doubled) in methods 1 and 2, but stays almost unchanged in method 3. You can observe this below, too. That means TH1D::Fill costs a roughly constant time, small compared to copying the values.

method 1 - SetBranchStatus:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 1227.96466684, 100.00% wrt to total time
Module [write] spent time: 0.0259609222412, 0.00% wrt to total time
Module [apply_cut] spent time: 12.246420145, 1.00% wrt to total time
Module [fill_hist] spent time: 1190.05335879, 96.91% wrt to total time
######## finish time analysis for [HistogramProcessor]

method 2 - GetBranch:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 1208.65821695, 100.00% wrt to total time
Module [register_hist] spent time: 0.21132683754, 0.02% wrt to total time
Module [apply_cut] spent time: 12.217356205, 1.01% wrt to total time
Module [fill_hist] spent time: 1170.8909297, 96.88% wrt to total time
Module [write] spent time: 0.032518863678, 0.00% wrt to total time
######## finish time analysis for [HistogramProcessor]

method 3 - GetBranch & SetBranchAddress:

######## start my time analysis for [HistogramProcessor]
Module [total] spent time: 159.838068008, 100.00% wrt to total time
Module [register_hist] spent time: 0.220459938049, 0.14% wrt to total time
Module [apply_cut] spent time: 1.2727534771, 0.80% wrt to total time
Module [fill_hist] spent time: 141.467209578, 88.51% wrt to total time
Module [write] spent time: 0.0273330211639, 0.02% wrt to total time
######## finish time analysis for [HistogramProcessor]

Cheers,
Rongkun

Hi,

There is also ROOT's (currently experimental) TDataFrame, see https://root.cern.ch/doc/v610/classROOT_1_1Experimental_1_1TDataFrame.html, which looks very promising to me and may provide several improvements over plain TTree (e.g., parallelization).

I have, however, no experience with it yet - especially with how it interacts with Python. There are also some tutorials for Python: https://root.cern.ch/doc/master/group__tutorial__tdataframe.html

I very much like the idea that the C++ version of TDataFrame uses standard C++ functions to define cuts instead of strings. In the Python tutorials, however, I only see strings being passed to filter events, not Python functions (which would be really nice, but probably at the cost of a performance penalty; or maybe it is just not implemented yet).
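For what it's worth, a minimal sketch of using it from Python with a string-based filter (assuming the ROOT 6.10 class location and a TTree named "tree" with a branch "x" in tree_test.root):

[code]# Minimal TDataFrame sketch with a string-based filter; names are assumptions.
import ROOT

df = ROOT.ROOT.Experimental.TDataFrame("tree", "tree_test.root")
hist = df.Filter("x > 0").Histo1D("x")   # cut and histogram defined via strings
h = hist.GetValue()                      # triggers the event loop
h.Draw()[/code]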

Cheers,
Jiri

Hi Rongkun,

Thanks for all the careful timing results. I hope to be able to test your technique when I next have to read large ntuples from PyROOT. In the meantime, I have a little more feedback:

Sorry, I just meant what you get by default with import ROOT when iterating over a tree. There's a default attribute for each variable that loads the current value, e.g. tree.x. This is what you replace with setattr().

Yes, I used a wrapper, but it should be more efficient than that. My wrapper (attribute accessor function) returns the value directly from the array. It is implemented with a closure (a type of nested function), something like this:

tp = self._type_convert(type_char)
ar = array.array(tp, [0])
def mygetf(self, ar=ar):      # the default argument binds this branch's buffer
    return ar[0]
# install as a property on the (new-style) class, so self.<name>
# returns the cached value without the [0]
setattr(self.__class__, _name, property(mygetf))
tree.SetBranchAddress(_name, ar)

(where I tried to adapt this to your example, but it is not tested). That should allow scalar access without the [0]. Of course, you can use your original code for vector types. If you are interested, you could see how much of a performance hit you get doing this in your code, compared to the [0] access with your fastest method.
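As a usage sketch, assuming the property is installed in Init() for a branch named "x" and the branches are collected in a self._branch dict as in your Init(), the event loop then becomes:

# Usage sketch (assumed names: self._branch dict, xhist histogram, property "x").
for entry in xrange(tree.GetEntries()):
    for br in self._branch.values():
        br.GetEntry(entry)
    xhist.Fill(self.x)        # property access instead of self.x[0]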

Yes, my code works without any extra libraries. In particular, it has its own (simple-minded) implementation of what _type_convert does, instead of using rootpy.

Thanks,
Tim.

Hi Adye

I believe this setattr() line will add some time, because it involves copying and it has to be done for each entry. Otherwise, everything I did was the same as yours.

About rootpy and vanilla PyROOT: there must have been some misunderstanding in my previous reply.
I also only use import ROOT!

Here's my implementation of type_convert, similar to your dictionary:

def type_convert(n):
    """
    Convert a ROOT type name to a Python array type code.
    """
    tp = ""
    if n == "Int_t":
        tp = 'i'
    elif n == "ULong64_t":
        tp = 'L'
    elif n == "Float_t":
        tp = 'f'
    elif n == "Double_t":
        tp = 'd'
    elif n == "Bool_t":
        tp = 'i'
    return tp
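
For comparison, an equivalent dictionary-based version (just a sketch, using exactly the same mapping as above):

# Same mapping as type_convert(), expressed as a dict lookup.
_TYPE_CODES = {
    "Int_t":     'i',
    "ULong64_t": 'L',
    "Float_t":   'f',
    "Double_t":  'd',
    "Bool_t":    'i',
}

def type_convert(n):
    """Convert a ROOT type name to a Python array type code ('' if unknown)."""
    return _TYPE_CODES.get(n, "")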