Performance enhancing drug

The PyROOT tutorial shows how to access values in a TTree

with something along the lines of:

for i in xrange(entries) :
   t.GetEntry(i)
   x = t.ch

where t is the TTree object and ch is one of its leaves.

The thing is, the equivalent line is significantly faster in C++.
I think this is because of the SetBranchAddress function used in C++. If we use it in Python we have to give it a reference (which doesn’t really exist in Python). We can use the Python array module to spoof one, like so:

from array import array

ch=array('i', [0])
t.SetBranchAddress('ch', ch)

for i in xrange(entries) :
   t.GetEntry(i)
   x = ch

You can get a serious performance gain this way.

I’m seeing a 2x - 4x speed increase (possibly more) in my loop, depending on how many variables you’re optimising.
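For what it’s worth, the reason array.array can stand in for a C reference is that it owns a fixed, writable memory buffer that external code (like ROOT I/O via SetBranchAddress) can write into in place. A minimal ROOT-free sketch of that property, in modern Python 3 syntax (the thread’s own snippets are Python 2):

```python
from array import array

# an array('i', [0]) owns one C int worth of writable memory;
# SetBranchAddress-style code writes into that buffer directly
ch = array('i', [0])

# simulate an external writer poking the buffer (as ROOT I/O would)
mv = memoryview(ch)
mv[0] = 42

# the change is visible through the original object: no copy was made
print(ch[0])                 # 42
print(ch.buffer_info()[1])   # number of elements: 1
```

The key point is that ch never moves in memory, so an address handed out once stays valid for the whole loop.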

Karol.

Karol,

yes, at the moment the direct access doesn’t cache the array address, which slows things down considerably for multiple accesses (I’ve had problems updating the addresses properly when needed). That said, I don’t think you have to set up an array yourself: ‘x = t.ch’ followed by ‘x[0]’ etc. should already be a speed-up compared to ‘t.ch[0]’.

I’ll look into the details.

Cheers,
Wim

No, you’re right, I should have had x = ch[0] on that last line.

Karol,

ok, … two more things then: first, “x = t.ch[0]” involves two calls from Python, the member lookup and the getitem (sort of, anyway), as opposed to just one for “x = ch[0]” in the other case. That will always cost you something on the order of a factor of 2. And second, “t.ch[0]” is range-checked against the actual number of entries read into the array, whereas “ch[0]” is unbounded. The range check happens through a callback and may be expensive (I’ve never measured it directly, so I don’t have relative numbers).
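The two-calls-versus-one effect can be seen without ROOT at all. Here is a rough, ROOT-free sketch (Python 3; the Holder class is a made-up stand-in for the tree, not anything from PyROOT, with a property simulating the per-access member lookup cost):

```python
import timeit

class Holder:
    """Made-up stand-in for a tree-like object with costly attribute access."""
    def __init__(self):
        self._data = [1.0] * 10

    @property
    def ch(self):
        # property access simulates the per-access member lookup
        return self._data

t = Holder()

def uncached(n):
    # attribute lookup + getitem on every iteration ("x = t.ch[0]")
    total = 0.0
    for _ in range(n):
        total += t.ch[0]
    return total

def cached(n):
    # hoist the lookup out of the loop ("ch = t.ch" once, then "ch[0]")
    ch = t.ch
    total = 0.0
    for _ in range(n):
        total += ch[0]
    return total

n = 100000
assert uncached(n) == cached(n)  # same answer either way
t_un = timeit.timeit(lambda: uncached(n), number=3)
t_ca = timeit.timeit(lambda: cached(n), number=3)
print(t_un, t_ca)
```

On a typical interpreter the cached version comes out noticeably ahead, for exactly the reason described above: one dispatch per element instead of two.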

Actually, the only timing I ever did was on an example that was limited by disk I/O. :slight_smile:

I’ll improve the documentation and then think about different ways/styles of using TTrees, and how to support each. Maybe something like the ability to switch off range checking in an “optimised” mode or so.

Cheers,
Wim

Hello,
When using this trick to speed up my loop, I get this error:
hist1.Fill(tuple.Event[0])
TypeError: unsubscriptable object

I have this code before that:

from array import array

Event = array('i', [0])
f = TFile("temp.root")
tuple = f.Get("1")
tuple.SetBranchAddress('Event', Event)
nentries = tuple.GetEntries()

for i in range(nentries):
   tuple.GetEntry(i)
   hist1.Fill(tuple.Event[0])
What am I doing wrong ?

Thanks in advance.

S POSS

Just get rid of the tuple. bit in your last line, i.e.,

hist1.Fill(Event[0])

Hi,
First of all, thanks for the hint… But I just found that this does not improve the execution time much (less than 1 second out of 16 seconds)… Using PAW, this used to be faster… Is there any other hint to make it faster?

Thanks in advance.

S POSS

Hi Stephane,

the proper comparison would be with PyPAW; not sure that was ever written, though. But as you probably understood from the postings, this inefficiency is a property of the Python layer. So if you want it really fast, you should write this piece of code in C++.

Cheers, Axel.

Thanks, I’ll have a look at the C++ version…

Cheers,

S POSS

16 seconds is your total run time? Well, if that’s the case it’s not the loop that’s the speed problem, it’s the start-up time. It takes time to load the ROOT libraries, read the nTuple, etc.

Here’s the time difference I get for an nTuple with 10^7 entries

$ time ./foo.py

real 3m55.541s
user 3m31.483s
sys 0m28.026s

$ time ./foo.py

real 1m46.928s
user 1m25.169s
sys 0m23.768s
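One way to separate the two costs is to time the loop alone inside the script, rather than the whole process with the shell’s `time` (which includes library loading and file opening). A minimal ROOT-free sketch of the idea, with a placeholder loop body standing in for the GetEntry() loop:

```python
import time

def run_loop(entries):
    # placeholder for the GetEntry() loop body
    total = 0
    for i in range(entries):
        total += i
    return total

# everything above this point (imports, opening files, etc.) counts
# as start-up; only the loop itself is measured below
start = time.time()
result = run_loop(10**6)
elapsed = time.time() - start
print("loop time: %.3fs" % elapsed)
```

If the in-script loop time is small compared to the wall-clock total, the start-up cost is the real bottleneck and no amount of loop tuning will help.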

Axel,

there is inefficiency compared to compiled code, of course, but the layer is doing additional (IMO useful) work, such as making sure you can’t access more objects than have actually been read from file. As such, I’ve inverted the defaults relative to C++ (or PAW, for that matter), where you don’t get data verification unless you code it in yourself.

That said, I do know I can make the layer more efficient than it is today. :slight_smile:

Cheers,
Wim

Hi Wim,

I really did not want to criticize any of your efforts with PyROOT (and I thought I put it neutrally enough so that you don’t feel like you have to defend anything). Sorry if I did. We all know how useful PyROOT is, given the problems and aversions many people have with C++.

I just tried to point out that ROOT is C++, and python has to interface with that. So while you might end up with a more efficient interpreter than CINT (and I bet that thanks to your optimizations in some areas PyROOT already is more efficient), nothing will top the speed of a compiled C++ macro. So this is what PAW should be compared with - the fastest version immediately available at the prompt.

Cheers, Axel.

Axel,

no problem; but this week there’ve been several discussions in ATLAS along similar lines, so I felt like pointing it out.

As well: [quote]So while you might end up with a more efficient interpreter than CINT[/quote]Nothing to do with optimizations: it’s the difference in language. Python will always beat a single dispatch over one in interpreted C++. The latter has to deal with typedefs, macros, global operator overloads, and a sophisticated scheme of scopes, all of which can totally change the actual meaning of the same line of code; Python’s evaluation, on the other hand, is much more straightforward. Conversely, Python’s loops are very slow, because no static optimization is possible (unless one resorts to list comprehensions, which I consider a syntactic horror :wink: ), so a tight (interpreted) C++ loop will always beat a Python one, given a sufficient number of iterations (a loop with a long body will equalize again).
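As an aside, the loop-versus-comprehension difference mentioned here is easy to demonstrate; a small ROOT-free sketch (Python 3), where the comprehension pushes the iteration into the interpreter’s C code:

```python
import timeit

data = list(range(10000))

def with_loop():
    # explicit loop: one bytecode round-trip per element
    out = []
    for x in data:
        out.append(x * 2)
    return out

def with_comprehension():
    # the iteration machinery runs in C inside the interpreter
    return [x * 2 for x in data]

assert with_loop() == with_comprehension()
print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(with_comprehension, number=100))
```

Both produce the same list; the comprehension is typically faster, which is exactly the kind of static shortcut a plain Python for-loop cannot get.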

[quote]nothing will top the speed of a compiled C++ macro[/quote]Sorry, I have to disagree with you here. :slight_smile: If you want to get top speed out of a ROOT application, you’ll have to read the manual and learn how to optimally use it. The choice of language is a secondary consideration.

And that goes for PAW also: http://paw.web.cern.ch/paw/cr.html.

Cheers,
Wim

Hi Wim,

I was just looking back at this. Interesting. Do you know if the same applies in ROOT?

Karol.

Karol,

naively, I’d say yes, because the fundamentals are still the same: don’t read what you don’t need, and don’t create too fine a granularity b/c it increases overhead.

There are differences, though: in a TTree, one tends to save classes, rather than builtin types as used to be the case in PAW. The creation and filling of a class will change the timing, and accessing a class is different than accessing a builtin b/c accessing a class does not necessarily mean that you’ll access all its data members (which themselves can be split, of course).

Cheers,
Wim

Hallo,

to follow up and clarify the performance-enhancing drugs, I did some tests myself with a simple tree.
In your discussion some things were a bit confusing, because Karol optimized the access to a single integer value while Wim was talking about an array.

In my example I have an array Jet_match2all_DeltaPt and an integer Jet_match2all_N which contains the array’s length.

Here is the code for my test (which does actually loop over all Jet_match2all_DeltaPt values and does nothing with the values):

#!/usr/bin/python

from ROOT import *
import time
from array import *

def method1():
	f=TFile("5200.root_100.root")
	t=f.Get("commonTree")
	
	t.LoadTree(0)
	
	start=time.time()
	for entry in xrange(1000):
		t.GetEntry(entry)
		nr=t.Jet_match2all_N
		for i in xrange(nr):
			v=t.Jet_match2all_DeltaPt[i]
	print "method 1",time.time()-start
	del(t)
	f.Close()
	del(f)
	
def method2():
	f=TFile("5200.root_100.root")
	t=f.Get("commonTree")
	
	Jet_match2all_DeltaPt=t.Jet_match2all_DeltaPt
	
	t.LoadTree(0)
	
	start=time.time()
	for entry in xrange(1000):
		t.GetEntry(entry)
		nr=t.Jet_match2all_N
		for i in xrange(nr):
			v=Jet_match2all_DeltaPt[i]
	print "method 2",time.time()-start
	del(t)
	f.Close()
	del(f)
	
def method3():
	f=TFile("5200.root_100.root")
	t=f.Get("commonTree")
	
	Jet_match2all_N=array("i",[0])
	t.SetBranchAddress("Jet_match2all_N",Jet_match2all_N)
	
	t.LoadTree(0)
	
	start=time.time()
	for entry in xrange(1000):
		t.GetEntry(entry)
		nr=Jet_match2all_N[0]
		for i in xrange(nr):
			v=t.Jet_match2all_DeltaPt[i]
	print "method 3",time.time()-start
	del(t)
	f.Close()
	del(f)
	
def method4():
	f=TFile("5200.root_100.root")
	t=f.Get("commonTree")
	
	Jet_match2all_N=array("i",[0])
	t.SetBranchAddress("Jet_match2all_N",Jet_match2all_N)
	
	Jet_match2all_DeltaPt=array("f",[0.0]*10)
	t.SetBranchAddress("Jet_match2all_DeltaPt",Jet_match2all_DeltaPt)
	
	t.LoadTree(0)
	
	start=time.time()
	for entry in xrange(1000):
		t.GetEntry(entry)
		nr=Jet_match2all_N[0]
		for i in xrange(nr):
			v=Jet_match2all_DeltaPt[i]
	print "method 4",time.time()-start
	del(t)
	f.Close()
	del(f)

method4()
method3()
method2()
method1()

method1 uses the approach described in the PyROOT tutorial, i.e. using the variables in the tree directly.

method2 caches the array t.Jet_match2all_DeltaPt (in reality it seems to be a “read-write buffer ptr”) in Jet_match2all_DeltaPt before the loop, after which it can be used like a normal array. The integer variable is not optimized, because you cannot “cache” its value before the loop (it would be set to zero and stay the constant value 0). This is what Wim described (but Karol’s variable seems to be an integer, so with Karol’s example that does not work).

method3 is what Karol described: you set the branch address to an array.array (to make clear that the array from the Python array module is meant). For a single integer you have to set the address to an integer array.array of length 1. Since Jet_match2all_N is now an array.array, you have to access it as Jet_match2all_N[0] (which is a bit of overhead for accessing a single integer). The array Jet_match2all_DeltaPt is accessed as in method1.

method4 sets the branch address for both the integer and the array. Here Jet_match2all_DeltaPt is an array.array of a fixed length that you must set before the loop, i.e. you must know the maximum possible number of array elements (here I set it to 10, which seems safe for me).

The execution time (measured only over the loop) on my computer is:
method 4 0.63s
method 3 1.35s
method 2 0.82s
method 1 1.32s

Here are the same numbers with psyco.profile() optimization, running each method twice:
method 4 2.02s
method 3 1.13s
method 2 0.75s
method 1 1.21s

method 4 0.54s
method 3 1.14s
method 2 0.74s
method 1 1.22s

Here are the same numbers with psyco.full() optimization, running each method twice:
method 4 3.31s
method 3 1.79s
method 2 0.75s
method 1 1.14s

method 4 0.56s
method 3 1.07s
method 2 0.75s
method 1 1.14s

The statistics are not great, but the numbers show that method4 is the fastest and that method1 cannot achieve the same performance, even when optimized with psyco. method2, which avoids the unnatural access to an integer via “[0]”, also gives an acceptable performance improvement.

I hope I could help those who are unsure about the best way to optimize their loop-reading code.

Cheers,
Duc

Hi there,
I followed your discussion, which focused on increasing speed via the allocation of memory for the contents of a given row of a tree.

It occurs to me, though, that Python is at its worst when iterating over the

for i in range(tree.GetEntries()):
    tree.GetEntry(i)

stuff.
Wouldn’t it be possible to somehow copy an entire column (or a subset of an entire column) to an array?

Something like that:

t1.Branch( 'mynum', n, 'myint/I' )
t1.Branch( 'myfloat', f, 'myfloat/F' )
.
. fill the tree
.
.

N = int(1e6)
n = scipy.zeros(N, dtype='int')
f = scipy.zeros(N, dtype='float')

.
.
and now, I don’t know, something like

f = tree.GetEntriesLikeILikeIt("mynum", index_min=0, index_max=N)

Just to get the idea across: I want to read chunks of rows for given columns, because Python’s iteration speed is so low.
Which is basically why there are array modules like scipy: they implement all those loops over arrays in C++.
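The gain being asked for can be illustrated with plain numpy (the scipy arrays in the code below are the same thing underneath). This is a hypothetical sketch where `column` just stands in for a branch’s worth of values already in memory, since no ROOT file is involved:

```python
import numpy as np

N = 1000
# stand-in for a branch's worth of values already read into memory
column = np.arange(N, dtype='double')

# element-by-element copy: one Python-level iteration per entry
XX_loop = np.zeros(N, dtype='double')
for i in range(N):
    XX_loop[i] = column[i]

# whole-column copy: a single call, with the loop running in C
XX_vec = np.array(column, copy=True)

assert np.array_equal(XX_loop, XX_vec)
```

Both produce the same array; the difference is purely where the loop runs, which is exactly the point about wanting a "read the whole column" call.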

Just to show you the speed difference.
Consider the following programs:

reading.py:

import ROOT 
import scipy, sys


def main(argv):
  InputFile = ROOT.TFile("test_long.root")


  InputFile.GetListOfKeys().Print()
  
  tree=InputFile.Get("T")


  N=tree.GetEntries()
  t=scipy.zeros(1,dtype='double')
  X=scipy.zeros(1,dtype='double')
  Xa=scipy.zeros(1,dtype='double')
  XX=scipy.zeros(N,dtype='double')
  tree.SetBranchAddress("t",t)
  tree.SetBranchAddress("X",X)
  tree.SetBranchAddress("Xa",Xa)
  for i in range(N):
    nb=tree.GetEntry(i)
    XX[i]=X[0]


  print XX
   
if __name__ == "__main__":
  sys.exit(main(sys.argv))

[ul]
time python reading.py
[/ul]
TKey Name = T, Title = A Root Tree, Cycle = 52
[-50.90060642 -52.41524891 -53.90341458 …, -81.35773006 -86.73374523
-88.61698295]

real 0m2.313s
user 0m2.004s
sys 0m0.316s

and the corresponding c++ program

reading.cxx:

#include <TTree.h>
#include <TFile.h>
#include <iostream>
using namespace std;
void tree1r(TTree *t1) 
{
  Double_t t, X, Xa;
  t1->SetBranchAddress("t",&t);
  t1->SetBranchAddress("X",&X);
  t1->SetBranchAddress("Xa",&Xa);
  //read all entries 
  Int_t nentries = (Int_t)t1->GetEntries();
  Double_t * XX = new Double_t[nentries];
  for (Int_t i=0; i<nentries; i++) {
    t1->GetEntry(i);
    XX[i]=X;
  }
}
int main()
{
  TFile *f = new TFile("test_long.root");
  TTree *t1 = (TTree*)f->Get("T");
  tree1r(t1);
  f->Close();
}

compiled with O2

[ul]
time ./reading
[/ul]

999775

real 0m0.512s
user 0m0.472s
sys 0m0.040s

Well, the C++ code is approximately 5 times faster than the Python one.
And this is linked, I think, to the looping speed of Python, though I am not sure about it…

Hi,

sorry for taking so long to reply, but I no longer had scipy installed, so I had to redo all of that.

If you want to read only part of the tree (say, one branch), then the slow bit will be “nb=tree.GetEntry(i)” and your alternative is “br=tree.GetBranch(‘X’)” together with “nb=br.GetEntry(i)”. I believe that is what you want when you say “only copy a row”?

Note there is no copying into python: when you do a SetBranchAddress(), there is no iteration in python for filling the data structure, as that is done directly in the memory pointed to and done by ROOT I/O. Reading a subset would more likely slow down the reading rather than speed it up.

Other than that, as mentioned earlier in this thread: tight loops in python will always significantly lag C++. A factor 5 for a tight loop ain’t so bad. :slight_smile:

Note that the final factor of speed (which is less than an order of magnitude) will have to come from Python compilation (see our CHEP talk: basically it is about teaching PyPy (think psyco) about PyROOT calls and resolving them to the underlying C++). That’s a mid-term project…

Cheers,
Wim

No, I think you misunderstood my “copying a row”.
The program I posted does exactly the same thing as the C++ program. Accessing only one single branch would not make a difference in speed compared to the C++ version. So I think the culprit is clearly the “tight loop”.

What I meant by “copying a column” is exactly that I do not want to iterate over the entries.
To clarify: in the program posted above I fill an array XX with numbers coming from the tree. In order to achieve that, I iterate over all the entries manually (i.e. I “copy” the “column”=“branch” by iterating over every “entry”=“row”):

for (Int_t i=0; i<nentries; i++) {
  t1->GetEntry(i);
  XX[i]=X;
}

Now, if there were a function in ROOT which somehow did this manual loop for me, I could write something like the hypothetical GetEntriesLikeILikeIt call from my earlier post, or something like that.

In this case, if such an operation were usable from PyROOT, the tight loop I used in my Python equivalent:

for i in range(N):
    nb = tree.GetEntry(i)
    XX[i] = X

would no longer need to be written explicitly. The ROOT function would take care of iterating over the entries, doing so hopefully at C++ speed, thereby enabling PyROOT to read entire “columns”=“branches” of trees into arrays at the same speed as C++ does.

Do you know of any such ROOT function?
By the way, I also experimented with SWIG and succeeded in including a C++-compiled version of such a “read a column by iterating over the entries and return an array” function in my Python program. It returns a C++ pointer to a Double_t at C++ speed.
That doesn’t help, though: if you want to use it as a Python array (or scipy array), you still need a tight loop in Python…
So is there no way of accelerating this?
Thanks for your response though…
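For what it’s worth, a C array exposed to Python (as the SWIG-wrapped Double_t pointer here would be) can usually be viewed as a numpy array without any Python-level copy loop, via the buffer protocol. A hedged sketch using ctypes in place of SWIG, since the actual wrapped pointer isn’t available here:

```python
import ctypes
import numpy as np

N = 5
# stand-in for a Double_t[N] buffer filled on the C++ side
c_buf = (ctypes.c_double * N)(*[float(i) for i in range(N)])

# view the C memory as a numpy array: no element-wise Python loop,
# and no copy unless you explicitly ask for one
arr = np.frombuffer(c_buf, dtype=np.float64)
print(arr.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0]

# writes through the view hit the underlying C buffer directly
arr[0] = 99.0
print(c_buf[0])  # 99.0
```

So once C++ code has filled a buffer, numpy can adopt it wholesale; the tight Python loop only reappears if you insist on touching the elements one by one.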