I have a custom histogram merging script, because hadd doesn’t quite cut it, for a variety of reasons*. I found that my script was extremely slow, and I believe I have located the reason:
In [1]: import ROOT as R
In [2]: lots_of_keys = [R.TKey() for i in xrange(1000)]
In [3]: time del lots_of_keys
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
In [5]: lots_of_keys = [R.TKey() for i in xrange(5000)]
In [6]: time del lots_of_keys
CPU times: user 0.21 s, sys: 0.00 s, total: 0.21 s
Wall time: 0.21 s
In [8]: lots_of_keys = [R.TKey() for i in xrange(10000)]
In [9]: time del lots_of_keys
CPU times: user 1.21 s, sys: 0.00 s, total: 1.21 s
Wall time: 1.21 s
In [11]: lots_of_keys = [R.TKey() for i in xrange(20000)]
In [12]: time del lots_of_keys
CPU times: user 5.82 s, sys: 0.00 s, total: 5.82 s
Wall time: 5.82 s
In [14]: lots_of_keys = [R.TKey() for i in xrange(40000)]
In [15]: time del lots_of_keys
CPU times: user 37.73 s, sys: 0.01 s, total: 37.74 s
Wall time: 37.75 s
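So it seems that when the keys get garbage collected, the time spent deleting them grows much faster than linearly: doubling from 20000 to 40000 keys takes the deletion from 5.82 s to 37.75 s!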
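For anyone who wants to repeat the measurement without retyping the session, here is a minimal sketch of a timing helper. The name `time_deletion` and its `factory` argument are my own invention for illustration, not part of ROOT or PyROOT. It is demonstrated with plain Python objects so that it runs without ROOT installed; passing `ROOT.TKey` as the factory should correspond to the session above, assuming PyROOT is importable.

```python
import gc
from timeit import default_timer


def time_deletion(factory, sizes):
    """Return {n: seconds needed to drop a list of n objects made by factory}."""
    results = {}
    for n in sizes:
        objs = [factory() for _ in range(n)]
        gc.collect()  # clear unrelated garbage so it is not charged to the timing
        start = default_timer()
        del objs  # last reference to the list goes away, so its elements are freed
        results[n] = default_timer() - start
    return results


if __name__ == "__main__":
    # Baseline with plain objects; swap `object` for ROOT.TKey to test the keys.
    for n, seconds in sorted(time_deletion(object, [1000, 5000, 10000]).items()):
        print("%6d objects: %.4f s" % (n, seconds))
```

Comparing the plain-object baseline with the TKey numbers would show whether the cost comes from Python's list teardown or from the objects' own destruction.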
Looking at:
root.cern.ch/root/html/src/TKey.cxx.html#gCb_UB
I see that TKey redefines its Delete method to delete the key from the current file. As such, TKey::Delete calls “fMotherDir->GetListOfKeys()->Remove(this);”, which I believe could be the origin of the quadratic behaviour when deleting lots of TKeys: if each removal has to search the list of keys, deleting N keys costs on the order of N² steps. If this hypothesis is true, then I need a way to delete these keys correctly without the undesired behaviour. If it is false, does anyone else have any idea what I’m doing wrong?
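To make the hypothesis concrete, here is a toy model in pure Python. It is not ROOT's code: `LinearScanList` and `comparisons_to_empty` are hypothetical names, and the removal order (newest entry first, so every search walks the whole remaining list) is a worst-case assumption. It only shows that a container whose remove searches from the front costs a quadratic number of comparisons to empty one element at a time.

```python
class LinearScanList(object):
    """Toy stand-in for a list whose remove() searches from the front."""

    def __init__(self, items):
        self.items = list(items)
        self.comparisons = 0  # identity checks performed so far

    def remove(self, item):
        for index, candidate in enumerate(self.items):
            self.comparisons += 1
            if candidate is item:
                del self.items[index]
                return


def comparisons_to_empty(n):
    """Remove n entries one at a time, newest first, and count comparisons."""
    entries = [object() for _ in range(n)]
    registry = LinearScanList(entries)
    for entry in reversed(entries):  # worst case: target is always at the end
        registry.remove(entry)
    return registry.comparisons


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print("%5d entries: %d comparisons" % (n, comparisons_to_empty(n)))
```

In this model, emptying a list of n entries takes n(n+1)/2 comparisons, so doubling n roughly quadruples the work. Removing the oldest entry first would instead be linear, so whether the real slowdown matches depends on the order in which the keys are actually destroyed.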
I have tried experimenting with ways of deleting the keys manually, but I haven’t found anything that works yet**. I was tempted to call SetOwnership(False) on the keys and then delete them on the C++ side, but I would rather not have extra C++ code floating around; I would prefer a self-contained script. Is there any way to delete these objects properly from pure Python?
* In particular, hadd doesn’t support merging lots of different types, like TParameter and THnSparse. Also, it assumes that all of the histograms you want to merge exist in the first input file, which is not true for my set of files.
** I tried using TClass::GetDelete, then obtaining a pointer to the buffer (which I think one has to do by parsing the repr?!), then calling it with ctypes, but I couldn’t get this to do anything other than segfault. I also wondered about using the TObject::Delete method on my TKey object, but I don’t know whether this makes sense or is possible from PyROOT.