I’m looking for a solution for using TRefs in a multithreaded environment with (many) parallel threads. The TRef class documentation states:
"To avoid a growing table of fObjects in TProcessID, in case, for example, one processes many events in a loop, it might be necessary to reset the ObjectNumber at the end of processing of one event." (ROOT: TRef Class Reference)
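For reference, in a single-threaded program that reset looks roughly like this (a sketch using the TProcessID object-count accessors):

#include "TProcessID.h"

// Sketch of the single-threaded reset pattern from the TRef docs:
// remember the global object count before the event and restore it
// afterwards, so the fObjects table does not keep growing.
void ProcessOneEvent()
{
   UInt_t saved = TProcessID::GetObjectCount();
   // ... create TRef-referenced objects and process the event ...
   TProcessID::SetObjectCount(saved);
}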
However, this is not easily possible because other threads have objects in that storage, too. As I see it, the only way would be to take a global mutex, wait for all threads to finish their current work, and then reset the count. This is, for obvious reasons, very inefficient, and I would like to avoid it.
If I leave this running, however, at some point the object reference table overflows and a new TProcessID is created. The normal table can only hold 255 process IDs (ROOT: core/base/src/TProcessID.cxx Source File).
At this point a TExMap takes over storing the IDs, and this is where it becomes tricky: with enough threads running, ROOT can actually run into problems, because it no longer expects to see objects from the TProcessID table but assumes they are all in the TExMap, and it fails in the recursive delete, here: ROOT: core/base/src/TProcessID.cxx Source File (l. 417)
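As I read TProcessID.cxx, the unique ID of a referenced object encodes where its process ID lives; a sketch of my understanding (not an official API):

#include "TObject.h"

// The top byte of the unique ID is the process-ID slot, the low 24 bits
// the object number; slot 0xff means the owning TProcessID must be
// looked up in the fgObjPIDs TExMap instead of the fixed table.
bool LivesInExMap(const TObject *obj)
{
   UInt_t uid     = obj->GetUniqueID();
   UInt_t pidSlot = (uid >> 24) & 0xff; // 0..254: fixed table of TProcessIDs
   return pidSlot == 0xff;              // 255: overflow, TExMap path
}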
So my question is: how can we safely reset the object count in a multithreaded environment without halting all threads?
Some additional information: all TObjects we create are thread-local and never shared among threads. In principle this greatly simplifies the situation, but in practice it does not, because there is no way for me to tell ROOT about it.
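To illustrate the usage pattern, here is a simplified sketch (thread count and event loop are placeholders, not our actual code):

#include <thread>
#include <vector>
#include "TObject.h"
#include "TRef.h"
#include "TROOT.h"

// Each worker creates, uses, and deletes its own referenced objects;
// nothing crosses thread boundaries.
void Worker()
{
   for (int event = 0; event < 1000000; ++event) {
      auto *obj = new TObject;
      TRef ref(obj); // registers obj with the process-global TProcessID
      // ... thread-local processing using ref ...
      delete obj;
      // Calling TProcessID::SetObjectCount() here would clobber the
      // object numbers of objects still alive in the other threads.
   }
}

int main()
{
   ROOT::EnableThreadSafety();
   std::vector<std::thread> pool;
   for (int i = 0; i < 16; ++i) pool.emplace_back(Worker);
   for (auto &t : pool) t.join();
   return 0;
}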
As one option: would it be possible to use the TExMap directly from the start? At least then we would not fall into the table-switching-invalidates-pointers trap.
So I guess there is no better way? What is the size limit on the TExMap system then?
How does it fail?
As soon as one object is allocated in the TExMap system, ROOT is led to believe that all objects are present there, and upon deletion it tries to remove them from that map (see the source location cited above). The problem is that it does not check whether the object actually resides in the TExMap but simply attempts to delete it, which leads to a segfault upon destruction of the object if it is in fact accounted for in the TProcessID table system:
#0 0x00007fffe6266180 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#1 0x00007fffe626ffbf in TProcessID::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#2 0x00007fffe62de540 in TObjArray::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#3 0x00007fffe62cfe4c in THashList::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#4 0x00007fffe6232822 in TROOT::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#5 0x00007fffe6264062 in TObject::~TObject() () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
So the minimal fix would be to add a check:

if (obj == GetObjectWithID(uid)) {
   if (fgObjPIDs) {
      // Only touch the TExMap if the uid marks this object as living in
      // the overflow (fgObjPIDs) path, i.e. pid slot 0xff.
      if ((obj->GetUniqueID() & 0xff000000) == 0xff000000) {
         ULong64_t hash = Void_Hash(obj);
         fgObjPIDs->Remove(hash, (Long64_t)obj);
      }
   }
   (*fObjects)[uid] = 0; // Avoid recalculation of fLast (compared to ->RemoveAt(uid))
}
but that of course doesn’t actually release the memory.
RecursiveRemove should not (ever) be attempting to delete the object, only to inform the TProcessID that the object is being deleted and should be removed from its internal state (the hash list).
#0 0x00007fffe6266180 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#1 0x00007fffe626ffbf in TProcessID::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
is “surprising” as I do not see any call to Error in TProcessID::RecursiveRemove … so I am missing something.
Sorry, my mistake: by deletion I was referring not to the actual object but to the hash list entry, I think. The actual objects are properly taken care of by our code. But RecursiveRemove attempts to delete a hash table entry that does not (and never did) exist, because the object lives in the other table where it was initially registered.
I’m not sure I can produce standalone code that reproduces this issue without investing significant time.
Do you think “calling TProcessID::AddProcessID() until we overflow” is a sensible approach to follow for production code?
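Concretely, I mean something like this (a sketch, not tested; it relies on the 255-slot limit cited above and would be called once from the main thread before any workers start):

#include "TProcessID.h"

// Exhaust the fixed process-ID table at startup so that every referenced
// object is created in the TExMap (fgObjPIDs) regime from the beginning,
// and the table switch never happens mid-run.
void PrefillProcessIDs()
{
   while (TProcessID::GetNProcessIDs() < 255) {
      TProcessID::AddProcessID();
   }
}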