I’m looking for a solution for using TRefs in a multithreaded environment with (many) parallel threads. The TRef class documentation states:
"To avoid a growing table of fObjects in TProcessID, in case, for example, one processes many events in a loop, it might be necessary to reset the ObjectNumber at the end of processing of one event." (ROOT: TRef Class Reference)
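For reference, in a single-threaded program that reset looks roughly like this (a sketch using the TProcessID object-count accessors):

#include "TProcessID.h"

// Sketch of the single-threaded reset pattern from the TRef docs:
// remember the global object count before the event and restore it
// afterwards, so the fObjects table does not keep growing.
void ProcessOneEvent()
{
   UInt_t saved = TProcessID::GetObjectCount();
   // ... create TRef-referenced objects and process the event ...
   TProcessID::SetObjectCount(saved);
}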
However, this is not easily possible because other threads have objects in that storage, too. As I see it, the only way would be to take a global mutex, wait for all threads to finish their current work, and then reset the count. This is, for obvious reasons, very inefficient, and I would like to avoid it.
If I leave this running, however, at some point the object reference table overflows and a new TProcessID is created. The normal table can only hold 255 process IDs (ROOT: core/base/src/TProcessID.cxx Source File).
At this point a TExMap takes over storing the IDs, and this is where it becomes tricky: with enough threads running, ROOT can actually run into problems, because it no longer expects to see objects from the TProcessID table but assumes they are all in the TExMap, and it fails in the recursive delete, here: ROOT: core/base/src/TProcessID.cxx Source File (l. 417)
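As I read TProcessID.cxx, the unique ID of a referenced object encodes where its process ID lives; a sketch of my understanding (not an official API):

#include "TObject.h"

// The top byte of the unique ID is the process-ID slot, the low 24 bits
// the object number; slot 0xff means the owning TProcessID must be
// looked up in the fgObjPIDs TExMap instead of the fixed table.
bool LivesInExMap(const TObject *obj)
{
   UInt_t uid     = obj->GetUniqueID();
   UInt_t pidSlot = (uid >> 24) & 0xff; // 0..254: fixed table of TProcessIDs
   return pidSlot == 0xff;              // 255: overflow, TExMap path
}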
So my question is: how can we safely reset the object count in a multithreaded environment without halting all threads?
Some additional information: all TObjects we create are thread-local and never shared among threads. In principle this greatly simplifies the situation, but in practice it does not, because there is no way for me to tell ROOT about it.
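To illustrate the usage pattern, here is a simplified sketch (thread count and event loop are placeholders, not our actual code):

#include <thread>
#include <vector>
#include "TObject.h"
#include "TRef.h"
#include "TROOT.h"

// Each worker creates, uses, and deletes its own referenced objects;
// nothing crosses thread boundaries.
void Worker()
{
   for (int event = 0; event < 1000000; ++event) {
      auto *obj = new TObject;
      TRef ref(obj); // registers obj with the process-global TProcessID
      // ... thread-local processing using ref ...
      delete obj;
      // Calling TProcessID::SetObjectCount() here would clobber the
      // object numbers of objects still alive in the other threads.
   }
}

int main()
{
   ROOT::EnableThreadSafety();
   std::vector<std::thread> pool;
   for (int i = 0; i < 16; ++i) pool.emplace_back(Worker);
   for (auto &t : pool) t.join();
   return 0;
}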
As one option: would it be possible to use the TExMap directly from the start? At least then we would not fall into the table-switching-invalidates-pointers trap.
So I guess there is no better way? What is the size limit on the TExMap system then?
How does it fail?
As soon as one object is allocated in the TExMap system, ROOT is led to believe that all objects are present there, and upon deletion it tries to remove them from that map (see the source location cited above). The problem is that it does not check whether the object actually resides in the TExMap but simply attempts to delete it, which leads to a segfault upon destruction of the object if it is in fact accounted for in the TProcessID table system:
#0 0x00007fffe6266180 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#1 0x00007fffe626ffbf in TProcessID::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#2 0x00007fffe62de540 in TObjArray::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#3 0x00007fffe62cfe4c in THashList::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#4 0x00007fffe6232822 in TROOT::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#5 0x00007fffe6264062 in TObject::~TObject() () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
So the minimal fix would be to add a check:

if (obj == GetObjectWithID(uid)) {
   if (fgObjPIDs) {
      // Only touch the TExMap if the uid marks this object as living in
      // the overflow (fgObjPIDs) path, i.e. pid slot 0xff.
      if ((obj->GetUniqueID() & 0xff000000) == 0xff000000) {
         ULong64_t hash = Void_Hash(obj);
         fgObjPIDs->Remove(hash, (Long64_t)obj);
      }
   }
   (*fObjects)[uid] = 0; // Avoid recalculation of fLast (compared to ->RemoveAt(uid))
}
but that of course doesn’t actually release the memory.
RecursiveRemove should not (ever) be attempting to delete the object, only to inform the TProcessID that the object is being deleted and should be removed from its internal state (the hash list).
#0 0x00007fffe6266180 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
#1 0x00007fffe626ffbf in TProcessID::RecursiveRemove(TObject*) () from /cvmfs/sft.cern.ch/lcg/views/LCG_96b/x86_64-centos7-clang8-opt/lib/libCore.so
is “surprising” as I do not see any call to Error in TProcessID::RecursiveRemove … so I am missing something.
Sorry, my mistake: by deletion I was referring not to the actual object but to the hash list entry, I think. The actual objects are properly taken care of by our code. But RecursiveRemove attempts to delete a hash table entry that does not (and never did) exist, because the object lives in the other table where it was initially registered.
I’m not sure I can produce standalone code that reproduces this issue without investing significant time.
Do you think “calling TProcessID::AddProcessID() until we overflow” is a sensible approach to follow for production code?
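Concretely, I mean something like this (a sketch, not tested; it relies on the 255-slot limit cited above and would be called once from the main thread before any workers start):

#include "TProcessID.h"

// Exhaust the fixed process-ID table at startup so that every referenced
// object is created in the TExMap (fgObjPIDs) regime from the beginning,
// and the table switch never happens mid-run.
void PrefillProcessIDs()
{
   while (TProcessID::GetNProcessIDs() < 255) {
      TProcessID::AddProcessID();
   }
}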