Handling multiple TGeoNavigators

hi all

i am trying to get a ROOT code to run with multiple threads in parallel. the code tracks particles and i want to divide the job between threads, but i am unsure how to deal with TGeoNavigator: should i create one for each thread (and is there any way to sync them?), given that multiple threads calling a method such as FindNextBoundary on a single navigator give a segmentation violation.

please advise
thanks

Hi,
You will need one navigator per thread, once the remaining part of the geometry is fully thread safe (probably next week). Calls via gGeoManager will be correctly redirected to the right navigator. Syncing is up to you.

You will notice TGeo being fully thread safe when test/stressGeometry is changed to work in multithreaded mode.
Cheers,

thanks for the reply. will these changes be available as patches for v5-30 or should i switch to trunk ?

as for the syncing part, i am having the following operations:

1. create threads
2. switch current navigator to a different navigator per thread
3. select a track path and reflect it in the navigator using UpdateNavigator
4. set the current point and direction for the navigator
5. check if track starts at boundary
6. FindNextBoundary for each nav
7. GetSafeDistance for each nav

now, i am doubtful whether a track should be reflected in ALL the navigators (step 3), or whether i can simply divide the tracks into groups and update only one group per navigator. will FindNextBoundary be affected by this?

thanks again.

Hi,
Trunk only.

  1. ok
  2. CREATE one navigator per thread WITHIN the threaded code
  3. No need: trace different tracks in different threads, and then you don’t have to care about updating any navigator state. Write your algorithm in the thread as you would single-threaded, just split the workload from the beginning and synchronize at the end.

But you will have to wait a bit for this to really work.
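
The advice above can be sketched roughly as follows. This is a minimal, ROOT-free illustration: `Track`, `Navigator` and `ComputeSteps` are hypothetical stand-ins (a toy one-dimensional geometry with boundaries at 0 and 10), not the TGeo API. It only shows the structure: one navigator object created inside the parallel region, the track loop divided among the threads, and a single synchronisation point at the end.

```cpp
// Sketch only: Track and Navigator are hypothetical stand-ins, not ROOT classes.
#include <cassert>
#include <vector>

struct Track {
    double x = 0.0;      // current position along the single axis of the toy geometry
    double dir = 1.0;    // direction of travel, +1 or -1
    double snext = -1.0; // distance to the next boundary, filled in by the worker
};

// Stand-in for a navigator: it holds mutable state (point, direction),
// which is why one instance must not be shared between threads.
struct Navigator {
    double point = 0.0;
    double dir = 1.0;
    // Toy geometry: boundaries sit at x = 0 and x = 10.
    double FindNextBoundary() const {
        assert(point >= 0.0 && point <= 10.0); // the point must lie inside the toy volume
        return dir > 0 ? 10.0 - point : point;
    }
};

void ComputeSteps(std::vector<Track>& tracks) {
    const long n = static_cast<long>(tracks.size());
#pragma omp parallel
    {
        Navigator nav; // created inside the threaded code: one per thread
#pragma omp for
        for (long i = 0; i < n; ++i) { // iterations are divided among the threads
            nav.point = tracks[i].x;
            nav.dir = tracks[i].dir;
            tracks[i].snext = nav.FindNextBoundary(); // each track is written by one thread only
        }
    } // implicit barrier at the end of the region: the only synchronisation
}
```

Built with `-fopenmp` the loop is shared between threads; without that flag the pragmas are ignored and the same code runs serially, which is a convenient way to compare results.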

Cheers,

hi

i did a trunk build and tried my code. it runs, but there seems to be a memory leak: the program uses up all the memory and gets killed. i suppose it’s because of AddNavigator, unless it frees the memory automatically. here’s the piece of code:

   Double_t pdir[3];
   Int_t itr, testVar = 0;
   Int_t randomnumber = random();
   Bool_t isOnBoundary = kFALSE;

#pragma omp parallel private(pdir, itr, isOnBoundary) num_threads(2)
   {
      TGeoNavigator *nav;
      nav = gGeoManager->AddNavigator();
      nav->SetOutside(kFALSE);

      for (itr = 0; itr < ntracks; itr++) {
         GeantTrack *track = gTracks[trackin[itr]];
         track->Direction(pdir);
         TGeoBranchArray *a = gCurrentBasket->GetBranchArray(track);
         a->UpdateNavigator(nav);
         nav->SetCurrentPoint(&track->xpos);
         nav->SetCurrentDirection(pdir);
         isOnBoundary = track->frombdr;
         Double_t pstep = TMath::Min(1.E20, track->pstep);
         nav->FindNextBoundary(pstep, "", isOnBoundary);
         track->safety = nav->GetSafeDistance();
         track->snext = TMath::Max(gTolerance, nav->GetStep());
      }
   }

also, is there any sync required after this? i ask because only UpdateNavigator or FindNextBoundary can depend on previous states, and i am storing neither of them.

thanks again

You are a bit faster than the development itself. Anyway, I attached an example, but there are still bottlenecks I am working on so the scaling with the number of threads is very bad.

Cheers,
firsttest2.C (8.56 KB)

hi again

thanks for the upload, it was really helpful.
i made an openmp version of the code and that too worked fine. but when i tried this with my original code, i got a memory leak that eventually killed the program. i have made one particular function multithreaded, and over the course of execution this function is called many times. each time, threads are created with calls to AddNavigator (which is perhaps leaking memory? i cannot be certain since i have not tried a profiler).

could you please clarify whether this is a design feature (i.e. i am not supposed to create short-lived threads inside functions) or a bug?

regards

PS: i am attaching a modified version of the code you uploaded, which crashes on my system (4GB RAM).
ft3omp.cxx (4.83 KB)

Hi,

You are supposed to reuse the navigators because they are big objects that you don’t want to create/delete every time. You should have one navigator per thread, created within the thread via the logic:
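
A minimal sketch of such get-or-create logic, assuming the idea is to ask the manager for the calling thread's navigator and to add one only when none exists yet. `Manager`, `Navigator` and `NavigatorForThisThread` are hypothetical stand-ins so that the example compiles without ROOT; in ROOT the corresponding calls are presumably `gGeoManager->GetCurrentNavigator()` and `gGeoManager->AddNavigator()`, which should be checked against your own build.

```cpp
// Sketch only: Manager and Navigator are hypothetical stand-ins that mimic a
// per-thread navigator registry; they are not the ROOT classes.
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Navigator {
    double step = 0.0; // placeholder for the large per-thread navigation state
};

class Manager {
public:
    // Returns the navigator registered for the calling thread, or nullptr if none.
    Navigator* GetCurrentNavigator() const { return current_; }

    // Creates a navigator, registers it for the calling thread, keeps ownership.
    Navigator* AddNavigator() {
        std::lock_guard<std::mutex> lock(mutex_); // the owned list is shared by all threads
        owned_.push_back(std::make_unique<Navigator>());
        current_ = owned_.back().get();
        return current_;
    }

    // Total number of navigators created so far, over all threads.
    std::size_t GetNavigatorCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return owned_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Navigator>> owned_;
    static thread_local Navigator* current_; // one slot per thread (single manager assumed)
};
thread_local Navigator* Manager::current_ = nullptr;

// Call this inside the threaded code every time the multithreaded function runs.
Navigator* NavigatorForThisThread(Manager& mgr) {
    Navigator* nav = mgr.GetCurrentNavigator();
    if (!nav) nav = mgr.AddNavigator(); // only the first call in a given thread allocates
    assert(nav != nullptr);
    return nav;
}
```

Because the second call in the same thread finds the existing navigator, calling the multithreaded function many times does not allocate a new navigator on each call, provided the threading runtime keeps reusing the same worker threads.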

If you need to kill and then recreate a thread (which you may want to avoid), you will first have to synchronize all navigation threads and then call TGeoManager::ClearThreadsMap(). If you really want to clean up navigators in between (not advisable), you can do it via TGeoManager::ClearNavigators().

Some parts of the implementation may still have flaws, if you discover some please let me know. I am also doing tests on this now and things may improve in the process.

Cheers,

hi

i re-did some of the code but there still seems to be a problem. i did some investigation but i think i need help with this:

i create multiple threads in my function (as described in previous posts), and each of them calls TGeoNavigator::FindNextBoundary. but after some iterations the program hangs. i checked the status of the threads in gdb:

(gdb) info threads
  53 Thread 0x7fffe7e66700 (LWP 12692) __lll_lock_wait ()
    at …/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  52 Thread 0x7fffe8667700 (LWP 12691) __lll_lock_wait ()
    at …/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  51 Thread 0x7fffe8e68700 (LWP 12690) __lll_lock_wait ()
    at …/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  2 Thread 0x7fffe9669700 (LWP 12641) __lll_lock_wait ()
    at …/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
* 1 Thread 0x7ffff7fbe7e0 (LWP 12640) futex_wait (bar=0x1049fa50, state=0)
    at …/…/…/src/libgomp/config/linux/x86/futex.h:44

the backtrace of the threads reveals that thread 1 has completed all its iterations and is waiting, while the other threads are still on their first iteration waiting for the lock in GetThreadData :

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007fffefcfd5cf in _L_lock_1005 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fffefcfd42b in __pthread_mutex_lock (mutex=0x5b37b60) at pthread_mutex_lock.c:82
#3  0x00007ffff0ae6c3c in TPosixMutex::Lock (this=0x5b37b50) at /opt/alice/root/core/thread/src/TPosixMutex.cxx:75
#4  0x00007ffff0ae128d in TMutex::Lock (this=0x5b19030) at /opt/alice/root/core/thread/src/TMutex.cxx:48
#5  0x00007ffff0ae4122 in TThread::Lock () at /opt/alice/root/core/thread/src/TThread.cxx:698
#6  0x00007ffff6921796 in TGeoManager::GetThreadData (this=0xe5edd0) at /opt/alice/root/geom/geom/src/TGeoManager.cxx:353
#7  0x00007ffff6931bd1 in TGeoManager::GetDblBuffer (this=0xe5edd0, length=8) at /opt/alice/root/geom/geom/src/TGeoManager.cxx:3649
#8  0x00007ffff6978078 in TGeoPgon::DistFromInside (this=0xe6e660, point=0x7fffe9668c90, dir=0x7fffe9668c70, iact=3, step=5.0148226158261151, safe=0x7fffe9668cf8) at /opt/alice/root/geom/geom/src/TGeoPgon.cxx:351
#9  0x00007ffff694ea43 in TGeoNavigator::FindNextBoundary (this=0xafa28e0, stepmax=5.0148226158261151, path=0x415000 "", frombdr=true) at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:716

this happens when the Shape is ALIC and does not happen with TGeoTube.

can you please help me with what might be wrong here ?

thanks

Hi,

Looks like a deadlock. Can you attach your code ?

Regards,

hi

i have attached the file. the relevant function starts at line 1192. i use g++ -I$ROOTSYS/include `root-config --evelibs` -fopenmp -ggdb -w pg.cxx to compile/link
pg.cxx (78.5 KB)

Hi,

I am trying to make the same piece of code work (it is our new transport test prototype), and I am fighting with the same kind of locks. There is a fresh fix in the trunk for this one, in the TGeoBranchArray class. Please try with that and let me know how it goes. If you find something, I am interested.

Cheers,

hi

the trunk fix works. now there is no more deadlock, although i often (not always) get a segmentation fault in TGeoVoxelFinder. moreover, the faults occur at different points:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe9669700 (LWP 21270)]
0x00007ffff69cfcb3 in TGeoVoxelFinder::IntersectAndStore (this=0x3006e70, 
    array1=0x300a312 "", array2=0x300a700 " ", 
    array3=0x5300aaa0 <Address 0x5300aaa0 out of bounds>, tid=3)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1798
1798	      byte = array1[current_byte] & array2[current_byte] & array3[current_byte];
(gdb) bt
#0  0x00007ffff69cfcb3 in TGeoVoxelFinder::IntersectAndStore (this=0x3006e70, 
    array1=0x300a312 "", array2=0x300a700 " ", 
    array3=0x5300aaa0 <Address 0x5300aaa0 out of bounds>, tid=3)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1798
#1  0x00007ffff69ce6b5 in TGeoVoxelFinder::SortCrossedVoxels (this=0x3006e70, 
    point=0x7fffe9668c90, dir=0x7fffe9668c70, tid=3)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1355
#2  0x00007ffff6950279 in TGeoNavigator::FindNextDaughterBoundary (this=0x9a89c10, 
    point=0x7fffe9668c90, dir=0x7fffe9668c70, idaughter=@0x7fffe9668d3c, 
    compmatrix=false) at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:1029
#3  0x00007ffff694eb81 in TGeoNavigator::FindNextBoundary (this=0x9a89c10, 
    stepmax=322.89055489012873, path=0x414f40 "", frombdr=true)
    at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:731

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe8667700 (LWP 20420)]
0x00007ffff69ce35d in TGeoVoxelFinder::SortCrossedVoxels (this=0x2befaa0, 
    point=0x7fffe8666950, dir=0x7fffe8666a20, tid=6)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1295
1295	      td.fLimits[2] = (fZb[0]-point[2])*td.fInvdir[2];
(gdb) bt
#0  0x00007ffff69ce35d in TGeoVoxelFinder::SortCrossedVoxels (this=0x2befaa0, 
    point=0x7fffe8666950, dir=0x7fffe8666a20, tid=6)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1295
#1  0x00007ffff698980f in TGeoShapeAssembly::DistFromOutside (this=0x2befa00, 
    point=0x7fffe8666a40, dir=0x7fffe8666a20, iact=3, step=0.50528924349332627, 
    safe=0x0) at /opt/alice/root/geom/geom/src/TGeoShapeAssembly.cxx:290
#2  0x00007ffff69898e4 in TGeoShapeAssembly::DistFromOutside (this=0x2c8dd80, 
    point=0x7fffe8666b30, dir=0x7fffe8666b10, iact=3, step=6.6682673272774107, 
    safe=0x0) at /opt/alice/root/geom/geom/src/TGeoShapeAssembly.cxx:296
#3  0x00007ffff69503f6 in TGeoNavigator::FindNextDaughterBoundary (this=0xd9d4280, 
    point=0x7fffe8666c90, dir=0x7fffe8666c70, idaughter=@0x7fffe8666d3c, 
    compmatrix=false) at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:1038
#4  0x00007ffff694eb81 in TGeoNavigator::FindNextBoundary (this=0xd9d4280, 
    stepmax=6.6682673272774107, path=0x414f40 "", frombdr=true)
    at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:731
#5  0x0000000000410ec5 in GeantVolumeBasket::ComputeTransportLength (
    .omp_data_i=0x7fffffffd3e0) at pg.cxx:1228

0x00007ffff69cb777 in TGeoVoxelFinder::GetExtraZ (this=0x3006e70, islice=1079492608, 
    left=true, nextra=@0x7fffe8e67b90)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:595
595	      nextra = fExtraZ[fOEz[islice]];
(gdb) bt
#0  0x00007ffff69cb777 in TGeoVoxelFinder::GetExtraZ (this=0x3006e70, 
    islice=1079492608, left=true, nextra=@0x7fffe8e67b90)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:595
#1  0x00007ffff69cdadd in TGeoVoxelFinder::GetNextCandidates (this=0x3006e70, 
    point=0x7fffe8e67c90, ncheck=@0x7fffe8e67b90, tid=5)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1184
#2  0x00007ffff69cf1b7 in TGeoVoxelFinder::GetNextVoxel (this=0x3006e70, 
    point=0x7fffe8e67c90, ncheck=@0x7fffe8e67b90, tid=5)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1544
#3  0x00007ffff69506d0 in TGeoNavigator::FindNextDaughterBoundary (this=0xc4bb5b0, 
    point=0x7fffe8e67c90, dir=0x7fffe8e67c70, idaughter=@0x7fffe8e67d3c, 
    compmatrix=false) at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:1030
#4  0x00007ffff694eb81 in TGeoNavigator::FindNextBoundary (this=0xc4bb5b0, 
    stepmax=6.6682673272774107, path=0x414f40 "", frombdr=true)
    at /opt/alice/root/geom/geom/src/TGeoNavigator.cxx:731
#5  0x0000000000410ec5 in GeantVolumeBasket::ComputeTransportLength (
    .omp_data_i=0x7fffffffd3e0) at pg.cxx:1228

thanks

hi

any updates on this ? i am still getting segmentation faults in TGeoVoxelFinder :

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe701b700 (LWP 25909)]
0x00007ffff69cd87d in TGeoVoxelFinder::SortCrossedVoxels (this=0x2c9f6d0, 
    point=0x7fffe701aa60, dir=0x7fffe701ab10, tid=0)
    at /opt/alice/root/geom/geom/src/TGeoVoxelFinder.cxx:1240

i am using the latest trunk build
thanks

Hi,

No, I had no time to work on this lately - I will let you know when I have some updates.

Regards,

Hi,
This should be now fixed in the trunk.

Cheers,

hi
now it seems to be working fine - thanks.