Proof loading

Dear PROOF Experts,
I’m running an ATLAS analysis on D3PDs on a small PROOF cluster made of four 16-core machines. The installation has been in use for a long time and works fine with other analyses.
When testing this analysis, which requires a couple of .so libraries to be loaded, we run into the following problem: in local mode (no PROOF) and with PROOF-Lite the analysis works without problems. As soon as I try to run on the full cluster, the analysis gets stuck because of a segmentation violation while loading the first of the two libraries.

The libraries are loaded this way:

p = TProof::Open("lx30","workers=64");
p->SetParameter("PROOF_MaxSlavesPerNode", 100);
p->Load("/einstein2/edo/D3PDAna_rel16_p543_test/GoodRunsLists-00-00-91/StandAlone/libGoodRunsLists.so");
p->Load("/einstein2/edo/D3PDAna_rel16_p543_test/SUSYTools/StandAlone/libSUSYTools.so");
p->Load("D3PD.C+");
p->Process(dataset,"D3PD.C+");

Log file of the last master session:

[code]110531 15:51:10 11687 xpd-I: ProofServMgr::Create: srvtype = 2
15:52:01 16695 Mst-0 | Info in TXProofServ::HandleCache: loading macro libGoodRunsLists.so
15:52:01 16695 Mst-0 | *** Break ***: segmentation violation

| session: gorini.default.17848.status terminated by peer
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef3a450:lx32.le.infn.it:0.2 got called … fProof: 0x1ee75aa0, fSocket: 0x1ef3a820 (valid: 1)
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef3a450: proof: 0x1ee75aa0

| session: gorini.default.12868.status terminated by peer
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef38830:lx31.le.infn.it:0.1 got called … fProof: 0x1ee75aa0, fSocket: 0x1ef38c00 (valid: 1)
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef38830: proof: 0x1ee75aa0

| session: gorini.default.26882.status terminated by peer
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef3c540:lx33.le.infn.it:0.3 got called … fProof: 0x1ee75aa0, fSocket: 0x1ef3c960 (valid: 1)
15:52:01 16695 Mst-0 | Info in TXSlave::HandleError: 0x1ef3c540: proof: 0x1ee75aa0

===========================================================
There was a crash.
This is the entire stack trace of all threads:

Thread 7 (Thread 0x417d3940 (LWP 16696)):
#0 0x00000039fdecb696 in poll () from /lib64/libc.so.6
#1 0x00002afd7bf5f8d6 in XrdClientSock::RecvRaw (this=0x1edbde40,
buffer=0x1ef7a930, length=8, substreamid=-1, usedsubstreamid=0x1)
at XrdClientSock.cc:128
#2 0x00002afd7bf820fa in XrdClientPhyConnection::ReadRaw (this=0x1edbc5c0,
buf=0x1ef7a930, len=8, substreamid=-1, usedsubstreamid=0x417d292c)
at XrdClientPhyConnection.cc:362
#3 0x00002afd7bf7e213 in XrdClientMessage::ReadRaw (this=0x1ef7a8f0,
phy=0x1edbc5c0) at XrdClientMessage.cc:152
#4 0x00002afd7bf8140f in XrdClientPhyConnection::BuildMessage (
this=0x1edbc5c0, IgnoreTimeouts=true, Enqueue=true)
at XrdClientPhyConnection.cc:443
#5 0x00002afd7bf86226 in SocketReaderThread (arg=0x1edbc5c0,
thr=) at XrdClientPhyConnection.cc:61
#6 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#7 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#8 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x40aa8940 (LWP 16704)):
#0 0x00000039fde9a541 in nanosleep () from /lib64/libc.so.6
#1 0x00000039fde9a364 in sleep () from /lib64/libc.so.6
#2 0x00002afd7bf79743 in GarbageCollectorThread (arg=0x1ef1d3c0,
thr=) at XrdClientConnMgr.cc:73
#3 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#4 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#5 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x421d4940 (LWP 16705)):
#0 0x00000039fdecb696 in poll () from /lib64/libc.so.6
#1 0x00002afd7bf5f8d6 in XrdClientSock::RecvRaw (this=0x1ef21b70,
buffer=0x1f05a9e0, length=8, substreamid=-1, usedsubstreamid=0x1)
at XrdClientSock.cc:128
#2 0x00002afd7bf820fa in XrdClientPhyConnection::ReadRaw (this=0x1ef20390,
buf=0x1f05a9e0, len=8, substreamid=-1, usedsubstreamid=0x421d392c)
at XrdClientPhyConnection.cc:362
#3 0x00002afd7bf7e213 in XrdClientMessage::ReadRaw (this=0x1f05a9a0,
phy=0x1ef20390) at XrdClientMessage.cc:152
#4 0x00002afd7bf8140f in XrdClientPhyConnection::BuildMessage (
this=0x1ef20390, IgnoreTimeouts=true, Enqueue=true)
at XrdClientPhyConnection.cc:443
#5 0x00002afd7bf86226 in SocketReaderThread (arg=0x1ef20390,
thr=) at XrdClientPhyConnection.cc:61
#6 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#7 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#8 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x42bd5940 (LWP 16706)):
#0 0x00000039fdecb696 in poll () from /lib64/libc.so.6
#1 0x00002afd7bf5f8d6 in XrdClientSock::RecvRaw (this=0x1ef24660,
buffer=0x1ef7a7a0, length=8, substreamid=-1, usedsubstreamid=0x1)
at XrdClientSock.cc:128
#2 0x00002afd7bf820fa in XrdClientPhyConnection::ReadRaw (this=0x1ef22e60,
buf=0x1ef7a7a0, len=8, substreamid=-1, usedsubstreamid=0x42bd492c)
at XrdClientPhyConnection.cc:362
#3 0x00002afd7bf7e213 in XrdClientMessage::ReadRaw (this=0x1ef7a760,
phy=0x1ef22e60) at XrdClientMessage.cc:152
#4 0x00002afd7bf8140f in XrdClientPhyConnection::BuildMessage (
this=0x1ef22e60, IgnoreTimeouts=true, Enqueue=true)
at XrdClientPhyConnection.cc:443
#5 0x00002afd7bf86226 in SocketReaderThread (arg=0x1ef22e60,
thr=) at XrdClientPhyConnection.cc:61
#6 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#7 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#8 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x435d6940 (LWP 16708)):
#0 0x00000039fdecb696 in poll () from /lib64/libc.so.6
#1 0x00002afd7bf5f8d6 in XrdClientSock::RecvRaw (this=0x1ef2fd00,
buffer=0x2aaaac0020a0, length=8, substreamid=-1, usedsubstreamid=0x1)
at XrdClientSock.cc:128
#2 0x00002afd7bf820fa in XrdClientPhyConnection::ReadRaw (this=0x1ef2e500,
buf=0x2aaaac0020a0, len=8, substreamid=-1, usedsubstreamid=0x435d592c)
at XrdClientPhyConnection.cc:362
#3 0x00002afd7bf7e213 in XrdClientMessage::ReadRaw (this=0x2aaaac002060,
phy=0x1ef2e500) at XrdClientMessage.cc:152
#4 0x00002afd7bf8140f in XrdClientPhyConnection::BuildMessage (
this=0x1ef2e500, IgnoreTimeouts=true, Enqueue=true)
at XrdClientPhyConnection.cc:443
#5 0x00002afd7bf86226 in SocketReaderThread (arg=0x1ef2e500,
thr=) at XrdClientPhyConnection.cc:61
#6 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#7 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#8 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x43fd7940 (LWP 16712)):
#0 0x00000039fdecb696 in poll () from /lib64/libc.so.6
#1 0x00002afd7bf5f8d6 in XrdClientSock::RecvRaw (this=0x1ef32c20,
buffer=0x1ef7d3b0, length=8, substreamid=-1, usedsubstreamid=0x1)
at XrdClientSock.cc:128
#2 0x00002afd7bf820fa in XrdClientPhyConnection::ReadRaw (this=0x1ef31420,
buf=0x1ef7d3b0, len=8, substreamid=-1, usedsubstreamid=0x43fd692c)
at XrdClientPhyConnection.cc:362
#3 0x00002afd7bf7e213 in XrdClientMessage::ReadRaw (this=0x1ef7d370,
phy=0x1ef31420) at XrdClientMessage.cc:152
#4 0x00002afd7bf8140f in XrdClientPhyConnection::BuildMessage (
this=0x1ef31420, IgnoreTimeouts=true, Enqueue=true)
at XrdClientPhyConnection.cc:443
#5 0x00002afd7bf86226 in SocketReaderThread (arg=0x1ef31420,
thr=) at XrdClientPhyConnection.cc:61
#6 0x00002afd7bd1ce81 in XrdSysThread_Xeq ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#7 0x00000039fea0673d in start_thread () from /lib64/libpthread.so.0
#8 0x00000039fded44bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2afd7a5693e0 (LWP 16695)):
#0 0x00000039fde9a14f in waitpid () from /lib64/libc.so.6
#1 0x00000039fde3c481 in do_system () from /lib64/libc.so.6
#2 0x00000039fde3c7d7 in system () from /lib64/libc.so.6
#3 0x00002afd78da5363 in TUnixSystem::StackTrace ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#4 0x00002afd78da1bea in TUnixSystem::DispatchSignals ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#5
#6 0x00000039fda0a492 in _dl_relocate_object ()
from /lib64/ld-linux-x86-64.so.2
#7 0x00000039fda10e21 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#8 0x00000039fda0cf56 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00000039fda1070c in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00000039fe600f9a in dlopen_doit () from /lib64/libdl.so.2
#11 0x00000039fda0cf56 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x00000039fe60150d in _dlerror_run () from /lib64/libdl.so.2
#13 0x00000039fe600f11 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#14 0x00002afd796c706e in G__dlopen ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCint.so
#15 0x00002afd796c77ab in G__shl_load ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCint.so
#16 0x00002afd79636618 in G__loadfile ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCint.so
#17 0x00002afd7967fed3 in G__reloadfile ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCint.so
#18 0x00002afd79684495 in G__process_cmd ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCint.so
#19 0x00002afd78d5e03f in TCint::ProcessLine ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#20 0x00002afd78cbf8bc in TApplication::ProcessLine ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#21 0x00002afd78d072d8 in TROOT::ProcessLine ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#22 0x00002afd7b9a40a6 in TProofServ::HandleCache ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProof.so
#23 0x00002afd7b9aa8a6 in TProofServ::HandleSocketInput ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProof.so
#24 0x00002afd7b995e48 in TProofServ::HandleSocketInput ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProof.so
#25 0x00002afd7bcebf33 in TXProofServ::HandleInput ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#26 0x00002afd7bcfb6f2 in TXSocketHandler::Notify ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#27 0x00002afd7bcfba6d in TXSocketHandler::ReadNotify ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libProofx.so
#28 0x00002afd78d9df93 in TUnixSystem::CheckDescriptors ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#29 0x00002afd78da2219 in TUnixSystem::DispatchOneEvent ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#30 0x00002afd78d1d2f5 in TSystem::InnerLoop ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#31 0x00002afd78d1d0aa in TSystem::Run ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#32 0x00002afd78cbf90f in TApplication::Run ()
from /cern/root_v5.27.04.Linux-slc5_amd64-gcc3.4/lib/libCore.so
#33 0x00000000004018fd in main ()

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.

#6 0x00000039fda0a492 in _dl_relocate_object ()
from /lib64/ld-linux-x86-64.so.2
#7 0x00000039fda10e21 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#8 0x00000039fda0cf56 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00000039fda1070c in _dl_open () from /lib64/ld-linux-x86-64.so.2
#10 0x00000039fe600f9a in dlopen_doit () from /lib64/libdl.so.2
#11 0x00000039fda0cf56 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x00000039fe60150d in _dlerror_run () from /lib64/libdl.so.2
#13 0x00000039fe600f11 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2

15:52:02 16695 Mst-0 | Error in TXProofServ::HandleException: exception triggered by signal: 1
15:53:02 16695 Mst-0 | Warning in TXSocket::Close: could not hold semaphore for async messages after 60 sec: closing anyhow (may give error messages)
15:54:02 16695 Mst-0 | Warning in TXSocket::Close: could not hold semaphore for async messages after 60 sec: closing anyhow (may give error messages)
[/code]

The ROOT version is 5.27/04 and it is the same on all machines. Any idea what I’m doing wrong?

thanks,

Edoardo

Dear Edoardo,

First of all, sorry for the late reply.

The method TProof::Load() is meant to load small things like macros or classes and does not work with libraries (see also root.cern.ch/drupal/content/load … o-or-class).
To load a library, or more generally a package, you can use a PAR file (see root.cern.ch/drupal/content/work … -par-files).
In your case it looks like the libraries are in directories seen by the workers ("/einstein2/" is a shared volume, right?); if this is true, then you can use TProof::Exec to load them:

root[] p->Exec("gSystem->Load(\"/einstein2/edo/D3PDAna_rel16_p543_test/GoodRunsLists-00-00-91/StandAlone/libGoodRunsLists.so\")");
root[] p->Exec("gSystem->Load(\"/einstein2/edo/D3PDAna_rel16_p543_test/SUSYTools/StandAlone/libSUSYTools.so\")");
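The escaped quotes in those Exec calls are easy to get wrong when typing long paths. As a minimal sketch, a small helper can build the command string from a plain path; makeLoadCommand is a hypothetical name used only for illustration and is not part of ROOT.

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): builds the command string that
// TProof::Exec would run on the workers to load a shared library.
// Double quotes and backslashes in the path are escaped so that the path
// remains a valid string literal inside the interpreted command.
std::string makeLoadCommand(const std::string& libPath)
{
    std::string escaped;
    for (char c : libPath) {
        if (c == '"' || c == '\\') escaped += '\\';
        escaped += c;
    }
    return "gSystem->Load(\"" + escaped + "\")";
}
```

In a ROOT session this could then be used as p->Exec(makeLoadCommand(path).c_str()), once for each of the two library paths, before calling p->Load("D3PD.C+").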

If this is not the case, then you have to make a simple PAR file, e.g. myLibs, with the two libraries in the main directory and, in myLibs/PROOF-INF/SETUP.C, the instructions to load them:

int SETUP()
{
     gSystem->Load("libGoodRunsLists.so");
     gSystem->Load("libSUSYTools.so");

     return 0;
}

The main directory ‘myLibs’ is also the place to put any include files that your selector may need.
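
To make the directory layout concrete, here is a rough standalone C++17 sketch that creates it: the libraries are copied into the package's main directory and a PROOF-INF/SETUP.C is written that loads each one by file name. The function makeParLayout and its arguments are hypothetical, chosen only for this illustration; doing the same by hand with mkdir and cp works just as well.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: lays out a PAR source directory named `pkg` under `base`:
//   <base>/<pkg>/<library files>       copies of the given libraries
//   <base>/<pkg>/PROOF-INF/SETUP.C     loads each library by file name
// Returns the path of the package directory.
fs::path makeParLayout(const fs::path& base, const std::string& pkg,
                       const std::vector<fs::path>& libs)
{
    const fs::path pkgDir = base / pkg;
    fs::create_directories(pkgDir / "PROOF-INF");

    std::ofstream setup(pkgDir / "PROOF-INF" / "SETUP.C");
    setup << "int SETUP()\n{\n";
    for (const fs::path& lib : libs) {
        // Copy the library next to PROOF-INF, replacing any older copy.
        fs::copy_file(lib, pkgDir / lib.filename(),
                      fs::copy_options::overwrite_existing);
        setup << "     gSystem->Load(\"" << lib.filename().string() << "\");\n";
    }
    setup << "\n     return 0;\n}\n";
    return pkgDir;
}
```

The resulting directory is then packed into a gzipped tarball called myLibs.par, which can be uploaded with p->UploadPackage("myLibs.par") and enabled with p->EnablePackage("myLibs") before processing.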

Hope it helps.

Gerri Ganis