Recurring communication failure

Hi,
I’m consistently hitting the following error: if I run the same query 5 times against our small PROOF cluster, it fails 4 times with the log below. It then hangs at the end, and the hang lasts close to 5 minutes (a timeout somewhere in the system).

I’m not sure how to debug this: the query sometimes works and sometimes fails. The server is running 5.32.02, and the client is running 5.30.01 (client on Windows, server on Linux).

Cheers,
Gordon.

Starting master: opening connection ...
Starting master: OK
Opening connections to workers: OK (36 workers)
Setting up worker servers: OK (36 workers)
PROOF set to parallel mode (36 workers)
xxxx.phys.washington.edu: stat: cannot stat `/phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/master-0-xxxx-1337406296-1779/ntuple_CollectionTree.h': No such file or directory
[PutFile] Total 0.02 MB |====================| 100.00 % [0.4 MB/s]
xxxx.phys.washington.edu: stat: cannot stat `/phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/master-0-xxxx-1337406296-1779/junk_macro_parsettree_CollectionTree.C': No such file or directory
[PutFile] Total 0.00 MB |====================| 100.00 % [0.0 MB/s]
Info in <TWinNTSystem::ACLiC>: creating shared library C:\Users\gwatts\AppData\Local\Temp\LINQToTTree\DumpingBasicInfo\tj4m2rzp.smc\query0_cxx.dll
2384453_cint.cxx
query0_cxx_ACLiC_dict.cxx
Creating library C:\Users\gwatts\AppData\Local\Temp\LINQToTTree\DumpingBasicInfo\tj4m2rzp.smc\query0_cxx.lib and object C:\Users\gwatts\AppData\Local\Temp\LINQToTTree\DumpingBasicInfo\tj4m2rzp.smc\query0_cxx.exp
22:45:10 1779 Mst-0 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 1779 Mst-0 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/master-0-xxxx-1337406296-1779/./query0_cxx.so
22:45:10 383 Wrk-0.3 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 383 Wrk-0.3 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.3-tev05-1337406298-383/./query0_cxx.so
22:45:10 23383 Wrk-0.5 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 23383 Wrk-0.5 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.5-tev07-1337406298-23383/./query0_cxx.so
22:45:10 5229 Wrk-0.2 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 5229 Wrk-0.2 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.2-tev04-1337406298-5229/./query0_cxx.so
22:45:10 3852 Wrk-0.1 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 3852 Wrk-0.1 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.1-tev02-1337406298-3852/./query0_cxx.so
22:45:10 26300 Wrk-0.0 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 26300 Wrk-0.0 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.0-tev01-1337406297-26300/./query0_cxx.so
22:45:10 22773 Wrk-0.4 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:10 22773 Wrk-0.4 | Info in <TUnixSystem::ACLiC>: creating shared library /phys/groups/tev/scratch4/users/proofbox/gwatts/session-xxxx-1337406296-1779/worker-0.4-tev06-1337406298-22773/./query0_cxx.so
22:45:41 26308 Wrk-0.8 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 5261 Wrk-0.18 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 3828 Wrk-0.13 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 23391 Wrk-0.6 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 356 Wrk-0.23 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 26332 Wrk-0.10 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 3844 Wrk-0.16 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 364 Wrk-0.27 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 23375 Wrk-0.33 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 5245 Wrk-0.19 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 22805 Wrk-0.28 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 26292 Wrk-0.12 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 3860 Wrk-0.15 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 425 Wrk-0.24 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 23407 Wrk-0.35 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 5221 Wrk-0.22 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 26324 Wrk-0.11 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 3868 Wrk-0.14 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 417 Wrk-0.25 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 23415 Wrk-0.34 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 5253 Wrk-0.20 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 26316 Wrk-0.9 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 3836 Wrk-0.17 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 22789 Wrk-0.29 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 23399 Wrk-0.7 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 374 Wrk-0.26 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 5237 Wrk-0.21 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 22781 Wrk-0.31 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:41 22765 Wrk-0.32 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
22:45:42 22797 Wrk-0.30 | Info in <TXProofServ::HandleCache>: loading macro query0.cxx+ ...
120518 22:45:38 001 Proofx-E: Conn::CheckResp: server [xxxx.phys.washington.edu:1093] did not return OK replying to last request
120518 22:45:38 001 Proofx-E: Conn::CheckErrorStatus: EXT: sending message to proofserv
120518 22:45:45 001 Proofx-E: Conn::CheckResp: server [xxxx.phys.washington.edu:1093] did not return OK replying to last request
120518 22:45:45 001 Proofx-E: Conn::CheckErrorStatus: error 3006: 'Invalid request: 0'
xxxx.phys.washington.edu: Invalid request: 0
Error in <TXSocket::SendRaw>: xxxx.phys.washington.edu: problems sending 8616 bytes to server
Warning in <TXSocket::SendStreamerInfos>: problems sending TStreamerInfo's ...