TProofOutputFile merging error

Hello,
I am using a TSelector and PROOFLite (5.22). I seem to have finally solved most of the issues that were fouling my program, save one that I encountered today.

I am using TProofOutputFile. When I run a test job (where the output tree size is ~1GB) everything runs smoothly. But when running over a full dataset (output size ~10GB) the program seems to run correctly until the files are to be merged: the merge never happens and no output file is produced. Looking at the log files I find the following errors:

worker-0.6.log -

SlaveTerminate
File pointer exists
Writing Tree
Closing fFile
fProofFile Print
leave SlaveTerminate
15:23:07 7261 Wrk-0.6 | *** Break ***: write on a pipe with no one to read it
15:23:07 7261 Wrk-0.6 | SysError in TUnixSystem::UnixSend: send (Broken pipe)
15:23:07 7261 Wrk-0.6 | SysError in TProofServLite::SendLogFile: error sending log file (Broken pipe)
15:23:07 7261 Wrk-0.6 | SysError in TUnixSystem::DispatchOneEvent: select: read error on 34 (Bad file descriptor)

from worker-0.5.log -
leave SlaveTerminate
15:23:34 7259 Wrk-0.5 | SysError in TProofServLite::SendLogFile: error sending log file (No such file or directory)
15:23:34 7259 Wrk-0.5 | SysError in TUnixSystem::DispatchOneEvent: select: read error on 34 (Bad file descriptor)

and the rest of the log files (6 more) -
leave SlaveTerminate
15:23:07 7263 Wrk-0.7 | Error in TProofServLite::HandleSocketInput: retrieving message from input socket
15:23:07 7263 Wrk-0.7 | Info in TProofServLite::Terminate: starting session termination operations ....
Terminate: termination operations ended: quitting!
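
When there are many worker logs, lines like the ones above can be pulled out by filtering on the message prefixes seen in these excerpts. The following is a minimal sketch in plain C++ (no ROOT required). The function name ErrorLines and the marker list are illustrative assumptions, not part of PROOF.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a ROOT/PROOF function): return the lines of a
// worker log that carry one of the error markers seen in this thread.
// "Error in" also matches "SysError in", so two markers are enough.
std::vector<std::string> ErrorLines(const std::string &logText)
{
   static const char *markers[] = {"*** Break ***", "Error in"};
   std::vector<std::string> hits;
   std::istringstream in(logText);
   std::string line;
   while (std::getline(in, line)) {
      for (const char *m : markers) {
         if (line.find(m) != std::string::npos) {
            hits.push_back(line);
            break; // record each line only once
         }
      }
   }
   return hits;
}
```

Feeding it the contents of each worker-*.log file in turn would list only the Break/Error/SysError lines and skip the Info messages.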

Can anyone shed some light on these error msgs?

Trevor

Dear Trevor,

These messages just indicate that the connections to those workers went down.

Did you get anything in the client window, either from the window log or by issuing TProof::ShowLog() in the shell?

Are the partial files created correctly in the worker sandboxes? You can get the location of the sandboxes (working dirs) from TProof::Print("a").

G. Ganis

One problem is that I can’t issue these commands from the ROOT interpreter, as I am kicked out to the shell after the completion of the processing. That is, the program does process all the events, and the client window shows no errors upon completion (and for a smaller subsample of events everything, including the merging, is fine). If I look at the final output of my session, I see something like:


Looking up for exact location of files: OK (336 files)
Looking up for exact location of files: OK (336 files)
Validating files: OK (336 files)
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
Output file: SimpleNtuple.root
[zcanada2] /canada/zcanada2a/stewartt/charm_eff $ <- simply kicked to the shell

where I am running with 8 (or 4 or 6, etc.) nodes, but one of the files isn’t created. I can look in the .proof directory at the logs (hence the error messages in the original post). Looking in the tmp directory (redirected using the TMPDIR environment variable to a 2TB drive, so plenty of space), I see the following:

-rw-r--r-- 1 stewartt zeus 396 May 21 00:55 ROOTMERGED-439654ca-4591-11de-8001-9c46a983beef.root
-rw-r--r-- 1 stewartt zeus 396 May 20 06:15 ROOTMERGED-e3516902-44f4-11de-8001-9c46a983beef.root
-rw-r--r-- 1 stewartt zeus 396 May 20 15:20 ROOTMERGED-fd500470-4540-11de-8001-9c46a983beef.root
-rw-r--r-- 1 stewartt zeus 0 May 20 01:04 proof-cache-lock-%canada%zcanada2a%stewartt%charm_eff%.proof%cache
-rw-r--r-- 1 stewartt zeus 0 May 20 00:58 proof-query-lock-zcanada2-1242773882-13695-%canada%zcanada2a%stewartt%charm_eff%.proof%canada-zcanada2a-stewartt-charm_eff%queries
-rw-r--r-- 1 stewartt zeus 0 May 20 03:46 proof-query-lock-zcanada2-1242783992-23002-%canada%zcanada2a%stewartt%charm_eff%.proof%canada-zcanada2a-stewartt-charm_eff%queries
-rw-r--r-- 1 stewartt zeus 0 May 20 13:25 proof-query-lock-zcanada2-1242818713-7192-%canada%zcanada2a%stewartt%charm_eff%.proof%canada-zcanada2a-stewartt-charm_eff%queries
-rw-r--r-- 1 stewartt zeus 0 May 20 13:25 proof-query-lock-zcanada2-1242818736-7239-%canada%zcanada2a%stewartt%charm_eff%.proof%canada-zcanada2a-stewartt-charm_eff%queries
-rw-r--r-- 1 stewartt zeus 0 May 20 22:58 proof-query-lock-zcanada2-1242853108-12135-%canada%zcanada2a%stewartt%charm_eff%.proof%canada-zcanada2a-stewartt-charm_eff%queries
srwxr-xr-x 1 stewartt zeus 0 May 20 00:58 prooflite-sockpath-zcanada2-1242773882-13695
srwxr-xr-x 1 stewartt zeus 0 May 20 03:46 prooflite-sockpath-zcanada2-1242783992-23002
srwxr-xr-x 1 stewartt zeus 0 May 20 13:25 prooflite-sockpath-zcanada2-1242818713-7192
srwxr-xr-x 1 stewartt zeus 0 May 20 13:25 prooflite-sockpath-zcanada2-1242818736-7239
srwxr-xr-x 1 stewartt zeus 0 May 20 22:58 prooflite-sockpath-zcanada2-1242853108-12135

this file being from my last attempt
May 21 00:55 ROOTMERGED-439654ca-4591-11de-8001-9c46a983beef.root

Thanks
Trevor

Just as an addendum: for worker-0.5 and worker-0.6, the sandboxes each contain a well-constructed ROOT file, which would have been merged (I assume) if the merging routine had begun.

worker-0.5/SimpleNtuple-f7be2dc0-4531-11de-8001-9c46a983beef.root
worker-0.6/SimpleNtuple-f7bb3f84-4531-11de-8001-9c46a983beef.root

Trevor

Dear Trevor,

Sorry for the late reply.
I will try to reproduce your problem with increasingly larger files and let you know.

G. Ganis

PS: you can get the logs of the previous session by restarting ROOT and executing

root [0] TProofLog *pl = TProof::Mgr("")->GetSessionLogs()