PROOF-Lite worker crashes when running a certain number of workers

mykhailoD · February 6, 2018, 1:40pm

Hello,

I have run into the following problem with code that I run with PROOF (ROOT 6.10/08) on a 4-core machine with HTT: with 2 workers it runs without problems, but if I increase the number of workers to 4, one of them crashes with this error:

Info in <TPacketizer::TPacketizer>: Initial number of workers: 4
Validating files: OK (1 files)
Info in <TProofLite::MarkBad>: 0 events	|==================>.| 93.75 % [38733.4 evts/s, 9.0 MB/s, time left: 2.6 s]]
 +++ Message from master at gem904qc8daq.cern.ch : marking gem904qc8daq:-1 (0.2) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)

 +++ Message from master at gem904qc8daq.cern.ch : marking gem904qc8daq:-1 (0.2) as bad
 +++ Reason: undefined message in TProof::CollectInputFrom(...)

 +++ Most likely your code crashed
 +++ Please check the session logs for error messages either using
 +++ the 'Show logs' button or executing
 +++
 +++ root [] TProof::Mgr("gem904qc8daq.cern.ch")->GetSessionLogs()->Display("*")

Same story if I further increase the number of workers.
The log for a failed worker stops after

14:23:12 21858 Wrk-0.2 | SvcMsg in <TProofPlayerSlave::CheckMemUsage>: Memory 4286388 virtual 4024668 resident event 0
14:23:12 21858 Wrk-0.2 | Info in <TEventIterTree::GetTrees>: the tree cache is in learning phase
14:23:13 21858 Wrk-0.2 | Info in <TProofServLite::RestartComputeTime>: compute time restarted after 0.683929 secs (100 entries)

This only happens if I supply a rather large file (~150 MB in this case) as input; with smaller files, 4 workers ran fine.

Do you have any suggestions as to what could cause such an issue?

Thanks in advance,
Mykhailo

ganis · February 8, 2018, 6:02pm

Dear Mykhailo,
The error message in the main session indicates that the worker gets desynchronized at the level of protocol messages. Why this happens is not easy to say. The worker does not seem to crash; at least there is no indication of it.
Some initial questions:
Does the problem happen always at the same relative moment, for example towards the end of the query?
If the job lasts long enough, can you check with top if there is any indication of the proofserv processes looping or going into large memory usage?
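
As an alternative to watching top by hand, the resident memory of a running process can be read programmatically. The following is only a minimal sketch, assuming a Linux machine where /proc/<pid>/status contains a "VmRSS:" line reported in kB. The function names parseVmRssKb and residentKbOfPid are hypothetical helpers written for this illustration and are not part of ROOT or PROOF.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: return the resident set size in kB found in a
// /proc/<pid>/status stream, or -1 if no usable "VmRSS:" line is present.
long parseVmRssKb(std::istream& status) {
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            if (fields >> kb) return kb;  // value is followed by the unit "kB"
            return -1;
        }
    }
    return -1;
}

// Hypothetical wrapper: read the status file of a given process ID.
// Assumes Linux; returns -1 if the process is gone or /proc is unavailable.
long residentKbOfPid(int pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    if (!status) return -1;
    return parseVmRssKb(status);
}
```

Calling residentKbOfPid periodically with the PID of each proofserv process (as listed by top) would show whether memory keeps growing during the query.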

A reproducer would be, of course, very welcome.

G Ganis

ganis · February 9, 2018, 9:00am

Looking again at your log messages, I see that the process was using 4024668 kB of resident memory, i.e. about 4 GB, which is quite large. How much memory does the machine have? One explanation could be that the process gets killed by the system because it uses too much memory. Please check your memory consumption.
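
To make the arithmetic explicit, here is a rough sketch of the estimate. It assumes that every worker reaches the same resident size as the one in the log (4024668 kB) and that the logged value is in units of 1024 bytes. The machine's physical memory is not stated in this thread, so the 8 GiB used in the usage note is purely a hypothetical figure, and the function names are made up for this illustration.

```cpp
// Convert a size in kB (as printed in the CheckMemUsage log line) to GiB.
double kbToGiB(long kb) { return kb / (1024.0 * 1024.0); }

// Estimated total resident memory in GiB, assuming every worker uses
// the same amount (an assumption, not something the log confirms).
double totalResidentGiB(long residentKbPerWorker, int nWorkers) {
    return kbToGiB(residentKbPerWorker) * nWorkers;
}

// True if the estimated total exceeds the physical memory ramGiB.
// ramGiB must be supplied by the user; it is unknown for this machine.
bool exceedsRam(long residentKbPerWorker, int nWorkers, double ramGiB) {
    return totalResidentGiB(residentKbPerWorker, nWorkers) > ramGiB;
}
```

With a hypothetical 8 GiB of RAM, exceedsRam(4024668, 2, 8.0) is false while exceedsRam(4024668, 4, 8.0) is true, which would be consistent with 2 workers succeeding and 4 failing. The real limit depends on the actual memory and swap of the machine.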

And since you are on 6.10, perhaps we could consider moving to TProcessExecutor, which has been proposed as the successor of PROOF-Lite, also with the goal of optimizing memory usage.

G Ganis

mykhailoD · February 13, 2018, 2:13pm

Hi,

Indeed, the memory consumption is huge and reaches up to 4 GB per process. With 4 workers the crash always happens at the same place; probably the process just gets killed. I will investigate where the leak could be… Could you please also advise me where I can find examples or a tutorial on TProcessExecutor? I’ve only found the class reference so far.

Thanks,
-m

ganis · February 20, 2018, 7:50am

Hi,

Sorry, for some reason I was not notified of your reply.
Have a look at the tutorial macros ‘tutorials/multicore/mp*.C’ and let me know how it goes; I can help you give it a try.

G Ganis
