Number of workers for PROOF-Lite

Hello,

I would like to run PROOF-Lite on the total number of cores available in the machine.
Yet, although ROOT finds the correct number of workers, it sets parallel mode with only half of them, so the benchmark runs with half the number of workers. Is there a reason for this? Is there a parameter I can change?

Cheers,
Makis

ps. root v5.34/05, gcc4.4.6

root [0] gEnv->SetValue("ProofLite.Sandbox","/project/atlas/valderan/proofbox3")
root [1] TProofBench pb("")
 +++ Starting PROOF-Lite with 64 workers +++
Opening connections to workers: OK (64 workers)
PROOF set to parallel mode (32 workers) (50 %)
Run description: PROOF at , 32 workers
Info in <TProofBench::SetOutFile>: using default output file: 'proofbench-a0016.mogon-lite-32w-20130327-1155.root'
root [2] pb.SetDebug(kTRUE)

As an addition to the above,

the number of workers in parallel mode fluctuates between runs. It ranges from 20% to 55% of the total. I am pretty sure that memory is not a problem and that I am the only user of the machine.

Cheers,
Makis

Hi,

Sorry for the late reply.
This is very strange.
I did not manage to reproduce the problem with the machines at my disposal (max 24 cores).
Does it occur also with just

root [] TProof::Open("")

?
Just to try to isolate where it comes from (whether in TProofBench or TProofLite).

G. Ganis

Hi,

here are my comments up to now.

  1. Starting the PROOF session with either TProofBench or TProofLite gives the same behavior. It opens the connection to all the workers, but when it sets up the worker servers it stops at a seemingly random fraction, between 20% and 55%, of the total number of workers.
  2. The time needed to set up one extra worker server (going from x out of 64 to x+1 out of 64) varies by up to 1 minute.
  3. I tested the same ROOT version on a machine with a total of 32 cores (presumably the same OS setup). There everything works as expected.

I would be happy to provide any more information you require.

Cheers,
Makis

Hi,

Your point 2 (the time needed to set up an extra worker server varying by up to 1 minute) may give a hint. This is very strange: usually setting up workers in PROOF-Lite is very fast.
But there is a callback on a Unix socket to the server socket opened by the client session: it looks like in your case a certain fraction of workers fails to call back, perhaps running into a timeout or similar.
The path for the Unix socket is by default created in the temporary directory /tmp: is this directory on a special device? Can you try, if possible, putting the socket on a different device by setting the ROOT-env variable ‘ProofLite.SockPathDir’ to another directory?

Which OS/Arch is this, btw?

Gerri
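
To rule out the directory itself, one can check independently of ROOT whether a Unix-domain socket can be created in it and connected to. The following is only a minimal sketch using plain POSIX calls; it does not reproduce what PROOF-Lite does internally, and the function name `canUseUnixSocketIn` and the probe file name are hypothetical.

```cpp
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Returns true if a Unix-domain stream socket can be bound inside `dir`
// and a second socket in the same process can connect to it.
// `dir` is expected to be an existing, writable directory.
bool canUseUnixSocketIn(const std::string& dir) {
    // Hypothetical probe file name, made unique with the process id.
    const std::string path = dir + "/plite-probe-" + std::to_string(getpid());

    sockaddr_un addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    if (path.size() >= sizeof(addr.sun_path)) return false;  // too long for sun_path
    std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);

    int server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0) return false;
    unlink(path.c_str());  // remove a stale probe file, if any

    bool ok = false;
    if (bind(server, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        listen(server, 1) == 0) {
        int client = socket(AF_UNIX, SOCK_STREAM, 0);
        if (client >= 0) {
            // The connection is queued in the listen backlog, so no accept() is needed here.
            ok = connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
            close(client);
        }
    }
    close(server);
    unlink(path.c_str());  // clean up the socket file
    return ok;
}
```

Calling `canUseUnixSocketIn("/tmp")` and then the same function on a candidate replacement directory shows whether either location rejects socket files outright. A positive result does not exclude timing problems when many workers call back at once.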

Hi,

I tried your advice with different paths for the SockPathDir. I don’t see any difference.
Regarding the OS/Arch question: it is a Scientific Linux 6 system, and the nodes have 4 sockets with AMD processors. root-config reports back linuxx8664gcc and linux.
Regarding the filesystems: the filesystems local to the worker nodes are ext4, and this is where /tmp lives. The filesystem that I want to read from and write to for the benchmark is GPFS.
Any more help I can provide?

Cheers,
Makis

Hi,

Thanks for the info.
Does it create logs for all the 64 processes?
If yes, is there anything interesting in the logs of the processes which are failing to setup?
You should find a ‘last-lite-session’ dir somewhere under ‘/project/atlas/valderan/proofbox3’ which contains the session files, including the logs.

Gerri

PS: in the meantime I’ve got access to an 80-core SLC6 machine: the time to start an 80-worker session is about 1 second; this is what I have always seen. We must definitely understand why your workers may take up to a minute to start …

Yes, it creates logs for all the processes.
The workers with the problem have the following in the log file

[quote]15:16:08 16154 Wrk-0.60 | Info in TProofServLite::Setup: fWorkDir: /project/atlas/valderan/proofbox95
15:17:47 16154 Wrk-0.60 | Error in TProofServLite::HandleSocketInput: retrieving message from input socket
15:17:47 16154 Wrk-0.60 | Info in TProofServLite::Terminate: starting session termination operations …
15:17:47 16154 Wrk-0.60 | Info in TProofServLite::Terminate: data directory ‘/project/atlas/valderan/proofbox95/data/0.60/0.60-a0404.mogon-1366290968-16154’ has been removed
Terminate: termination operations ended: quitting![/quote]

Thanks for your help up to now!
Makis

On closer inspection, there are three different classes of log files.

I have log files with 0 size.
I have log files with size ~540bytes, the ones I mentioned in my previous reply.
But there are also log files with only one line

I believe that the log files with zero length and those containing only one line are the ones where PROOF is not set up, while the files that I mentioned in my previous reply are the correct ones.
Sorry for the confusion.
Makis

Hi,

No confusion, by default you do not get much verbosity at beginning, just the working dir.
So the ones with one line are those OK.
The others seem to screw up the initial message exchange: the client does not receive the correct
message, the worker gets something unexpected (and unknown) after a while and terminates.

Can you repeat the exercise with more verbosity (I should have asked in the first round, sorry) to see if all the expected steps are done on worker side? Just open the session with

TProof::Open("", 0,0,4)

Gerri

Well, I ran with higher verbosity, as you suggested, but the outcome didn’t change. I didn’t get more information in the log files.

[quote=“ganis”]Hi,

No confusion, by default you do not get much verbosity at beginning, just the working dir.
So the ones with one line are those OK.
The others seem to screw up the initial message exchange: the client does not receive the correct
message, the worker gets something unexpected (and unknown) after a while and terminates.
[/quote]
I have to say that I disagree with you on this. To my understanding, the correct logs are the ones with 4 lines of output. Their number matches the number of workers that are set up in the PROOF session. (I started ROOT, started PROOF, and quit ROOT after the initialization.) The files with zero lines or one line should correspond to setup problems.

Cheers,
Makis

Hi,

Ok, I was wrong, sorry.
Then we have also to understand why you get the error message when you quit, but this comes after.
In the sandbox you should have symlinks ‘worker----.log’ to ‘worker-.log’ .
Do they exist for all the workers or only for those that started?
This is to understand if the process started (a pid was assigned) or not even that.

Gerri

Symlinks do not exist for the log files with zero length.
They do exist for the other log files.
Makis

I just got another class of log files. I cannot say if it comes from the verbosity level or from a coincidence. It is also without a pid.

[quote]17:58:54 43360 Wrk-0.9 | SysError in TUnixSystem::UnixUnixConnect: connect (No such file or directory)
17:58:54 43360 Wrk-0.9 | Error in TProofServLite::CreateServer: Failed to open connection to the client[/quote]

Any comments are more than welcome!
Makis

Hi,

So, it seems that there are two kinds of problems: i) some processes do not even start (they are run in a bash shell with gSystem->Exec()); ii) others start but are not able to connect back on the Unix socket (which is a special file on /tmp).

I’ll be offline from now through Monday.
I hope I will come up in the meantime with some ideas about what additional tests we could do.

Cheers, Gerri
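
The two failure modes above can be told apart from the worker logs alone. Below is a small sketch that sorts the text of one worker log into the classes discussed in this thread. The enum and function names are hypothetical, and the matched strings are simply the error messages quoted earlier in the thread, not an exhaustive list of what PROOF-Lite may print.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Classes of worker logs seen in this thread.
enum class WorkerLogClass {
    Empty,          // zero-length log: the process most likely never started
    ConnectFailed,  // the worker could not connect back on the Unix socket
    Other           // anything else; inspect the line count by hand
};

// Classify the full text of one worker log file.
WorkerLogClass classifyWorkerLog(const std::string& text) {
    if (text.empty()) return WorkerLogClass::Empty;
    if (text.find("TUnixSystem::UnixUnixConnect") != std::string::npos ||
        text.find("Failed to open connection to the client") != std::string::npos) {
        return WorkerLogClass::ConnectFailed;
    }
    return WorkerLogClass::Other;
}

// Count the lines in a log, including a last line without a trailing newline.
std::size_t countLines(const std::string& text) {
    std::size_t n = static_cast<std::size_t>(
        std::count(text.begin(), text.end(), '\n'));
    if (!text.empty() && text.back() != '\n') ++n;
    return n;
}
```

Reading each log in the session directory into a string and tallying the results per class gives a quick count of how many workers never started versus how many failed on the callback. `countLines` helps separate the one-line logs from the four-line ones within the `Other` class.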

Hi,

Here are my new comments.
1.0 I ran ROOT under strace and started TProof::Open(). In this case it seems to run normally: I get all the workers and can actually run the benchmark with the full number of workers on the node.
1.1 strace reports processes running in both 32-bit and 64-bit mode. I don’t know whether this is correct behavior.

2. There is a hard limit on the virtual memory: ulimit -v reports ~1.8GB. Real memory has the same limit. I allocate 1.8GBx64 of memory to run PROOF for the whole node. Could this influence PROOF?

Cheers,
Makis

Hi Makis,

The proofserv processes (the worker processes) are not very hungry, less than 100 MB each at start.
But the sum for 64 may go above 1.8 GB, so if for some reason the ulimit applies to the whole, it may have an impact.
But what do you mean exactly by "I allocate 1.8GBx64 of memory to run PROOF for the whole node"?

Can you try removing this 1.8 GB limit? Just to see if it is related …

Gerri
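
For reference, the value that `ulimit -v` prints in bash is the soft address-space limit (`RLIMIT_AS`), which is applied to each process separately and inherited by child processes. A minimal sketch for reading it from C++ follows; the helper names are hypothetical. Whether the batch system enforces an additional limit on the job as a whole is a separate question that this check cannot answer.

```cpp
#include <string>
#include <sys/resource.h>

// Format a resource limit in kB, the unit bash uses for `ulimit -v`.
std::string describeLimit(rlim_t value) {
    if (value == RLIM_INFINITY) return "unlimited";
    return std::to_string(static_cast<unsigned long long>(value / 1024)) + " kB";
}

// Soft limit on the address space of the calling process.
std::string addressSpaceSoftLimit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_AS, &lim) != 0) return "unknown";
    return describeLimit(lim.rlim_cur);
}
```

Printing `addressSpaceSoftLimit()` from inside the batch job, rather than from an interactive shell, shows the limit that the worker processes would actually inherit.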

Hi,

I asked the administrators for the change in the virtual memory limit. I am waiting for their answer.

As for your question: in the batch job submission I have to reserve memory, and that memory is 1.8GB per process. So I reserve 1.8GBx64, which is the most I can reserve on a 128GB system. Do you need any more clarification?

May I ask why I am able to get all the workers running under strace? If I understand correctly, the only effect of strace is to add some delay. Could it be that there is a timing problem?

Cheers,
Makis

Hi,

So, if I understand correctly, you are starting PROOF-Lite from a batch job. The worker processes are seen as children of the main process in such a case; I am not sure how the memory assignments work in such a case.
Did you mention this to your admin? What batch system is this?

For strace, yes, in principle it should just monitor the system calls, therefore slowing down things.
There is a timeout of 5 secs on the worker callbacks. You can increase that with

ProofLite.StartupTimeOut   timeout_in_seconds

Enter a negative number for no timeout.

Gerri

Thank you for your answer,

The batch system is Platform LSF.
I don’t have access to the batch system for the moment. I will try out your advice as soon as I have access.

Cheers,
Makis