[PoD] Good practices and progress bar

Hello guys,

I have correctly set up a PROOF-on-Demand (PoD) server and I am running my code on lxplus.
The strategy is to use a home-made compiled PROOF wrapper:

  • to connect to the PoD server
  • to upload my PAR archive
  • to load my TChain and process it (my selectors are in the PAR archive)

The setup script of my PAR archive loads an external library (home-made, already compiled, with the binaries stored in my home directory) and an internal library stored in the PAR archive itself (this one has to be compiled on each worker); a sketch of the whole workflow is given below.
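For reference, the wrapper boils down to something like the following sketch (all names below are placeholders: package, selector, tree name and file path are not my real ones, and the pod:// shortcut can be replaced by the connection string printed by pod-info -c):

[code]// Rough sketch of the wrapper logic; all names are placeholders.
#include "TProof.h"
#include "TChain.h"

void runOnPoD()
{
   // Connect to the PoD-started master; if the pod:// shortcut is not
   // available, pass the connection string printed by `pod-info -c`.
   TProof *proof = TProof::Open("pod://");
   if (!proof) return;

   // Ship the PAR archive and build/enable it on every worker.
   proof->UploadPackage("MyAnalysis.par");
   proof->EnablePackage("MyAnalysis");

   // Attach the chain to PROOF and run the selector shipped in the PAR.
   TChain chain("mytree");
   chain.Add("root://eosuser.cern.ch//eos/path/to/data_*.root");
   chain.SetProof();
   chain.Process("MySelector");
}
[/code]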

I have a subsample of the data here (238 files) and 128 workers started with pod-submit on lxbatch.

  1. First, the startup on batch is pretty slow (the time between connecting to the PoD server and the start of my last worker). Is it normal to spend 10 minutes just to see it running? I am not used to PROOF-on-Demand; until now I was only using PROOF-Lite, so I would like to know the good practices. With just PROOF-Lite and 8 CPUs it actually runs faster on my computer… I don’t really get it, and I’m pretty sure something is wrong!

  2. In the end I would like to process ~4630 files (~500 GB to 1 TB), so is it possible to see the progress bar as was possible with PROOF-Lite? That was extremely useful!

Thanks in advance to the PROOF experts for their answer :slight_smile:
Marco

OK, I noticed that my environment is not preserved. I think part of my problem might be that my workers do not get the same ROOT and gcc versions as my shell session.

Is there a way to preserve it?
I use my PAR archives on different clusters, not only at lxplus, and I also share them.
I don’t want to create an exception for each cluster!

Hello!

I solved the problem after a few hours spent looking at the PoD FAQ! It was an environment problem: my gcc version was being reset to the default one on lxplus.

I just created a file at ~/.PoD/user_worker_env.sh:

[code]#!/bin/bash
# Worker environment script picked up by PoD on every worker node.

# Use the same gcc and ROOT versions as the client session on lxplus.
source /afs/cern.ch/sw/lcg/external/gcc/4.9/x86_64-slc6-gcc49-opt/setup.sh
source /afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.36/x86_64-slc6-gcc49-opt/root/bin/thisroot.sh
export ROOTSYS="/afs/cern.ch/sw/lcg/app/releases/ROOT/5.34.36/x86_64-slc6-gcc49-opt/root/"
[/code]

After this I restarted the PoD server and the file was automatically uploaded to each worker!! (What a wonderful tool.)
The progress bar is of course there now, and my code compiles correctly!!
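To double-check that every worker really picked up the intended environment, something like this can be run from the open PROOF session (a minimal sketch using standard TProof calls; proof is the session pointer returned by TProof::Open):

[code]// Print the ROOT and gcc versions seen by each worker, to confirm
// that ~/.PoD/user_worker_env.sh was sourced everywhere.
proof->Exec("gSystem->Exec(\"root-config --version; gcc --version\")");

// List the packages uploaded/enabled on the cluster as well.
proof->ShowPackages();
[/code]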

NB: I will still gladly take any good practices you propose to help me improve the way I am using this tool :slight_smile:

Dear meyerma,

Sorry for the late reaction.
I am glad that you found the hints about the AFS tricks for PoD.
There may be others depending on how things go. It is also possible to start the analysis as soon as you get some workers up, instead of waiting for the whole set to be started by the batch system.
I would be interested in understanding how it goes for you in terms of performance on batch when reading from EOS.
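A minimal sketch of that early-start idea, assuming the standard pod-info tool and the TProof API (the 32-worker threshold and the function name are just examples):

[code]// Wait until at least a minimum number of PoD workers are online,
// then open the PROOF session instead of waiting for the full set.
#include "TSystem.h"
#include "TProof.h"

void openWhenReady(Int_t minWorkers = 32)
{
   Int_t nWorkers = 0;
   while (nWorkers < minWorkers) {
      // `pod-info -n` prints the number of PROOF workers currently online.
      nWorkers = gSystem->GetFromPipe("pod-info -n").Atoi();
      if (nWorkers < minWorkers) gSystem->Sleep(10000); // wait 10 s between checks
   }
   TProof::Open("pod://");
}
[/code]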

G Ganis

No problem, I will let you know as soon as I’m done with my production.
Thanks !

Hi Ganis,

I processed my trees and wanted to give you some feedback. I also ran into one problem, and I don’t know whether my use of PoD is somehow biased; I assume I did something wrong. I have attached the log files to this message!
NB: I had 2713 files to process (stored on EOS).
NB: I tried with different numbers of workers: 32, 124, 256.
NB: The attached log file corresponds to the run with 256 workers (pod-submit […] 256).

  1. First, I ran into some trouble with the last execution of PoD.
    The attached log files correspond to the case where a problem occurred, and I don’t think it is due to my code, because the first three PROOF executions were totally fine. (I launched it before going to bed.)
    I guess after some time some workers simply shut down, or maybe there was some communication trouble after too much running time (~4 h in total: about 1 h for each processing, mainly because of the startup time).

  2. I also ran with 32 and 124 workers, and the startup still takes forever; I don’t really know why.
    It looks like the step where my workers receive their data is pretty slow.
    Nevertheless I could reach 5 GB/s when running with 124 workers.
    Maybe it is because of fair-share mechanisms?

I noticed this error message each time:

(NB: I set agent_shutdown_if_idle_for_sec=6400)

  1. I tried to recover the failed fourth data processing (one processing per target I have to process) and I am currently running with 32 workers. It is about to finish; I will post the new log files soon.
    Proof.txt (407 KB)
    ProofPerf.root (334 Bytes)