Will job fail on this WN?

Dear Expert:

Sometimes PROOF reports errors like the following:

+++ Starting max 5 workers following the setting of PROOF_NWORKERS
Looking up for exact location of files: OK (851 files)
Validating files: OK (851 files)
0.33: caught exception triggered by signal '1' while processing dset:'physics', file:'root://valtical07.cern.ch//localdisk/xrootd/users/qing/mc12_p1067/user.qing.mc12_8TeV.107660.AlpgenJimmy_AUET2CTEQ6L1_ZmumuNp0.merge.NTUP_SMWZ.e1218_s1469_s1470_r3542_r3549_p1067_2LepSkim_v2/user.qing.000791._00006.skimmed.root', event:0 - check logs for possible stacktrace
Worker 'valtical07.cern.ch-0.33' has been removed from the active list

+++ Message from top master at valtical.cern.ch:1093 : marking valtical07.cern.ch:1093 (0.33) as bad
+++ Reason: received kPROOF_FATAL

+++ Most likely your code crashed on worker 0.33 at valtical07.cern.ch:1093.
+++ Please check the session logs for error messages either using
+++ the 'Show logs' button or executing
+++
+++ root [] TProof::Mgr("valtical.cern.ch:1093")->GetSessionLogs()->Display("0.33",0)

0.13: caught exception triggered by signal '1' while processing dset:'physics', file:'root://valtical05.cern.ch//localdisk/xrootd/users/qing/mc12_p1067/user.qing.mc12_8TeV.107660.AlpgenJimmy_AUET2CTEQ6L1_ZmumuNp0.merge.NTUP_SMWZ.e1218_s1469_s1470_r3542_r3549_p1067_2LepSkim_v2/user.qing.000791._00010.skimmed.root', event:0 - check logs for possible stacktrace
Worker 'valtical05.cern.ch-0.13' has been removed from the active list

+++ Message from top master at valtical.cern.ch:1093 : marking valtical05.cern.ch:1093 (0.13) as bad
+++ Reason: received kPROOF_FATAL

+++ Most likely your code crashed on worker 0.13 at valtical05.cern.ch:1093.
+++ Please check the session logs for error messages either using
+++ the 'Show logs' button or executing
+++
+++ root [] TProof::Mgr("valtical.cern.ch:1093")->GetSessionLogs()->Display("0.13",0)

Both files are good. It seems to me that the two PROOF WNs had a problem accessing them and were then marked as bad. My question is:

Will the two files be re-processed once the two WNs are marked as bad?

Cheers, Gang

Hi:

I just ran a test which shows that the two files are not processed when PROOF reports those two WNs as bad. How can we avoid such errors? They seem to happen randomly on the WNs.

Cheers, Gang

Hi,

Bad file re-assignment is tricky because it is not easy (if at all possible) to decide that a file giving problems to one worker will not give the same problems to other workers.
PROOF should produce a missing-file list at the end, listing the files that were not processed, so that you can decide what to do.
It is true, however, that files are assumed to be accessible by all workers in the same way. If the files are on the worker machines, there is a way to force locality, i.e. to have workers process only the files on their own disks. For that you have to set

proof->SetParameter("PROOF_ForceLocal", (Int_t) 1);

But it is not possible to veto processing of certain files on certain machines.
Any idea why those workers cannot access those files?

G. Ganis
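
As a hedged illustration of the missing-file bookkeeping mentioned above: once you know which files were actually processed, the files to resubmit are simply the inputs that are absent from that set. The sketch below is plain C++ and does not call any PROOF API; the function name `missingFiles` and the file names in the usage note are hypothetical, and how you obtain the processed set from your own job is left open.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT/PROOF): return the input files that
// do not appear in the processed set, preserving the original input order.
std::vector<std::string> missingFiles(const std::vector<std::string>& inputs,
                                      const std::set<std::string>& processed) {
    std::vector<std::string> missing;
    for (const std::string& file : inputs) {
        // A file counts as missing if no worker reported it as processed.
        if (processed.count(file) == 0) {
            missing.push_back(file);
        }
    }
    return missing;
}
```

For example, with inputs `a.root`, `b.root`, `c.root` and a processed set containing `a.root` and `c.root`, the function returns a one-element vector holding `b.root`, which could then be passed to a second job.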

Hi, Ganis:

Thanks for the clarification. How does PROOF decide whether a file is local or not? For example, the following file is physically stored on valtical07, where we have 8 cores configured in proofd. If a PROOF job is sent to valtical07, will it recognize this file as a local one?

root://valtical.cern.ch//localdisk/xroo … immed.root

Cheers, Gang

It should, if the xrootd system serving the file is configured correctly.
The locate function of xrootd should resolve the file as residing on valtical07 and then the packetizer matches the host names.

You can check this by doing

root [] TFile *f = TFile::Open("root://valtical.cern.ch//localdisk/xroo ... immed.root")
root [] f->GetEndpointUrl()->GetUrl()

This should print a URL with valtical07 …

G. Ganis
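
The host-name matching described above can be pictured with a small sketch. This is not the packetizer's actual code, only a plain C++ illustration of the idea: take the endpoint URL that xrootd resolves for a file, extract its host, and compare it with the worker's host name. The helper names `hostOf` and `isLocalTo` are hypothetical, and the sketch assumes URLs of the form `scheme://host[:port]/path` without a `user@` part.

```cpp
#include <string>

// Hypothetical helper: extract the host from a URL such as
// "root://host:port//path/file.root". Returns an empty string when the
// URL contains no "://" separator.
std::string hostOf(const std::string& url) {
    const std::string sep = "://";
    const std::string::size_type start = url.find(sep);
    if (start == std::string::npos) {
        return "";
    }
    const std::string::size_type begin = start + sep.size();
    // The host ends at the first ':' (port) or '/' (path), whichever comes first.
    const std::string::size_type end = url.find_first_of(":/", begin);
    if (end == std::string::npos) {
        return url.substr(begin);
    }
    return url.substr(begin, end - begin);
}

// Hypothetical helper: a file is treated as local to a worker when the host
// in its endpoint URL equals the worker's host name. The comparison is exact
// (case-sensitive, no alias or domain resolution), which is a simplification.
bool isLocalTo(const std::string& endpointUrl, const std::string& workerHost) {
    const std::string host = hostOf(endpointUrl);
    return !host.empty() && host == workerHost;
}
```

With this sketch, an endpoint URL whose host is `valtical07.cern.ch` would count as local for a worker on `valtical07.cern.ch` and as remote for one on `valtical05.cern.ch`. If the endpoint URL still shows the redirector host instead of the data server, no worker would match, which is one reason to check the output of `GetEndpointUrl()` first.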