Rootd file access issue, 5.30

Hi all,

  We recently upgraded our cluster from 5.28a to 5.30 and immediately started seeing the following error. It happens with the ProofNtuple tutorial as well:

TProofOutputFile::AddFile: error from
TFileMerger::AddFile(rootd://atlprf06.slac.stanford.edu:1093 … tuple.root)

It seems to be directly related to rootd file access being enabled by default:

root.cern.ch/viewcvs/trunk/proof … hrev=40170

We resolved it by adding the following line to our xrootd configuration file, which essentially mandates use of the xrootd protocol/port:

xpd.putenv LOCALDATASERVER=root://:1094/
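
For context, the relevant part of our config now looks roughly like this (a sketch; the xpd.port value is just our site's setup and the comments are mine, only the xpd.putenv line is the actual fix):

# xproofd control port (site-specific)
xpd.port 1093
# make the workers expose/access local files via xrootd on 1094
xpd.putenv LOCALDATASERVER=root://:1094/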

That said, this is supposed to work, right?

-Bart

Hi,

Yes, the daemon should generate such a directive automatically. If you do not add this line, what do you get in the "*.env" file in the sandbox?
Look for //last-worker-session/worker-0..<…>.env
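If the directive is generated correctly, that file should contain a line of the form (host and port here are illustrative):

LOCALDATASERVER=root://<your-node>:1094/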

Gerri

We get the same directive, but with the proofd port number and starting with rootd://, which produces the error message from the previous post.
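
That is, the .env file contains something like this (reconstructed from memory; host as in the error message above):

LOCALDATASERVER=rootd://atlprf06.slac.stanford.edu:1093/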

I will add that at this point we have reverted completely to 5.28a: 5.30, even with this “fix”, hammers the cluster’s I/O so hard that jobs run 40% slower and crash at the end during merging.

Hi,

This is definitely more worrying. Is this happening with the tutorial jobs or with your own jobs?
Can you provide the performance trees for the two cases, 5.28a and 5.30? See root.cern.ch/drupal/content/crea … mance-tree
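
In short, a minimal sketch for producing such a tree (this assumes the standard TPerfStats parameter and output names; the master URL and the query are placeholders to adapt to your setup):

{
   TProof *p = TProof::Open("your-master");   // illustrative master URL
   p->SetParameter("PROOF_StatsTrace", "");   // enable the per-packet trace
   // ... run your usual query with p->Process(...) ...
   TTree *perf = (TTree *) p->GetOutputList()->FindObject("PROOF_PerfStats");
   if (perf) {
      TFile f("perf.root", "RECREATE");
      perf->Write();
   }
}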

Gerri

[quote=“ganis”]Hi,

This is definitely more worrying. Is this happening with the tutorial jobs or with your own jobs?
Can you provide the performance trees for the two cases, 5.28a and 5.30? See root.cern.ch/drupal/content/crea … mance-tree
[/quote]

I will check it out when I get a chance; we still have 5.30 installed as an option on the cluster for testing.

This is for our jobs; I did not try the tutorial once I fixed the rootd problem. The performance drop and I/O waits start a few minutes into the job (until then the event rate appears comparable to 5.28a), then increase and level off. The crashes at the end may be a symptom of the stressed I/O system: they manifest themselves as ntuple read errors and CollectInputFrom(…) errors, but only start after one or more workers have finished and begun merging (that is, some workers start merging, and others then throw exceptions). Small jobs are seemingly unaffected, but large jobs (~30M events) never complete without crashes. I’ll try to get you the performance trees.

Here are performance trees for 5.28a and 5.30, running the same analysis on the same small sample. The problems with 5.30 do not manifest on small samples, so I’m not sure if these will help. If they do not, I can run larger jobs and hope they don’t crash on merging.

slac.stanford.edu/~bcbutler/5.28a_perf.root
slac.stanford.edu/~bcbutler/5.30_perf.root
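
To browse them, something like this should work (the tree name PROOF_PerfStats is the standard TPerfStats one; I saved the trees unchanged):

TFile *f = TFile::Open("5.28a_perf.root");
TTree *t = (TTree *) f->Get("PROOF_PerfStats");
if (t) t->Print();   // list the branches; t->Draw(...) for rate/latency plots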