Rootd file access issue, 5-30

bbutler · July 15, 2011, 9:24pm

Hi all,

  We recently upgraded our cluster from 5-28a to 5-30 and immediately started experiencing the following error. It happens with the ProofNtuple tutorial as well:

TProofOutputFile::AddFile: error from
TFileMerger::AddFile(rootd://atlprf06.slac.stanford.edu:1093 … tuple.root)

It seems to be directly related with rootd file access being enabled by default:

root.cern.ch/viewcvs/trunk/proof … hrev=40170

We resolved it by adding the following line to our xrootd configuration file, essentially mandating use of the xrootd protocol/port:

xpd.putenv LOCALDATASERVER=root://:1094/

That said, this is supposed to work, right?

-Bart

ganis · July 18, 2011, 4:09pm

Hi,

Yes, the daemon should generate such a directive automatically. If you do not put this line, what do you get in the "*.env’ file in the sandbox?
Look for //last-worker-session/worker-0..<…>.env

Gerri

bbutler · July 18, 2011, 6:51pm

We get the same thing with the proofd port number and starting with rootd://, thus producing the error message from the previous post.

bbutler · July 18, 2011, 6:53pm

I will add that at this time we’ve reverted completely to 5.28a, as 5.30, with this “fix”, hammers the cluster’s I/O such that jobs run 40% slower and crash at the end during merging.

ganis · July 18, 2011, 8:34pm

Hi,

bbutler:

We get the same thing with the proofd port number and starting with rootd://, thus producing the error message from the previous post.[/quote]
Yes, sorry, this is what is expected, to use rootd via the same proof port. Actually in several tests we did this was not interfering with any existing xrootd setup and was facilitating the access to sandboxes. So we decided to have it ON by default. We may revise this after your findings. It would be interesting to find out why it failed in your case. Did you find any error on the servers logs; once started the ‘rootd’ daemon by itself logs to syslog. To turn it OFF you can add the line
xpd.rootd off
After this the henerated line with LOCALDATASERVER should be the same as the one you have.

[quote=“bbutler”]I will add that at this time we’ve reverted completely to 5.28a, as 5.30, with this “fix”, hammers the cluster’s I/O such that jobs run 40% slower and crash at the end during merging.

This is definitely more worrying. Is this for the tutorial jobs or for your jobs?
Can you provide the performance tree in the two cases, 5.28a and 5.30? See root.cern.ch/drupal/content/crea … mance-tree

Gerri

bbutler · July 18, 2011, 11:22pm

[quote=“ganis”]Hi,

bbutler:

We get the same thing with the proofd port number and starting with rootd://, thus producing the error message from the previous post.[/quote]
Yes, sorry, this is what is expected, to use rootd via the same proof port. Actually in several tests we did this was not interfering with any existing xrootd setup and was facilitating the access to sandboxes. So we decided to have it ON by default. We may revise this after your findings. It would be interesting to find out why it failed in your case. Did you find any error on the servers logs; once started the ‘rootd’ daemon by itself logs to syslog. To turn it OFF you can add the line
xpd.rootd off
After this the henerated line with LOCALDATASERVER should be the same as the one you have.

I will check it out when I get a chance–we still have 5.30 installed as an option on the cluster for testing.

This is definitely more worrying. Is this for the tutorial jobs or for your jobs?
Can you provide the performance tree in the two cases, 5.28a and 5.30? See root.cern.ch/drupal/content/crea … mance-tree
[/quote]

This is for our jobs, I did not try the tutorial once I fixed the rootd problem. The performance drop/I/O waits start a few minutes into the job (until then it appears to be comparable in event rate to 5.28a) and then increases and levels off. The crashes at the end may be a symptom of the stressed I/O system, as they manifest themselves as ntuple read error and CollectInputFrom(…) errors, but only start after one or more workers have finished and begun merging (that is, some workers start merging, and others then throw exceptions). Small jobs are seemingly unaffected, but large jobs (~30M events) never complete without crashes. I’ll try to get you the performance trees.

bbutler · August 3, 2011, 6:55am

Here are performance trees for 5.28a and 5.30, running the same analysis on the same small sample. The problems with 5.30 do not manifest on small samples, so I’m not sure if these will help. If they do not, I can run larger jobs and hope they don’t crash on merging.

slac.stanford.edu/~bcbutler/5.28a_perf.root
slac.stanford.edu/~bcbutler/5.30_perf.root