Hi PROOFers,
We have been using 5.33/03 for a while now (a month or two) to make use of the lovely TSelVerifyDataSet, which is an absolute lifesaver for adding large lists of files to the cluster. It worked great until the other day, when one user began having an odd problem: the dataset would verify fine (100% staged, no missing files), yet only 5% or so of the dataset would be processed before the job 'appeared' to exit normally. With some digging, it turned out that the file URLs in his input list did not use the xrootd redirector host (in our case atlprf01) but were already resolved to the data servers (atlprf02-16). Apparently, the dataset verification was resolving these files further to file:///, removing the host name. Then, when a job was run, most of the files would be assigned to the wrong node and be missing there, and only by luck would some be assigned to the right node and work.
Thinking this might be fixed in the trunk, I updated our cluster to 5.99/01 (the trunk) this afternoon and tried that. There were lots of code changes, and now neither the resolved (atlprf02-16) nor the unresolved (atlprf01) input worked. In 5.99/01, the list of files was being resolved twice: once on the master (atlprf01 -> atlprf02-16 or atlprf02-16 -> atlprf02-16, which puts the two types of input on equal footing) and once on the workers (atlprf02-16 -> file:///).
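To make the double resolution concrete, here is a toy model in plain C++ (no ROOT needed). It is only a sketch of the behaviour as I understand it, not the actual PROOF code: FileEntry, addResolved, resolveOnMaster and resolveOnWorker are names made up for illustration, and the path /data/f.root is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a TFileInfo: candidate URLs, most preferred first.
struct FileEntry {
    std::vector<std::string> urls;
    const std::string& preferred() const { return urls.front(); }
};

// Put a newly resolved URL in front of the list.
void addResolved(FileEntry& f, const std::string& url) {
    f.urls.insert(f.urls.begin(), url);
}

// Pass 1 (master): the URL ends up pointing at a data server.
void resolveOnMaster(FileEntry& f, const std::string& dataServer,
                     const std::string& path) {
    addResolved(f, "root://" + dataServer + "/" + path);
}

// Pass 2 (worker): the URL is resolved again, to a local file URL
// that no longer carries a host name.
void resolveOnWorker(FileEntry& f, const std::string& path) {
    addResolved(f, "file://" + path);
}

// Walk one file through both passes and check what ends up in front.
void demoDoubleResolution() {
    FileEntry f;
    f.urls.push_back("root://atlprf01//data/f.root");  // redirector URL
    resolveOnMaster(f, "atlprf02", "/data/f.root");
    assert(f.preferred() == "root://atlprf02//data/f.root");
    resolveOnWorker(f, "/data/f.root");
    assert(f.preferred() == "file:///data/f.root");    // host name is gone
}
```

After the second pass the preferred URL says nothing about which node holds the file, which matches the symptom of files landing on the wrong node.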
There seemed to be a couple of ways to fix this. One would be to disable adding new URLs to the TFileInfo object on the workers, but other people might have sub-masters and other complicated setups that might have trouble with this. The simplest thing I could think of was to ensure that no URL without a host name is added to a TFileInfo object during the TSelVerifyDataSet run:
// Add url of the disk server in front of the list
TUrl eurl(*(file->GetEndpointUrl()));
eurl.SetOptions(url->GetOptions());
eurl.SetAnchor(url->GetAnchor());
if (strcmp(eurl.GetHostFQDN(), "")) { // enforce presence of host name
   fileinfo->AddUrl(eurl.GetUrl(), kTRUE);
   if (gDebug > 0) ::Info("TDataSetManager::ScanFile", "added URL %s", eurl.GetUrl());
}
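For anyone who wants to play with the idea outside ROOT, the guard above boils down to "only add a URL if it has a host part". A minimal standalone sketch follows; hostOf and shouldAddUrl are hypothetical helpers written for this post, they do a naive string split, and they are not meant to reproduce how TUrl actually parses URLs or fills in the FQDN.

```cpp
#include <cassert>
#include <string>

// Naive stand-in for asking a URL for its host: returns the host part of
// "scheme://host[:port]/path", or "" when there is none ("file:///path").
std::string hostOf(const std::string& url) {
    const std::string sep = "://";
    const std::size_t start = url.find(sep);
    if (start == std::string::npos) return "";
    const std::size_t begin = start + sep.size();
    const std::size_t end = url.find_first_of("/:", begin);
    if (end == std::string::npos) return url.substr(begin);
    return url.substr(begin, end - begin);
}

// The guard itself: refuse URLs that do not name a host.
bool shouldAddUrl(const std::string& url) {
    return !hostOf(url).empty();
}

// A few URLs of the kinds discussed in this post.
void demoHostGuard() {
    assert(hostOf("root://atlprf02//data/f.root") == "atlprf02");
    assert(shouldAddUrl("root://atlprf01:1094//data/f.root"));
    assert(!shouldAddUrl("file:///data/f.root"));  // no host, so skipped
}
```

With a check like this, a data server URL is still added in front of the list, while a file:/// URL produced on a worker is simply dropped and the previous URL stays preferred.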
I’m no expert on different PROOF setups, but I can’t think of a situation where pre-resolving a file URL to file:///xxxx would be advantageous in a PROOF dataset, so I don’t think this should harm anything. The change is applied in two places in TDataSetManager::ScanFile (in TDataSetManager.cxx). My svn diff is attached if this is the solution you want to go with.
Thanks,
Bart
Bug report:
savannah.cern.ch/bugs/index.php?94889
svndiff.txt (1.32 KB)