How to tell when chain.Process skips a file

luehring · October 9, 2014, 10:38pm

Hi Everyone,

Sorry for asking a newbie question, but I have just spent several hours trying to understand how to recognize when chain.Process skips a file because the file is unavailable (multiple copies of all files read by the job exist at various sites, but sometimes a file can't be opened or can't be read correctly). If there is a thread somewhere answering this, just point me at it. The files I am reading are in the ATLAS distributed XRootD data storage system (FAX). I am using a complicated ATLAS analysis code that I did not write, so I am struggling to find the right place to look for evidence of a skipped file.

I normally code batch jobs in exactly the way that chain.Process seems to work by default: continue at all costs to read as many events as possible rather than waste the CPU time already spent running the job when a read or file open fails partway through the input data stream. However, I am testing a system designed to automatically retry the job if any of the inputs are not read. I guess detecting that a file on the chain has been skipped is trivial, but what I tried does not work:

baseElecChan->nWeightedAcceptedEvents  = nWeightedAcc;  // MeV; lower elec pt for testing
chain.SetNotify(baseElecChan);
if (chain.Process(baseElecChan,"",nevents,0) == -1) {
  return 1;
}

I can see that nevents is set to a large number (1000000000) but I don’t know if this is kBigNumber. When XRootD can’t access the file I get an error, but I can’t figure out where it comes from. In the job I am looking at I see the following messages (various other messages occur depending on why the selected copy of the file won’t open):

Error in TXNetFile::Init: root://fax.mwt2.org//atlas/rucio/data12 … 057.root.2 failed to read the file type data.
Error in TXNetFile::CreateXClient: open attempt failed on root://fax.mwt2.org//atlas/rucio/data12 … 057.root.2

Control never transfers to the notification function set up by the chain.SetNotify command, as it does when the file can be read. The program just seems to quietly go on to the next file. One of the goals of the project is to minimize the number of reads sent out on the network, so this means that no check of whether the file can actually be found and read is made before the chain.Process command.

Thanks greatly in advance for any advice.

Fred

luehring · October 14, 2014, 1:30pm

Hi Everyone,

I will try to simplify this question so someone might answer. I am using a code I did not write. The code creates a TChain and then uses it with the chain.Process command. The chain.Process command is:

chain.Process(baseElecChan,"",nevents,0)

The files are on a distributed (WAN) XRootD data store and nevents is set to 1000000000. How do I tell that one or more of the files in the chain are skipped? I put a slightly longer snippet of code below my signature showing what I tried that did not work. Even a pointer to a working example would be helpful.

Fred

This did not work:

baseElecChan->nWeightedAcceptedEvents = nWeightedAcc; // MeV; lower elec pt for testing
chain.SetNotify(baseElecChan);
if (chain.Process(baseElecChan,"",nevents,0) == -1) {
  return 1;
}

ganis · October 28, 2014, 5:11pm

Dear Fred,

The error on file opening is not recorded or transmitted by TChain, unfortunately. The file is skipped and an error is printed on the screen.

One possibility, if you just want to know whether all the files have been processed, is to add a counter in the Notify method of your baseElecChan TSelector class. This method is only called on successful file opening, so if you count the number of files successfully opened in there and compare it with chain.GetNtrees(), you can detect cases where some of the files could not be opened.
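
A minimal sketch of the counting logic, written in plain C++ so that it runs without ROOT. CountingSelector, ChainFile and ProcessAndCountSkipped are hypothetical stand-ins, not ROOT classes: in the real job the increment would sit in the selector's Notify() and the total would come from chain.GetNtrees().

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the selector; only the counting logic is shown.
// In a real TSelector the increment would go inside Notify().
struct CountingSelector {
    int nFilesOpened = 0;

    // Called once for every file that was opened successfully.
    bool Notify() {
        ++nFilesOpened;
        return true;
    }
};

// Hypothetical stand-in for one chain entry; `readable` models whether
// the open succeeds.
struct ChainFile {
    std::string url;
    bool readable;
};

// Models the behaviour described above: unreadable files are skipped
// silently and Notify() is called only for files that open.
// Returns the number of files that were skipped.
int ProcessAndCountSkipped(const std::vector<ChainFile>& files,
                           CountingSelector& selector) {
    for (const ChainFile& f : files) {
        if (!f.readable) continue;  // skipped, no notification
        selector.Notify();
    }
    // In ROOT, files.size() would correspond to chain.GetNtrees().
    return static_cast<int>(files.size()) - selector.nFilesOpened;
}

// Usage: turn the comparison into an exit status a retry system can act on.
int ExampleJobStatus() {
    std::vector<ChainFile> files = {
        {"file1.root", true}, {"file2.root", false}, {"file3.root", true}};
    CountingSelector selector;
    int skipped = ProcessAndCountSkipped(files, selector);
    assert(selector.nFilesOpened == 2);
    return skipped == 0 ? 0 : 1;  // non-zero means at least one file was skipped
}
```

The comparison only says how many files were skipped, not which ones, so a non-zero status here would mean rerunning the whole set of files.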

Does this help?

G. Ganis

luehring · October 28, 2014, 7:00pm

Hi Gerardo,

I will use the workaround you suggest, but to me it seems like a deficiency not to provide a return code that indicates whether the requested files could be accessed. Ideally there would be a way to know which files failed so they could be retried without having to rerun the whole set of files.

Fred

ganis · October 29, 2014, 9:24am

Hi,

The different possibilities would probably not fit in one return code, but I agree that there is something missing in here.
PROOF fills a small status object with information about the run and returns the list of files which could not be accessed, if any.
Something like that should probably be implemented here too.
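
As a rough illustration of what such a status object could look like on the user side, here is a sketch in plain C++. FileOpenStatus and its methods are hypothetical and are not part of ROOT or PROOF. The assumption is that the expected list is filled from the file names in the chain (for example via chain.GetListOfFiles()) and that MarkOpened is called from the selector's Notify() with the name of the current file; the two names must be written the same way for the comparison to work.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status object: records which files were opened and
// reports the ones that were not.
class FileOpenStatus {
public:
    explicit FileOpenStatus(std::vector<std::string> expected)
        : expected_(std::move(expected)) {}

    // To be called for every file that opened successfully.
    void MarkOpened(const std::string& url) { opened_.insert(url); }

    // Files that were expected but never reported as opened,
    // in their original order.
    std::vector<std::string> Missing() const {
        std::vector<std::string> missing;
        for (const std::string& url : expected_) {
            if (opened_.count(url) == 0) missing.push_back(url);
        }
        return missing;
    }

    // True when every expected file was reported as opened.
    bool AllOpened() const { return Missing().empty(); }

private:
    std::vector<std::string> expected_;  // every file the job should read
    std::set<std::string> opened_;       // files reported as opened
};
```

After processing, Missing() would give the list of files to retry, so only those need to be resubmitted rather than the whole set.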

Cheers, G

luehring · June 29, 2015, 7:50am

Hi Gerardo,

Did the ROOT team consider providing a method for users to determine whether an XRootD access was successful?

Fred