Making TChain more resilient

Dear all,

When working with TChain and reading files from a networked file system (or from xrootd), opening a file sometimes fails on the first try (due to a timeout?), potentially resulting in a crash.

To be more precise, I add files to a TChain using TChain::Add(), and at some point call TChain::GetListOfLeaves(). If there is a transient issue opening the first file in the chain, the function returns None (Python) or nullptr (C++).

In principle I could check whether the return value is sound and call the function again a few times, waiting a few seconds between each try, but is there a better way to make TChain more resilient against these issues?

Cheers,
Sebastien


ROOT Version: 6.16
Platform: SLC7
Compiler: gcc8
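
For reference, the retry-and-wait workaround described in the question could be sketched roughly as below. This is only an illustration: the helper name `retry` and its parameters are hypothetical, and nothing in this thread confirms that calling TChain::GetListOfLeaves() a second time re-attempts the file open.

```python
import time


def retry(call, is_ok=bool, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call `call()` until `is_ok(result)` is true or the attempts run out.

    Returns the last result, which may still be unusable, so the caller
    has to check it again.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = call()
        if is_ok(result):
            return result
        if attempt < attempts:
            sleep(delay_s)  # wait before the next try
    return result


# Hypothetical usage with an existing TChain object named `chain`:
#   leaves = retry(chain.GetListOfLeaves, attempts=5, delay_s=10.0)
#   if not leaves:
#       raise RuntimeError("could not read the list of leaves")
```

The default `is_ok=bool` treats both None and a null or empty return value as a failure.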


Maybe @pcanal can help you with this question.

There ought to be a call to LoadTree in your code (directly or indirectly). You could check its return value. If it is negative, there was a problem opening the file (see TChain::LoadTree’s documentation for more details).
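
A minimal sketch of that check, assuming a chain object that exposes LoadTree() as TChain does. The function name `first_tree_loads` is made up for illustration, and the check only covers the file holding entry 0.

```python
def first_tree_loads(chain):
    """Return True if the tree holding entry 0 could be loaded.

    Per the suggestion above, a negative return value from LoadTree
    signals a problem (for example, the file could not be opened).
    """
    status = chain.LoadTree(0)
    return status >= 0


# Hypothetical usage; the tree name and file path are placeholders:
#   import ROOT
#   chain = ROOT.TChain("events")
#   chain.Add("root://some.server//path/to/file.root")
#   if not first_tree_loads(chain):
#       raise RuntimeError("first file in the chain could not be opened")
```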

Thanks for the suggestion! However this would only be practical to check if the first file can be opened, not all of the files added to the TChain, right?

Since I don’t explicitly loop over the events but pass it to RDataFrame I’m afraid this would not be practical…

Along those lines I thought about using TChain::AddFile() (with nentries=0) instead of TChain::Add() to add the files, since it would then try to open the file and return 0 if it didn’t work… But there are problems with this too:

  • As far as I understand, AddFile() opens the file and then closes it again right after, so this is no protection against a transient issue when the file is opened again while the events are looped over.
  • When submitting many jobs this would put an extra strain on the network or filesystem, since all the files (not only the first files in the chains) would be opened at once when the jobs start.

Someone pointed out to me there were some options one could use to increase the timeout or connection attempts: https://root.cern.ch/doc/master/TNetXNGFile_8cxx_source.html#l00798
Perhaps playing with that would help if there is no better solution…
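
One way to experiment with such settings from Python might look like the sketch below. It assumes the XRootD client environment variables XRD_CONNECTIONRETRY, XRD_CONNECTIONWINDOW and XRD_REQUESTTIMEOUT are honoured by the ROOT build in use, which has not been verified for this setup. The values are purely illustrative, and the helper name `apply_xrootd_settings` is hypothetical. This only concerns xrootd access, not other networked file systems.

```python
import os

# Assumed XRootD client settings; the values are illustrative only,
# not recommendations.
XRD_SETTINGS = {
    "XRD_CONNECTIONRETRY": "10",   # number of connection attempts
    "XRD_CONNECTIONWINDOW": "60",  # seconds allowed per connection attempt
    "XRD_REQUESTTIMEOUT": "600",   # seconds before a request times out
}


def apply_xrootd_settings(settings=XRD_SETTINGS, environ=os.environ):
    """Set each variable unless it is already defined in the environment."""
    for name, value in settings.items():
        environ.setdefault(name, value)
    # Report the values that are actually in effect.
    return {name: environ[name] for name in settings}


# Hypothetical usage: call this before the first remote file is opened.
#   apply_xrootd_settings()
```

Using `setdefault` keeps any value the user has already exported in the job environment.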

Humm … so the call is done indirectly in RDataFrame … next question (for @eguiraud) is whether the return value is tested and its result propagated somehow by RDataFrame.

TChain::AddFile … But there are problems with this too:

Yes, this is really a sub-optimal solution.

Perhaps playing with that would help if there is no better solution…

If the real problem is that the timeouts are ‘too short/small’ for your environment and use case, then even once we have improved the error recovery it might still be in your best interest to go that route in order to reduce the number of avoidable errors.

There is another related issue (bug?):

When adding a file path (using TChain::Add) that doesn’t exist to the TChain, as long as the TChain has at least one file that does exist, it will only print an error message when trying to open the problematic path and then quietly move on to the next one without crashing.

The behaviour is the same for both xrootd and dcap.

When using TChain::Add(path, 0) instead, the function does return zero as expected. But silently ignoring the non-existing path in the above case seems very dangerous, since most users leave the nentries argument at its default!
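
A rough sketch of a stricter way to add files, built on the Add(path, 0) behaviour described above. The helper name `add_files_strict` is hypothetical, and wildcard paths are not considered. As noted earlier in the thread, this opens every file up front, so it trades extra load on the storage for early detection.

```python
def add_files_strict(chain, paths):
    """Add each path with nentries=0 and return the paths that failed.

    With nentries=0 the file is opened when it is added, and a return
    value of zero indicates that the path could not be used.
    """
    failed = []
    for path in paths:
        if chain.Add(path, 0) == 0:
            failed.append(path)
    return failed


# Hypothetical usage with an existing TChain object named `chain`:
#   missing = add_files_strict(chain, list_of_paths)
#   if missing:
#       raise FileNotFoundError(f"could not open: {missing}")
```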
