Hadd script: cause failure and exit status different from 0 if a file is invalid

rbouquet · December 7, 2021, 9:48am

Dear experts,

I was wondering if there is a way for the hadd script to stop if ROOT raises an error about an invalid file or something else.
Because EOS can have transient errors, condor jobs sometimes fail to open a file properly, for example with the error below:

SysError in <TFile::ReadBuffer>: error reading from file .... (Input/output error)

In ROOT I know that setting gErrorAbortLevel = kError; causes such an error to stop the program, which then returns an exit status different from 0.

But since hadd is an executable, I don’t know whether such behaviour is possible.
It would be important to provide such an option (if it does not already exist), because hadding takes time, and failing as soon as an error occurs would save a lot of it.

Currently, if such an error is raised, the hadd command continues and returns a status of 0.

Many thanks in advance,

ROOT Version: 6.24

rbouquet · December 8, 2021, 9:25am

Tagging @pcanal @couet @henryiii on that question/issue
Many thanks in advance

couet · December 8, 2021, 9:42am

The hadd help mentions the -k option, which skips missing or corrupted files.
Could it be a solution?

Wile_E_Coyote · December 8, 2021, 9:48am

@couet I think the problem is that it is not the ROOT file itself that is “corrupt”, but a temporary “condor” failure. So, you want to “catch” such failures and (automatically) resubmit the corresponding batch jobs.

rbouquet · December 8, 2021, 11:33am

Hi @couet,

Thanks, but no, it would not be a solution, because those files are indeed not corrupted; it is just condor or EOS having a transient issue.
We do not want to skip those files for our analysis: they contain histograms.

I think the solution would be to add a new option to the hadd command, like -e or something similar, for setting the error level.

What would stop the hadd script when ROOT raises an error is setting the global variable gErrorAbortLevel = kError; inside the hadd script.
With the default gErrorAbortLevel the error is just reported inside the hadd script, but the hadd process continues.
So the only way to detect that an error occurred is to run grep -i Error (or something similar) on the output after the hadd has finished.
Hadding takes time, so it is a huge loss of time and computational resources not to be able to stop the script when a transient error occurs.
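
As an illustration of doing that grep while the merge is still running, rather than afterwards, here is a minimal sketch of an external wrapper. It assumes that ROOT's messages reach the terminal as lines starting with "Error in <", "SysError in <" or "Fatal in <" (as in the SysError line quoted above); the function name and the regular expression are hypothetical, not part of hadd.

```python
import re
import subprocess
import sys

# Assumed pattern: lines such as "Error in <...>:" or "SysError in <...>:".
ERROR_LINE = re.compile(r"^(Error|SysError|Fatal) in <")


def run_and_abort_on_error(cmd, pattern=ERROR_LINE):
    """Run cmd, echo its output, and kill it at the first matching line.

    Returns 1 if an error line was seen, otherwise the command's own
    exit status.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # read both streams through one pipe
        text=True,
    )
    saw_error = False
    for line in proc.stdout:
        sys.stdout.write(line)
        if pattern.search(line):
            saw_error = True
            proc.kill()  # stop the merge immediately
            break
    proc.stdout.close()
    status = proc.wait()
    return 1 if saw_error else status
```

It could be called as sys.exit(run_and_abort_on_error(["hadd", "merged.root", "a.root", "b.root"])), where the file names are placeholders. A partially written output file may be left behind when the process is killed, so the caller would still need to clean up.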

With that fix the hadd script could be stopped and would return a status different from 0.

It would be good if this became the default behaviour of the hadd command,
as hadd is often used in batch jobs since it takes time, and people would then be aware that a problem occurred.

rbouquet · December 8, 2021, 11:39am

Hi @Wile_E_Coyote,

Yes sure, but some hadding can take several hours or a day for our analysis, and usually the error I am pointing out occurs right at the beginning of the hadd, when the hadd script checks the files.

So we are losing almost two days in that process and using computational resources for nothing, whereas if hadd had the feature of crashing (which is a 3-line modification, I think) it would save us a lot of time.

In the meantime I created our own hadding script, but making this feature available to everyone would benefit many analyses, I think, especially new ROOT users who would not even be aware of the problem.

Many thanks in advance

Wile_E_Coyote · December 8, 2021, 11:49am

@rbouquet Yes, that was precisely my point.

BTW. @couet I’m unsure if one can make “hadd” clever enough to distinguish between “a ROOT file is corrupt” and “a temporary failure to access a ROOT file”.

rbouquet · December 8, 2021, 11:56am

@Wile_E_Coyote ah yes sorry I thought you were telling me to do so
Thanks

And yes I agree with you I think it would be impossible to distinguish between “a ROOT file is corrupt” and “a temporary failure to access a ROOT file”.

So yes, to do things properly it would require not setting gErrorAbortLevel = kError; when the -k option is used, as that flag is intended to skip problematic files.
Opening a zombie/corrupted file with gErrorAbortLevel = kError; set would just stop the hadd.

couet · December 8, 2021, 12:34pm

I am not sure either. @pcanal should know.

pcanal · December 8, 2021, 1:11pm

I think that the only way to know the difference between transient and persistent error would be to parse the error message (and even there I am not sure one can do this accurately). Parsing the error message can be done either externally or internally (by replacing the error logger function).

Alternatively, one can “assume” that the errors are transient and try once or twice after seeing an error.
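
A minimal sketch of that "assume it is transient and try again" approach, done externally. It only works if the merge step reports failure through its exit status, which (per this thread) plain hadd does not do for these I/O errors, so the `run` hook below is a hypothetical parameter meant to be replaced by a wrapper that does. The function name and parameters are illustrative.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0.0, run=None):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    `run` is a hypothetical hook taking the command and returning an exit
    status; by default the command is executed with subprocess.
    Returns the exit status of the last attempt.
    """
    if run is None:
        run = lambda c: subprocess.run(c).returncode
    status = 1
    for attempt in range(1, max_attempts + 1):
        status = run(cmd)
        if status == 0:
            break
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # give the storage time to recover
    return status
```

Capping the number of attempts matters here: a file that is really corrupt fails every time, so a bounded retry turns a persistent error into a final non-zero status instead of an endless loop.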

rbouquet · December 13, 2021, 8:42am

Hi @pcanal, @Wile_E_Coyote, @couet,

Shall I open a GitHub issue to follow up on this problem?
I mean, just having the possibility to set gErrorAbortLevel = kError; via a flag in hadd would already be great, as it would stop and exit with a failure.
Also, checking for zombie files can be done before calling hadd; it is not really complicated.
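
For illustration, such a pre-check could look like the sketch below. It assumes PyROOT is available and uses TFile.Open and IsZombie(); the helper names are hypothetical, and the `check` parameter exists only so that the filtering logic can be exercised without ROOT.

```python
def is_unreadable(path):
    """Return True if ROOT cannot open `path` (requires PyROOT)."""
    import ROOT  # imported lazily so the rest works without ROOT installed

    f = ROOT.TFile.Open(path)
    if not f or f.IsZombie():
        return True
    f.Close()
    return False


def find_bad_files(paths, check=is_unreadable):
    """Return the subset of `paths` for which `check` reports a problem."""
    return [p for p in paths if check(p)]
```

A non-empty result from find_bad_files(input_files) would be the signal to stop before starting the merge. Note that a file can also be flagged because of a transient access error at check time, and a file that passes the check can still fail to read later.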

In most cases the files provided to hadd should not be zombies; otherwise a problem most likely occurred in the job/script creating the problematic file, and the user should be aware of it.

I think the simplest thing (without having to parse errors, which would be complicated) is to:

- add a new flag that sets gErrorAbortLevel = kError;
- say in the doc that hadd will then stop if a file is a zombie
  → robust hadd
- not allow that new flag when the ignore-zombie-files flag is turned on
- add to the documentation that a file can be seen as a zombie due to a transient error (for instance condor being overloaded)
  → be careful with that option

Because retrying a hadd is fairly simple to implement in C++/Python if the exit status is not 0,
compared to having to check that no error was raised while hadding. Also, the error I reported happens at the beginning of the hadd, so it would save a lot of time.

Axel · December 16, 2021, 8:19am

@pcanal not sure I understand. Do you agree that hadd should be erroring out if there is an I/O error on a file (whether the error is sporadic or persistent)?
