Corrupted ROOT files without reason ?!

Dear colleagues,

When running a big Monte-Carlo production that produces ROOT files (let’s say 1000 jobs), all jobs finish normally. However, when I try to hadd the result files together, some of them seem to be “Zombies”, and I get this error from hadd -f206 :

>Error in <TFile::Init>: file Pion_v44_Nikosbeam_300.GeV_791.root is truncated at 6225920 bytes: should be 16831501, trying to recover
Info in <TFile::Recover>: Pion_v44_Nikosbeam_300.GeV_791.root, recovered key TDirectoryFile:VirtualDetector at address 266
Info in <TFile::Recover>: Pion_v44_Nikosbeam_300.GeV_791.root, recovered key TDirectoryFile:NTuples at address 395
Warning in <TFile::Init>: successfully recovered 2 keys
Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_300.GeV_791.root, got 0 of 4493
Warning in <TFile::GetRecordHeader>: Pion_v44_Nikosbeam_300.GeV_791.root: failed to read the StreamerInfo data from disk.
Error in <TFileMerger::AddFile>: cannot open file Pion_v44_Nikosbeam_300.GeV_791.root

Maybe you have an idea on:

  1. What could be causing this ?! Is there any way to “repair” these corrupted files ? I triple checked the specific job’s output file, and the job seems to have finished normally… however the *.root file seems corrupted that way.

  2. Is there a way to write a script/small program that “tests” these files after the jobs are finished, so that I re-submit them before I start the hadd process ? I am not an expert on the subject and I wonder if this is possible at all.

Kind regards and many thanks

Nikos

Hi,
dumb question: are you 100% sure that file->Close() (or equivalently the TFile destructor) is called for these files?

Cheers,
Enrico

Dear Enrico,

many thanks for the prompt reply. Actually these are ROOT files that are
created directly by G4beamline
(http://www.muonsinternal.com/muons3/G4beamline ) and I have no
knowledge of the “mechanism” by which they are made (whether the program properly
closes them). So the answer to your question is : I don’t know, and I am
not sure how I can check this ? If not, is there a way to do it myself?

Again, many thanks for your reply
Cheers,
Nikos

Uhm if you are getting corrupted files out of the machinery, you might have better luck asking the people responsible for the machinery itself.

@pcanal might be able to help on the matter of checking if a file is corrupted (the most basic check is to open the file and then look at file->IsZombie(), but there are other checks possible if I’m not mistaken).

Cheers,
Enrico

This indicates that the file ‘lost’ a large fraction of itself (about 2/3 in this case). Most likely something went wrong on the file system(s) when or after the files were produced; it could be anything from the disk being full when they were produced to an aborted download/copy.

Cheers,
Philippe.

Dear Philippe,

Your help is much appreciated. I will try to investigate what is going
on… My question would be, is there any “small” program that one
could use, to “test” these files for this effect when they are produced,
so that I could re-run them (before I start the hadd process) ?

C++ is not one of my strong points…

Cheers and many thanks
Nikos

Hi Nikos,

You could use the following (with the shell variable ‘filename_to_check’ set to the name of the file):

 root.exe -b -l -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) exit(1);"

(or you can put those 2 statements in a C++ script)

Cheers,

Philippe.

Dear Philippe,

This is really great. Many thanks!!

Would I be very rude, if I just asked you how it is possible to make the one-liner to check all the *.root files in the run directory, and report on the ones that are crashed ?

Trying your script in lxplus with :

root.exe -b -l -e “auto f = TFile::Open(“Pion_v44_Nikosbeam_250.GeV_10000.root”); if (f == nullptr || f->IsZombie()) exit(1);”

gives a quite strange output :

root [0]
ROOT_cli_0:1:1: error: Syntax error
auto f = TFile::Open(Pion_v44_Nikosbeam_250.GeV_10000.root); if (f == nullptr || f->IsZombie()) exit(1);
^
FunctionDecl 0x1f26438 <input_line_8:1:1, ROOT_cli_0:3:1> input_line_8:1:6 __cling_Un1Qu30 'void (void *)'
|-ParmVarDecl 0x1f26390 <col:22, col:28> col:28 vpClingValue 'void *'
|-CompoundStmt 0x20f9fe8 <col:42, ROOT_cli_0:3:1>
| |-DeclStmt 0x1f4a760 <line:1:1, col:60>
| | `-VarDecl 0x1f26530 <col:1, col:59> col:6 used f 'auto' cinit
| |   `-CallExpr 0x1f4a708 <col:10, col:59> '<dependent type>'
| |     |-UnresolvedLookupExpr 0x1f4a4e8 <col:10, col:17> '<overloaded function type>' lvalue (no ADL) = 'Open' 0x1f27ba0 0x1f49fe8
| |     `-CXXDependentScopeMemberExpr 0x1f4a6b0 <col:22, col:55> '<dependent type>' lvalue .root
| |       `-CXXDependentScopeMemberExpr 0x1f4a658 <col:22, col:45> '<dependent type>' lvalue .GeV_10000
| |         `-DeclRefExpr 0x1f4a610 <col:22> '<dependent type>' lvalue Var 0x1f4a548 'Pion_v44_Nikosbeam_250' '<dependent type>'
| |-IfStmt 0x20f9fa0 <col:62, col:103>
| | |-<<<NULL>>>
| | |-<<<NULL>>>
| | |-CXXOperatorCallExpr 0x20f9b38 <col:66, col:94> '<dependent type>'
| | | |-UnresolvedLookupExpr 0x20f9a70 <col:79> '<overloaded function type>' lvalue (ADL) = 'operator||' 0x20e3460 0x20e3a10 0x20e3ff0 0x20e45c0 0x20e4b70 0x20e5248 0x20e5648 0x20e6248 0x20e6fc8 0x20e7a18 0x20f6490 0x20f6f58 0x20f7a28 0x20f8460 0x20f8d68
| | | |-CXXOperatorCallExpr 0x20e3368 <col:66, col:71> '<dependent type>'
| | | | |-UnresolvedLookupExpr 0x20e3000 <col:68> '<overloaded function type>' lvalue (ADL) = 'operator==' 0x1f4a7e0 0x1f4ac08 0x1f4b048 0x1f4b6c0 0x1f4bb38 0x1f4bf78 0x1f4c1d0 0x1f4c428 0x1f4c680 0x1f4d860 0x1f4eb18 0x1f4f298 0x1f4f5e8 0x1f4fb68 0x1f500d0 0x1f50508 0x1f514d0 0x1f51d88 0x1f523b8 0x1f52a08 0x1f52f38 0x1f54c70 0x1f54fd8 0x1f585e8 0x1f58c08 0x1f59530 0x1f59b68 0x1f5a940 0x1f5aee8 0x1f5b840 0x1f5c8f0 0x1f5ce48 0x1f5d5b8 0x1f60f68 0x1f615a0 0x1f61bd0 0x1f62218 0x1f63048 0x1f636b8 0x1f64278 0x1f64878 0x1f69c78 0x1f6a1b0 0x1f6a6e0 0x1f6c308 0x1f6c8d8 0x1f6ce48 0x1f6d4b0 0x1f6d930 0x1f6ddb0 0x1f6f608 0x1f6f970 0x1f6fce0 0x1f6ff40 0x1f701a0 0x1f71368 0x1f71c30 0x1f728a8 0x1f75860 0x1f76648 0x1f77370 0x1f78408 0x1f7d4b0 0x1f7e970 0x1f80930 0x1f84a40 0x1f86f88 0x1f89748 0x1f8c168 0x1f8f078 0x1f91a20 0x1f959a0 0x1f97028 0x1fa0e48 0x1fa48c0 0x1fa55b8 0x1fa6548 0x1fa7070 0x1fa7518 0x1fa9818 0x1faa330 0x1fabd70 0x1fac520 0x1facaf0 0x1fad090 0x1fad660 0x1fb1740 0x1fb25b8 0x1fb3188 0x1fb3fb8 0x1fb46a0 0x1fb8e28 0x1fbe6f8 0x1fbf578 0x1fbffc8 0x1fc14b8 0x1fc1f88 0x1fc4aa0 0x1fc53a8
| | | | |-DeclRefExpr 0x1f4a778 <col:66> 'auto' lvalue Var 0x1f26530 'f' 'auto'
| | | | `-CXXNullPtrLiteralExpr 0x1f4a7c0 <col:71> 'nullptr_t'
| | | `-CallExpr 0x20e3430 <col:82, col:94> '<dependent type>'
| | |   `-CXXDependentScopeMemberExpr 0x20e33d8 <col:82, col:85> '<dependent type>' lvalue ->IsZombie
| | |     `-DeclRefExpr 0x20e33b0 <col:82> 'auto' lvalue Var 0x1f26530 'f' 'auto'
| | |-CallExpr 0x20f9f70 <col:97, col:103> 'void'
| | | |-ImplicitCastExpr 0x20f9f58 <col:97> 'void (*)(int) __attribute__((noreturn)) throw()' <FunctionToPointerDecay>
| | | | `-DeclRefExpr 0x20f9ed8 <col:97> 'void (int) __attribute__((noreturn)) throw()' lvalue Function 0x20f9b88 'exit' 'void (int) __attribute__((noreturn)) throw()'
| | | `-IntegerLiteral 0x20f9eb8 <col:102> 'int' 1
| | `-<<<NULL>>>
| `-NullStmt 0x20f9fd8 <line:2:1>
`-AnnotateAttr 0x1f4a5a8 <<invalid sloc>> R"ATTRDUMP(__ResolveAtRuntime)ATTRDUMP"
<<<NULL>>>
root [1]

Sorry to be bothering you, once more ! Your help is much appreciated.

Cheers,

Nikos

There is a fatal typo :slight_smile: in the copy … the inner quotes must be escaped with \:

root.exe -b -l -e "auto f = TFile::Open(\"Pion_v44_Nikosbeam_250.GeV_10000.root\"); if (f == nullptr || f->IsZombie()) exit(1);"

In bash you can do:

for filename_to_check in *.root
do
    root.exe -b -l -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) exit(1);"
    ... here or so check the result ... 
done;

or even

for filename_to_check in *.root
do
    root.exe -b -l -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }"
done;

Dear @pcanal,

I have no words to thank you for your efforts. Could you help me with one final question ? What am I doing wrong that keeps the script from exiting properly ? When I use the bash version:

>[14:32:14] bash> cat t.sh 
#!/bin/bash
for filename_to_check in *.root
  do
# echo -e "I found filename", $filename_to_check ;
   root.exe -b -l -e 'auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }'
  done ;

what I get is an infinite root loop which I cannot exit from.

> [14:33:14] bash>./t.sh
root [0]
root (cont'ed, cancel with .@) [1].q
root [0]
root (cont'ed, cancel with .@) [1].@
root [2] .q
root [0]
root (cont'ed, cancel with .@) [1].@
root [2] .q
.root [0]
root (cont'ed, cancel with .@) [1].q
root [0]
root (cont'ed, cancel with .@) [1]

I tried to change the exit(1) to sys.exit() but it did not work…

Many many thanks for your kind reply and your interest !

Cheers,
Nikos

Hi,

If you want to execute root.exe in the loop, you should add the -q argument. Like:

root.exe -b -l -q  -e 'auto f = TFile::Open(...'

This will exit the ROOT session with exit code 0 once all commands are executed.

Regards,
Sergey

Hi,

As Sergey pointed out, the -q is necessary. In addition (see your favorite shell script manual), using single quotes around the string means that escaping the double quotes is no longer needed, but it also means that shell variables are not expanded, i.e. you cannot use single quotes in this case. Please try

#!/bin/bash
for filename_to_check in *.root
  do
# echo -e "I found filename", $filename_to_check ;
   root.exe -b -l -q -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }"
  done ;

Also make sure that during the copy/paste the double quotes " are not changed to the fancy kind (“ ”).

Cheers,
Philippe.
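
To make the quoting point concrete, here is a minimal sketch that does not need ROOT at all; it only shows what string the shell would hand to root.exe in each case. The file name `example.root` is a made-up placeholder.

```shell
#!/bin/bash
filename_to_check="example.root"   # hypothetical file name, for illustration only

# Single quotes: the shell expands nothing, and the backslashes stay literal.
single='TFile::Open(\"$filename_to_check\")'

# Double quotes: the variable is expanded and \" becomes a plain ".
double="TFile::Open(\"$filename_to_check\")"

printf '%s\n' "$single"   # prints: TFile::Open(\"$filename_to_check\")
printf '%s\n' "$double"   # prints: TFile::Open("example.root")
```

With single quotes ROOT receives stray backslashes and an unexpanded `$filename_to_check`, which is presumably why it kept waiting for more input instead of running the check.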

Dear Sergey and Philippe,

I have no words to thank you for your kind help and all your advice !!!
Now, at least, I can check the files before I merge.

Kind regards and many thanks
Nikos

Dear @pcanal and @linev,

I hate to bother you again. However, although the files passed the “sanity check”, I now get this message from hadd -f206, which the check did not detect. Is there any way to test for this as well ?

Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.
Error in <TFileMerger::OpenExcessFiles>: cannot open file Pion_v44_Nikosbeam_250.GeV_5379.root
Error in <TFileMerger::Merge>: error during merge of your ROOT files

Many thanks !
Nikos

Hi Nikos,

It seems that you have an empty file with 0 bytes.
When you run your checker script, you should see the same kind of messages:

Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.

Can it be that file “Pion_v44_Nikosbeam_250.GeV_5379.root” simply was not checked?

Regards,
Sergey
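
If empty files are the suspicion, they can also be spotted without starting ROOT. The following is only a sketch: the function name `list_empty_files` is made up for illustration, and it catches 0-byte files only, not truncated ones, so it complements the root.exe check rather than replacing it.

```shell
#!/bin/bash
# list_empty_files FILE...: print every argument that is a regular file of size 0.
# (Hypothetical helper; a truncated but non-empty file will NOT be reported.)
list_empty_files() {
    local f
    for f in "$@"; do
        # -f: regular file exists; -s: file has size greater than zero
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Possible usage in the run directory:
#   list_empty_files *.root
```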

Hi Sergei,

It looks indeed corrupted:

root.exe -b -l -q -e "auto f = TFile::Open(\"Pion_v44_Nikosbeam_250.GeV_5379.root\"); if (f == nullptr || f->IsZombie()) exit(1);"

Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.

Therefore it must be that the loop just did not break when the error appeared. I will investigate ! Many thanks for the hint!!

Cheers,
Nikos

Your script does not check the return value from root.exe.
You should do something like:

#!/bin/bash
for filename_to_check in *.root
  do
   root.exe -b -l -q -e "..."
   if [ $? -ne 0 ]; then
        echo "$filename_to_check has error"
        exit 1
   fi
 done ;
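
Since the goal is to re-submit the failed jobs, a variant that reports every failing file instead of stopping at the first one may be more convenient. This is a sketch under assumptions: the function names `check_one` and `list_bad_files` and the output file `bad_files.txt` are made up for illustration, while the root.exe one-liner is the one given earlier in this thread.

```shell
#!/bin/bash
# check_one FILE: returns 0 if ROOT can open FILE, non-zero otherwise.
# Same one-liner as earlier in the thread; ROOT's messages are discarded here.
check_one() {
    root.exe -b -l -q -e "auto f = TFile::Open(\"$1\"); if (f == nullptr || f->IsZombie()) exit(1);" >/dev/null 2>&1
}

# list_bad_files CHECKER FILE...: print each FILE for which CHECKER fails.
# CHECKER is the name of a command or function that takes one file name.
list_bad_files() {
    local checker=$1
    shift
    local f
    for f in "$@"; do
        "$checker" "$f" || printf '%s\n' "$f"
    done
    return 0
}

# Possible usage in the run directory:
#   list_bad_files check_one *.root > bad_files.txt
```

Passing the checker as an argument keeps the loop independent of ROOT, so a different test (for example a size check) can be plugged in without touching the loop.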

Dears @linev and @pcanal,

This solved my problem. Your help is much, much much appreciated !
Cheers,
Nikos
