When running a big Monte-Carlo production that produces ROOT files (let’s say 1000 jobs), all jobs finish normally. However, when I try to hadd the result files together, some of them seem to be “Zombies”, and I get this error from hadd -f206:
Error in <TFile::Init>: file Pion_v44_Nikosbeam_300.GeV_791.root is truncated at 6225920 bytes: should be 16831501, trying to recover
Info in <TFile::Recover>: Pion_v44_Nikosbeam_300.GeV_791.root, recovered key TDirectoryFile:VirtualDetector at address 266
Info in <TFile::Recover>: Pion_v44_Nikosbeam_300.GeV_791.root, recovered key TDirectoryFile:NTuples at address 395
Warning in <TFile::Init>: successfully recovered 2 keys
Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_300.GeV_791.root, got 0 of 4493
Warning in <TFile::GetRecordHeader>: Pion_v44_Nikosbeam_300.GeV_791.root: failed to read the StreamerInfo data from disk.
Error in <TFileMerger::AddFile>: cannot open file Pion_v44_Nikosbeam_300.GeV_791.root
Maybe you have an idea on:
What could be causing this? Is there any way to “repair” these corrupted files? I triple-checked the specific job’s output file, and the job seems to have finished normally; however, the *.root file is corrupted in that way.
Is there a way to write a script/small program that “tests” these files after the jobs are finished, so that I re-submit them before I start the hadd process ? I am not an expert on the subject and I wonder if this is possible at all.
Many thanks for the prompt reply. Actually, these are ROOT files that are created directly by G4beamline (http://www.muonsinternal.com/muons3/G4beamline) and I have no knowledge of the “mechanism” by which they are made (or whether the program closes them properly). So the answer to your question is: I don’t know, and I am not sure how I can check this. If it does not close them, is there a way to do it myself?
Uhm if you are getting corrupted files out of the machinery, you might have better luck asking the people responsible for the machinery itself.
@pcanal might be able to help on the matter of checking if a file is corrupted (the most basic check is to open the file and then look at file->IsZombie(), but there are other checks possible if I’m not mistaken).
This indicates that the file ‘lost’ a large fraction of itself (about 2/3 in this case). Most likely something went wrong on the file system(s) when or after the files were produced; it could be anything from the disk being full when they were written to an aborted download/copy.
Your help is much appreciated. I will try to investigate what is going on… My question would be: is there any “small” program that one could use to “test” these files for this effect when they are produced, so that I could re-run them (before I start the hadd process)?
Would I be very rude, if I just asked you how it is possible to make the one-liner to check all the *.root files in the run directory, and report on the ones that are crashed ?
Trying your script in lxplus with :
root.exe -b -l -e “auto f = TFile::Open(“Pion_v44_Nikosbeam_250.GeV_10000.root”); if (f == nullptr || f->IsZombie()) exit(1);”
There is a fatal typo in the copy: the inner quotes must be escaped with \:
root.exe -b -l -e "auto f = TFile::Open(\"Pion_v44_Nikosbeam_250.GeV_10000.root\"); if (f == nullptr || f->IsZombie()) exit(1);"
In bash you can do:
for filename_to_check in *.root
do
root.exe -b -l -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) exit(1);"
# ... check the result (the exit status, $?) here ...
done;
or even
for filename_to_check in *.root
do
root.exe -b -l -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }"
done;
I have no words to thank you for your efforts. Can you help me with one final question? What am I doing wrong that the script does not exit properly? When I use the bash version:
>[14:32:14] bash> cat t.sh
#!/bin/bash
for filename_to_check in *.root
do
# echo -e "I found filename", $filename_to_check ;
root.exe -b -l -e 'auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }'
done ;
what I get is an infinite root loop which I cannot exit from.
> [14:33:14] bash>./t.sh
root [0]
root (cont'ed, cancel with .@) [1].q
root [0]
root (cont'ed, cancel with .@) [1].@
root [2] .q
root [0]
root (cont'ed, cancel with .@) [1].@
root [2] .q
.root [0]
root (cont'ed, cancel with .@) [1].q
root [0]
root (cont'ed, cancel with .@) [1]
I tried to change the exit(1) to sys.exit() but it did not work…
Many many thanks for your kind reply and your interest !
As Sergey pointed out, the -q is necessary. Also (see your favorite shell scripting manual), using single quotes around the string means that escaping the double quotes is no longer needed, but it also means that shell variables are not expanded, i.e. you cannot use single quotes in this case. Please try:
#!/bin/bash
for filename_to_check in *.root
do
# echo -e "I found filename", $filename_to_check ;
root.exe -b -l -q -e "auto f = TFile::Open(\"$filename_to_check\"); if (f == nullptr || f->IsZombie()) { cout << \"There is a problem with the file: $filename_to_check\n\"; exit(1); }"
done ;
Also make sure that during the copy/paste the double quotes " are not changed to the fancy kind: “
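The quoting point above can be checked without ROOT at all. The following is only a small sketch; the file name example.root and the variable names single and double are made up for illustration. It builds the same kind of string once with single quotes and once with double quotes and prints both, so you can see what root.exe would actually receive.

```shell
#!/bin/bash
# Hypothetical file name, used only for this demonstration.
filename_to_check="example.root"

# Single quotes: everything is literal, so the backslashes stay
# and the shell variable is NOT expanded.
single='TFile::Open(\"$filename_to_check\")'

# Double quotes: \" becomes a plain " and the variable IS expanded.
double="TFile::Open(\"$filename_to_check\")"

printf '%s\n' "$single"   # prints: TFile::Open(\"$filename_to_check\")
printf '%s\n' "$double"   # prints: TFile::Open("example.root")
```

Only the second form is valid C++ for ROOT to interpret, which is why the double-quoted version of the script is the one to use.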
I hate to bother you again. However, although the files passed the “control” test, I now get this message from hadd -f206, which was not detected by the “sanity check”. Is there any way to test for this as well?
Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.
Error in <TFileMerger::OpenExcessFiles>: cannot open file Pion_v44_Nikosbeam_250.GeV_5379.root
Error in <TFileMerger::Merge>: error during merge of your ROOT files
It seems you have an empty file with 0 bytes.
When you run your checker script, you should see the same kind of messages:
Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.
Can it be that file “Pion_v44_Nikosbeam_250.GeV_5379.root” simply was not checked?
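If empty files are the suspicion, they can be listed with plain shell tests before ROOT is involved at all. This is just a minimal sketch; the function name list_empty_root_files is invented here and is not part of ROOT or hadd.

```shell
#!/bin/bash
# Print every *.root file in the given directory (default: current
# directory) that has a size of 0 bytes.
list_empty_root_files() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.root; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        # -s is true only when the file exists and is larger than 0 bytes.
        [ -s "$f" ] || echo "$f"
    done
}
```

Calling `list_empty_root_files .` in the run directory prints one file name per line. Note that this only catches completely empty files; a truncated file with some content still needs the TFile::Open check.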
root.exe -b -l -q -e "auto f = TFile::Open(\"Pion_v44_Nikosbeam_250.GeV_5379.root\"); if (f == nullptr || f->IsZombie()) exit(1);"
Error in <TFile::ReadBuffer>: error reading all requested bytes from file Pion_v44_Nikosbeam_250.GeV_5379.root, got 0 of 300
Error in <TFile::Init>: Pion_v44_Nikosbeam_250.GeV_5379.root failed to read the file type data.
Therefore it must be that the loop just did not break when the error appeared. I will investigate ! Many thanks for the hint!!
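One way to avoid overlooking a failure in a long loop is to let the shell look at the exit status of each check and collect the failing names, instead of relying on reading the messages. The sketch below assumes root.exe is on the PATH and reuses the one-liner from this thread; the function names check_root_file and find_bad_root_files are hypothetical.

```shell
#!/bin/bash
# Return 0 if ROOT can open the file, non-zero otherwise.
# Assumption: root.exe is on the PATH. -q makes ROOT quit after the -e code,
# and exit(1) inside the -e code gives the process a non-zero exit status.
check_root_file() {
    root.exe -b -l -q -e "auto f = TFile::Open(\"$1\"); if (f == nullptr || f->IsZombie()) exit(1);" > /dev/null 2>&1
}

# Check every *.root file in a directory (default: current directory)
# and print the names of the ones that fail. The loop deliberately keeps
# going after a failure so that every file is tested.
find_bad_root_files() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.root; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        if ! check_root_file "$f"; then
            echo "$f"
        fi
    done
}
```

For example, `find_bad_root_files . > bad_files.txt` (the output file name is arbitrary) would leave a list of files whose jobs need to be re-submitted before hadd is started.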