Can one check the integrity of a TTree written into a file?

vaurynovich · May 27, 2015, 5:36pm

Hello Everybody,

I was wondering whether it is possible to check the integrity of a TTree written into a file.

Quite often, while analyzing the entries of a tree stored in a ROOT file, I find that some entries are corrupted. I only see this when I actually read the tree entries for analysis; opening the file and getting a pointer to the tree is not enough. Such problematic trees can also do something crazy, like making ROOT allocate ~12 GB of memory just for reading, while the same trees in uncorrupted files only require ~70 MB of RAM. The errors I see look like these (4 types of errors):

Error in <TBasket::Streamer>: The value of ...
Error R__unzip_header: error in header
Error in <TBranch::GetBasket>: ...
*** glibc detected *** root.exe: malloc(): memory corruption:

I would really love it if it were possible to detect such problems early, i.e. to open a file, get a pointer to a tree, and call some function that checks all the TTree entries for integrity, instead of catching such corrupted files one by one during analysis.

Thank you,
Siarhei.

pcanal · May 29, 2015, 4:21pm

Hi,

Not directly. Assuming the file is produced correctly, you can record the result of TMD5::FileChecksum and cross-check the value when reading the file.
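
As a rough illustration of the record-then-cross-check idea, here is a minimal shell sketch. It uses the `md5sum` tool from GNU coreutils as a stand-in for TMD5::FileChecksum (both compute an MD5 digest of the file contents, so they should agree). The function names and the `.md5` sidecar file are assumptions made for this example, not part of ROOT.

```shell
#!/bin/bash
# Sketch: store the MD5 digest of a file next to it right after writing,
# then check the digest again before reading. All names are hypothetical.

# Print only the hex digest of a file.
file_md5() {
  md5sum "$1" | awk '{print $1}'
}

# Save the digest of <file> into the sidecar file "<file>.md5".
record_checksum() {
  file_md5 "$1" > "$1.md5"
}

# Return 0 only if <file> exists, a non-empty sidecar exists,
# and the current digest equals the recorded one.
verify_checksum() {
  [ -f "$1" ] && [ -s "$1.md5" ] &&
    [ "$(file_md5 "$1")" = "$(cat "$1.md5")" ]
}
```

This only detects changes made after the checksum was recorded; it cannot tell whether the file was written correctly in the first place.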

Cheers,
Philippe.

vaurynovich · June 3, 2015, 5:31pm

Hi Philippe,

I think it is exactly the problem: the files are not written correctly to begin with (it is not the case that they are written correctly and then get corrupted). I write rather large trees to a remote node:

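// Open the output file directly on the remote node
TXNetFile file("root://eos...", "CREATE");
// Fill the tree event by event (tree and the loop condition are placeholders)
while (_there_are_more_events_) tree->Fill();
file.Close();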

After several attempts to process the data, the files do get created correctly eventually. Also, it seems to help to write files locally (i.e. on AFS) first, and then transfer them to the remote storage (i.e. CERN EOS).

Maybe there is some “safer” way to write to remote nodes, e.g. computing the md5 sum of each piece of data locally and then remotely, and retrying the write if the two sums do not agree?

Thank you,
Siarhei.

pcanal · June 3, 2015, 5:39pm

Hi,

[quote]I think it is exactly the problem: the files are not written correctly to begin with[/quote]

Then the real problem/solution is to get proper feedback from the remote site (CERN EOS) that the data transfer failed. If for some reason we cannot achieve this, I agree that a solution would be to write to a local file (I would not use AFS though, but a locally attached disk), take its md5, and then upload it.
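
The write-locally, take-md5, upload sequence could look roughly like the sketch below. It verifies the upload by copying the remote file back and comparing digests, which is only one possible approach and doubles the transfer volume. `COPY_CMD` and `upload_and_verify` are hypothetical names introduced for this example; the copy tool is assumed to take a source and a destination argument, as `xrdcp` does.

```shell
#!/bin/bash
# Sketch: upload a locally written file, download it again, and compare
# MD5 digests. COPY_CMD is a hypothetical knob that defaults to xrdcp.
COPY_CMD="${COPY_CMD:-xrdcp}"

# Print only the hex digest of a file.
file_md5() {
  md5sum "$1" | awk '{print $1}'
}

# upload_and_verify <local_file> <remote_url>
# Returns 0 only if both copies succeed and the round-tripped file
# has the same digest as the local original.
upload_and_verify() {
  local src="$1" dst="$2" tmpdir rc=1
  tmpdir="$(mktemp -d)" || return 1
  if "$COPY_CMD" "$src" "$dst" && "$COPY_CMD" "$dst" "$tmpdir/roundtrip"; then
    [ "$(file_md5 "$src")" = "$(file_md5 "$tmpdir/roundtrip")" ] && rc=0
  fi
  rm -rf "$tmpdir"
  return "$rc"
}
```

A non-zero return value covers both a failed transfer and a digest mismatch, so the calling job can treat either case as "output not safely stored".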

Cheers,
Philippe.

PS. Is it also plausible that there is a problem with the code? I.e. does the file sometimes get corrupted even when writing to a local disk?

vaurynovich · June 3, 2015, 7:22pm

Hi Philippe,

[quote]Then, the real problem/solution is to get proper feedback from the remote site (CERN EOS) that the data transfer failed.[/quote] That would be great! This is exactly what I was hoping could be done automatically, with some auto-correction mechanism. It might be slower, but it would still be better than recreating the same files over and over again (which is a terrible waste of time).

[quote]a solution would be to write to a local file (I would not use afs though but a locally attached disk)[/quote] This is hard to implement since I do not have access to LSF batch nodes which process my data.

[quote]Is it also plausible that there is a problem with the code? I.e. does the file sometimes get corrupted even when writing to a local disk?[/quote] I do not have enough statistics to be sure. Here are the facts from my recent experience:
[ul]
[li] after several attempts, most of the files were created correctly (I did not even recompile my executable, just restarted it a few times)[/li]
[li] a few stubborn files (they seemed to correspond to especially long runs) kept coming out corrupted, so after a few attempts I ran the same executable as an interactive process (no LSF) with a local (AFS) directory for output, and all of those files were created correctly on the first attempt[/li]
[li] not all of the tree entries are corrupted: most of them are read fine during analysis, but some entries in the middle produce a large number of errors, so it seems that only some branch buffers are written incorrectly[/li]
[li] in another executable, I fill the trees (including large ones) first and then save them (as one large buffer), and it has never failed so far[/li][/ul]
The facts seem to suggest that the probability of writing a corrupted tree increases with the number of chunks of data sent for writing to a remote node.

But the conclusion is still the same: any way to check the integrity of a tree (during writing or after the fact) would be very useful.

Thank you for your answers,
Siarhei.

dhsmith · June 4, 2015, 8:23am

Hi Siarhei,

I’m going to follow up and try to reproduce the problem, and then either fix it or check with the remote site to see what can be done (depending on where exactly the problem appears to happen). In the meantime, I think the best option for you for now is to write the file to a scratch space on the batch node and then upload it to EOS once it is complete.

[… about writing to a local file …]

Usually batch systems provide a scratch space for your job to use for temporary output files. This is the case for the CERN LSF batch nodes, where the job is started in a directory that is treated as scratch space:

information-technology.web.cern. … tch-system

So the idea would be to write the ROOT file(s) to the directory that the job is started in and then copy them to EOS within the batch job, but only after the ROOT files have been fully written, e.g. by using the xrdcp command in a shell script:

#!/bin/bash

# Run the executable that makes the ROOT files
# and writes its output to the current directory
./myTreeProducer.exe
ret=$?
if [ "$ret" -ne 0 ]; then
  echo "there was a problem running the application that writes the ROOT files"
  exit "$ret"
fi

# Copy the output from the local disk to the remote storage
xrdcp myoutput1.root root://eos...//the_path/myoutput1.root
ret=$?
if [ "$ret" -ne 0 ]; then
  echo "there was a problem while uploading the ROOT file output to EOS"
  exit "$ret"
fi

# Job successful
exit 0

Do you think you could try this in your environment?
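
If transient transfer failures are expected, the upload step in the script above could be wrapped in a bounded retry. The following is a minimal sketch; the `retry` function and the `RETRY_DELAY` variable are hypothetical names introduced for this example.

```shell
#!/bin/bash
# Sketch: run a command up to <max_attempts> times until it succeeds.
# RETRY_DELAY (seconds between attempts) is a hypothetical knob, default 5.

# retry <max_attempts> <command> [args...]
retry() {
  local max="$1" attempt=1
  shift
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}
```

In the job script this might be called as `retry 3 xrdcp -f myoutput1.root root://eos...//the_path/myoutput1.root`, where `-f` lets xrdcp overwrite a partial file left behind by a failed attempt. Note that retrying only helps when the copy command actually reports the failure through its exit status.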

To reproduce the problem, it would be ideal if you could provide a script or program source that writes output remotely and shows the problem. However, if that is not convenient for you (e.g. because of dependencies on your working environment or the complexity of the programs), I can try to reproduce it without that.

Yours,
David

vaurynovich · June 4, 2015, 10:14am

Hi David,

Thank you so much for the tip!!! I will certainly try this approach the next time I need to process some data.

I have no problem providing the source code and all the instructions on how to compile and run it, but the program is indeed quite large and complex, most of its code was not written by me, and it has external library dependencies. I am guessing you would not want to dive into it, but if you feel brave, please let me know and I will provide you with all the details.

I have just checked, though, and I still have 3 corrupted ROOT files that happened not to get deleted. If it would be of any help, I could give you access to them. In the meantime, here are a few of the full error messages I get while trying to read the files:

Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-5998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-399321558) ; trying to recover by setting it to zero
Error in <TBranch::GetBasket>: File: root://eosams.cern.ch//eos/ams/user/v/vaurynov/microDST9_v5/microDST9_v5_nominal.exe/hide/eos.filelist_microDST9_v5_nominal.exe_341_3.root at byte:-4842704464808226704, branch:tk_hit_coo, entry:884, badread=1, nerrors=1, basketnumber=3
Error in <TBranch::GetBasket>: File: root://eosams.cern.ch//eos/ams/user/v/vaurynov/microDST9_v5/microDST9_v5_nominal.exe/hide/eos.filelist_microDST9_v5_nominal.exe_461_2.root at byte:-5353481190596357856, branch:ev_mfcorT, entry:3988, badread=1, nerrors=10, basketnumber=1
 file probably overwritten: stopping reporting error messages
Error in <TBasket::Streamer>: The value of fKeylen is incorrect (-5998) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-399321558) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fNbytes is incorrect (-1922245779) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fNevBufSize is incorrect (-928154811) ; trying to recover by setting it to zero
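
Since these errors only show up when the entries are actually read, one way to flag bad files in bulk is to run a full read pass over each file, capture ROOT's output in a log, and scan the log for the signatures quoted in this thread. The sketch below covers only the scanning part; the function name is hypothetical, and the list of patterns is taken from the messages above rather than from any exhaustive list of ROOT errors.

```shell
#!/bin/bash
# Sketch: return 0 if a captured ROOT log contains any of the corruption
# signatures quoted in this thread. The function name is hypothetical.
log_shows_corruption() {
  local log="$1"
  # -F treats each pattern as a fixed string, so <, >, ( and ) are literal.
  grep -q -F \
    -e 'Error in <TBasket::Streamer>' \
    -e 'Error R__unzip_header' \
    -e 'Error in <TBranch::GetBasket>' \
    -e 'malloc(): memory corruption' \
    "$log"
}
```

The log could come from something like `root -l -b -q read_all.C > read.log 2>&1`, where `read_all.C` stands for a user-written macro (assumed here, not shown) that opens the file and reads every entry of the tree.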

Thank you again for your advice!
Siarhei.

dhsmith · June 4, 2015, 12:10pm

Hi,

I’ll make some tests and try to reproduce the problem. I don’t think I’ll need the corrupted ROOT files as samples, so there is no need to give me access now. If it doesn’t cause you a problem, please keep them aside for a few days, in case it appears they could be helpful.

Please could you let me know which OS was being used, and which version of ROOT you use? (If it’s a standard shared ROOT build, could you let me know the installation path, so I can see exactly which version and type of build was running?)

Thank you,
David

vaurynovich · June 4, 2015, 4:25pm

Hi David,

Thank you very much for looking into it!

[quote]If it doesn’t cause you a problem, please keep them aside for a few days, in case it appears they could be helpful.[/quote] Of course, I will keep them!

Scientific Linux CERN SLC release 6.6 (Carbon)
Linux version 2.6.32-504.16.2.el6.x86_64 (mockbuild@lxsoft14.cern.ch) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Tue Apr 21 21:44:51 CEST 2015
the one which is installed on lxplus6.cern.ch nodes.

/cvmfs/ams.cern.ch/Offline/root/Linux/root-v5-34-9-gcc64-slc6

Please let me know if I can be of any more help with your tests,
Siarhei.