Corrupted ROOT files after editing in UPDATE mode: R__unzip errors

Hi,

For our experiment’s Data Quality Monitoring, we monitor the raw data in numerous small jobs, each of which produces a large ROOT file of histograms. We then merge all of the ROOT files from the individual jobs into a single file. After this, we may compute efficiencies, perform fits, etc. on the final merged ROOT file. To do this we open the ROOT file with mode ‘UPDATE’ and either overwrite existing keys via ‘obj->Write("", TObject::kOverwrite);’ or create completely new histograms and save them to the file. Each detector sub-system, trigger, or physics object group may have its own ‘post-processing’ algorithm that opens the final ROOT file in ‘UPDATE’ mode. So in the end this final file is opened in UPDATE mode and edited maybe 20 times, with perhaps thousands of individual TObject edits.
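To make the pattern concrete, here is a minimal sketch of what one such post-processing step looks like (the file and histogram names, merged.root and Muons/pt, are just placeholders, not our real ones):

#include "TFile.h"
#include "TH1.h"

void postprocess() {
   TFile f("merged.root", "UPDATE");      // the final merged file
   TH1* h = nullptr;
   f.GetObject("Muons/pt", h);            // read an existing histogram
   if (h) {
      h->Scale(2.0);                      // some edit; in reality a fit, efficiency, normalisation, ...
      h->GetDirectory()->cd();            // go back to the histogram's original directory
      h->Write("", TObject::kOverwrite);  // overwrite the existing key
   }
   f.Close();
}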

There is a low rate at which the file becomes corrupted (we don’t know the exact rate). It is extremely difficult to tell which algorithms are the culprits, and the affected histograms in the final file (the ones giving R__unzip errors) may belong to a non-offending algorithm. We can sometimes find solutions, but the reasons why they work are never satisfying.

So I wanted to see if I could make a short script illustrating the problem on a generic ROOT file with many histograms. I have therefore attached a rather perverse script that recursively loops through a ROOT file and, for any object that can be cast to a TH1, trivially modifies the histogram and then saves it back to its original directory via ‘TDirectory::cd(); obj->Write("", TObject::kOverwrite)’ (also tried TObject::kWriteDelete), to see if I can reproduce any of these R__unzip errors.
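Roughly, the script does something like the following; this is only a sketch of the idea, not the attached script itself (m_copy_max and the actual modification are left out):

#include "TFile.h"
#include "TDirectory.h"
#include "TKey.h"
#include "TH1.h"
#include "TCollection.h"

void overwriteAll(TDirectory* dir) {
   TIter next(dir->GetListOfKeys());
   while (TKey* key = (TKey*)next()) {
      TObject* obj = key->ReadObj();
      if (TDirectory* sub = dynamic_cast<TDirectory*>(obj)) {
         overwriteAll(sub);                  // recurse into sub-directories (and do not delete them)
      } else if (TH1* h = dynamic_cast<TH1*>(obj)) {
         h->Scale(1.0);                      // trivial modification, just to force a rewrite
         dir->cd();                          // back to the histogram's original directory
         h->Write("", TObject::kOverwrite);  // or TObject::kWriteDelete
      }
   }
}

void overwriteFile(const char* name = "test.root") {
   TFile f(name, "UPDATE");
   overwriteAll(&f);
   f.Close();
}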

Maybe this script is not the best example, but it does seem to show that opening files in UPDATE mode and modifying objects can cause problems if done in excess. Here is how to run the script:

cp $ROOTSYS/tutorials/io/dirs.C .
root -q -b dirs.C
root -q -b test2.C+
./scanFile.py test.root >/dev/null  # on my Mac the file comes out as Test.root, but Linux does what I want

Depending on the value of m_copy_max, a variable in the script that controls how many TH1s get overwritten, one can see these R__unzip errors coming from TKey::ReadObj(). On lxplus (SLC5, ROOT 5.30), sourced directly from here:

/afs/cern.ch/sw/lcg/app/releases/ROOT/5.30.00/x86_64-slc5-gcc43-opt/root

the above commands work as given. On my Mac, test.root becomes Test.root and ./scanFile.py becomes python scanFile.py.

Anyway, hopefully you can see a problem with my script, suggest a workaround (such as copying the file after each subsequent edit to remove any dead space left by deleted objects), or find something within ROOT?
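For the copying idea, I imagine something along these lines could produce a compacted copy once all the UPDATE passes are done; this is just an untested guess using TFileMerger (the machinery behind hadd), not something we have tried:

#include "TFileMerger.h"

void compact(const char* in = "test.root", const char* out = "test_compact.root") {
   TFileMerger merger(kFALSE);   // kFALSE: do not make a local copy of the input
   merger.OutputFile(out);
   merger.AddFile(in);
   merger.Merge();               // rewrites every object into a fresh file, dropping freed gaps
}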

Thanks,
Justin
scanFile.py (2.29 KB)
test2.C (1.75 KB)

Hi,

So the issue with the script I posted, as somebody pointed out to me, is that one must not delete TDirectory objects obtained from calls such as TKey::ReadObj(). Hopefully this is the cause of our problems.

Cheers,
Justin

Hi Justin,

Yes, deleting a TDirectory (that belongs to a TFile) will result in unexpected behavior.

Cheers,
Philippe.

Hi Philippe,

So it makes sense to me that deleting a TDirectory before closing the file gives undefined results. Is there a way you could make ROOT let the user know that they are doing something potentially fatal? It could be very helpful in large applications.

For example, in ~TDirectory(), print a WARNING or ERROR if this object is not itself a TFile and is a child of a TFile that is open in a writeable mode. Then when ~TFile deletes its sub-directories it can somehow suppress these messages.

I realize that the user shouldn’t be doing this delete anyhow, but when numerous people work on a large application and person A deletes a TDirectory in UPDATE mode that affects person B’s histograms, bug tracing can be very difficult.

Thanks,
Justin

Hi Justin,

Unfortunately, in C++ there is no easy way to know the caller of a destructor (short of using global variables), and there are also legitimate uses for TDirectory objects that are not part of a TFile; in those cases one might need to delete them.
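For example, a purely in-memory directory like this is perfectly legitimate, and in that case the user really does need to delete it (a quick sketch):

#include "TROOT.h"
#include "TDirectory.h"
#include "TH1F.h"

void inMemoryDir() {
   // Mother is gROOT, so this directory is not part of any TFile.
   TDirectory* scratch = new TDirectory("scratch", "in-memory only", "", gROOT);
   scratch->cd();
   TH1F* h = new TH1F("h", "temporary histogram", 10, 0., 1.);
   h->Fill(0.5);
   // ... use the objects ...
   delete scratch;   // fine here; it also deletes the histograms it owns
   gROOT->cd();      // make sure gDirectory points somewhere valid again
}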

Cheers,
Philippe.

Hi Philippe,

Sure, there are already enough global variables. Anyway, I think I have found some undefined behavior in my code, which is essentially doing this:

TIter itr(f->GetListOfKeys());                                   // f is the open TFile
TKey* key0 = dynamic_cast<TKey*>(itr());                         // first key in the file
TDirectory* dir0  = dynamic_cast<TDirectory*>(key0->ReadObj());  // reads the directory into memory
TDirectory* dir00 = dynamic_cast<TDirectory*>(key0->ReadObj());  // reads it AGAIN: a second, distinct object

When I cout dir0 and dir00, I see two distinct memory addresses, so the same on-file TDirectory has been allocated twice in memory. Could this cause adverse results? When I removed the extra ‘ReadObj()’ calls, my code seems to produce ROOT files in a good state.
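Removing the extra call essentially amounts to this:

TIter itr(f->GetListOfKeys());
TKey* key0 = dynamic_cast<TKey*>(itr());
TDirectory* dir0  = dynamic_cast<TDirectory*>(key0->ReadObj());  // read the directory once
TDirectory* dir00 = dir0;                                        // reuse the same in-memory object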

Thanks,
Justin

[quote]When I cout dir0 and dir00, I see two distinct memory addresses, so the same on-file TDirectory has been allocated twice in memory. Could this cause adverse results?[/quote]Yes, it would. There are now two objects in memory that think they represent the same directory (on file), and unless they are kept 100% in sync, bad things will indeed happen.

Cheers,
Philippe.