Too big root files

Hi to everybody.

I have a framework (I think the details are not interesting, but you can look here: [url] Features I added to ROOT) that loops over a TChain and writes some histograms (TH1F and TH2F) and some TTrees to a .root file. BEFORE the end of the process the TTrees are deleted and only the histograms are left in the file.

The size of the file obviously increases during processing with the number of events processed (or rather with the number of events filled into the TTrees…), but I would like to get this space back when the TTrees are deleted, and that is not what happens…

The total uncompressed size of the histograms in this file is something like 500MB. I calculated this from the total number of bins, multiplied by the size of the type used [F] and by 2 since I’m using Sumw2(). The calculation is not perfect, since it ignores the per-histogram “overhead”, but it should not be far off. Yet in some cases (when I run over a large data set) I obtain files of 1GB or even 10GB…
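
The back-of-the-envelope estimate described above can be written down as a small sketch. The function names are hypothetical and nothing here calls ROOT. Two assumptions differ slightly from the calculation in the post: each axis is taken to carry one underflow and one overflow cell in addition to the visible bins, and Sumw2() is assumed to add one double (8 bytes) per cell, which gives 12 bytes per cell for a float histogram rather than 8. Per-histogram overhead and compression are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Assumed payload widths in bytes (hypothetical constants for this sketch).
const std::uint64_t kFloatBytes = 4;   // one float bin content
const std::uint64_t kDoubleBytes = 8;  // one double sum of squared weights

// Stored cells along one axis: visible bins plus underflow and overflow.
std::uint64_t cellsWithFlow(std::uint64_t nbins) {
    assert(nbins > 0);
    return nbins + 2;
}

// Bytes per cell, with or without the Sumw2 array.
std::uint64_t bytesPerCell(bool sumw2) {
    return kFloatBytes + (sumw2 ? kDoubleBytes : 0);
}

// Uncompressed payload estimate for a 1D float histogram.
std::uint64_t estimateTH1F(std::uint64_t nbinsX, bool sumw2) {
    return cellsWithFlow(nbinsX) * bytesPerCell(sumw2);
}

// Uncompressed payload estimate for a 2D float histogram.
std::uint64_t estimateTH2F(std::uint64_t nbinsX, std::uint64_t nbinsY, bool sumw2) {
    return cellsWithFlow(nbinsX) * cellsWithFlow(nbinsY) * bytesPerCell(sumw2);
}
```

Summing these estimates over all histograms gives a lower bound to compare against the size of the file on disk; a file many times larger than that bound points to space that is no longer used by any object.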

Since the file size does not decrease when the trees are removed, I suspected a problem similar to file system fragmentation or to sparse files, and I tried this:

hadd new.root MC.bkp2.root

and then also

hadd -f9 newf9.root MC.bkp2.root

where MC.bkp2.root is a file as described above and with a size of 1.1GB.

As I expected, this operation (in principle a pointless one, since it should be equivalent to copying the file…) reduces the file to a more “normal” size. Here are the file sizes:

-rw-r--r--  1 bozzo  staff   1,1G 17 Gen 17:44 MC.bkp2.root
-rw-r--r--  1 bozzo  staff    24M 17 Gen 17:51 new.root
-rw-r--r--  1 bozzo  staff    11M 17 Gen 17:53 newf9.root

What can I do?! Is this a known problem?! Am I making a mistake somewhere?!

Thanks,
Matteo

Hi,

By design, a TFile never shrinks. You do the right thing: open a new file, copy the histograms there, delete (as in unlink) the old one.

Cheers, Axel.

I understand the choice not to have the TFile shrink automatically, but I was hoping for a TFile::Shrink() method to call before TFile::Write() and TFile::Close(), if needed…

In fact I’m essentially doing what you said, with hadd, without the need to write additional code…

Thanks for the reply!
Matteo

Dear Axel,

I have never understood the decision that TFile does not shrink when you delete one or more objects. Especially, if you have a TFile with many hundred trees and want to delete a couple of them, then it should be possible to have at least a command which shrinks the TFile.

As you might know I am also working with R and tend to compare ROOT and R. The R analog to TFile is called “Rdata”, which holds all R objects. Like TFiles, “Rdata” files are OS independent and can be copied to different machines. “Rdata” files can also become many GB in size, just like TFiles. However, when opening “Rdata” files with R it is possible to delete objects. R even has a command, gc(), to do the garbage collection. When you then close the “Rdata” file it has shrunk to the size of the remaining objects.

I must admit that I consider this to be a design flaw of TFile, and would really appreciate it if there were a command like TDirectory::Shrink() or TDirectory::gc().

Best regards
Christian

Hi Christian,

To first order, the shrink operation would essentially involve rewriting the file from scratch (you would need to move the active parts into the ‘free’ parts) and would usually result in a sub-optimal file size (some of the gaps may not be large enough to fit the active parts). In addition, the TTree objects contain direct references to where their baskets are in the file, so all TTree objects would need to be rewritten/updated. All in all, it is currently simpler and more efficient to copy the file with gaps into a new file. If you have only histograms and TTrees, for example, the following command does the shrink:

hadd -f shorter.root larger.root; mv shorter.root larger.root
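
The behaviour described here (deleting leaves a gap, copying removes it) can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyFile` and `Record` are hypothetical names, the model is not ROOT's on-disk format, and it ignores the reuse of free gaps by later writes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One stored object in the toy file.
struct Record {
    std::string name;
    std::uint64_t offset;  // where the payload starts in the file
    std::uint64_t size;    // payload length in bytes
    bool live;             // false once the object has been deleted
};

class ToyFile {
public:
    // Append a record at the current end of the file.
    void write(const std::string& name, std::uint64_t size) {
        records_.push_back({name, end_, size, true});
        end_ += size;
    }

    // Deleting only marks the span as free; the end of the file does not move.
    void remove(const std::string& name) {
        for (auto& r : records_)
            if (r.name == name) r.live = false;
    }

    // Size the file occupies on disk, gaps included.
    std::uint64_t sizeOnDisk() const { return end_; }

    // Bytes still referenced by live objects.
    std::uint64_t liveBytes() const {
        std::uint64_t total = 0;
        for (const auto& r : records_)
            if (r.live) total += r.size;
        return total;
    }

    // Copy the live records into a fresh file: offsets are reassigned
    // sequentially, so the gaps disappear.
    ToyFile compactCopy() const {
        ToyFile out;
        for (const auto& r : records_)
            if (r.live) out.write(r.name, r.size);
        return out;
    }

private:
    std::vector<Record> records_;
    std::uint64_t end_ = 0;
};
```

In this model, writing a 1000-byte tree and two small histograms, then removing the tree, leaves `sizeOnDisk()` unchanged while `liveBytes()` drops; only `compactCopy()` yields a file whose size matches the live content. Shrinking in place would instead require moving records and updating every stored offset, which is the cost mentioned above for TTree baskets.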

Cheers,
Philippe.

Thanks Philippe,
I’m using exactly your procedure, but I would prefer to use a TFile method inside my C++ program…

Anyway, it’s not a big problem…

Cheers,
Matteo

Hi,

Not quite what you want, but there is also a quasi-equivalent of hadd in compiled code: TFileMerger.

Cheers,
Philippe.

Dear Philippe,

I understand that in the short term it is not possible to change the design. It is also clear to me that in principle you face a problem similar to “disk fragmentation”, although (as far as I understand) this is mainly a MS Windows problem and not so much a problem on Unix/Linux systems. Nevertheless, in the long term this would be the only “elegant” solution :)

Regarding your suggestion to use hadd, I am not sure this would work in my case, since I have TFiles with TTrees in one TDirectory and other TObjects in another TDirectory. Even if I only had TTrees in TDirectories, I am not sure hadd would work.

Best regards
Christian

As I understand its origin, it’s not precisely a “fragmentation” problem but a “sparse file” one: it doesn’t matter where the file system physically puts the segments of the file on disk; what matters is the “index” of each KEY in the TFile which, if I understand correctly, cannot be changed so easily…
So this probably happens independently of the OS and of the file system.
I had this issue on OSX writing to a case-sensitive HFS+ partition. I can easily test it on a case-insensitive one and on Linux with Ext3…

“hadd” surely solves the problem (I also have many TDirectories with many trees and histograms).
hadd normally takes the first source and copies its directory structure into the target. Then, for each “allowed” TObject (histograms and trees), it tries to merge all the “same” ones (those with the same name inside a TDirectory with the same name) and puts the result into the target. With only 1 source this is the same as copying each TObject from the source to the target (but without the “blank” space between them…)
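
The merge-by-name idea described above can be sketched with plain containers. This is a toy model, not hadd's implementation: `ToyHist`, `ToySource` and `mergeSources` are hypothetical names, a "file" is just a map from a "directory/object" path to bin contents, and merging is assumed to mean adding bin contents.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins: a histogram is its bin contents, a source file maps
// "directory/object" paths to histograms.
using ToyHist = std::vector<double>;
using ToySource = std::map<std::string, ToyHist>;

// Merge all sources into one target, combining objects that share a path.
ToySource mergeSources(const std::vector<ToySource>& sources) {
    ToySource target;
    for (const auto& src : sources) {
        for (const auto& entry : src) {
            auto it = target.find(entry.first);
            if (it == target.end()) {
                // First time this path is seen: plain copy into the target.
                target.insert(entry);
            } else {
                // Same path seen again: add the contents bin by bin.
                assert(it->second.size() == entry.second.size());
                for (std::size_t i = 0; i < entry.second.size(); ++i)
                    it->second[i] += entry.second[i];
            }
        }
    }
    return target;
}
```

With a single source every path is seen exactly once, so only the copy branch runs and the target has the same content as the source. That is why a one-file merge behaves like a compacting copy.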

Cheers,
Matteo

Dear Matteo,

Thank you for your clarification.

Nevertheless, in my case hadd does not work, since I am also using my own classes:

188-23-79-201:xps-1.x.x$ hadd -f tmp_move.root tmp_QualTest3.root                                      
Target file: tmp_move.root
Source file 1: tmp_QualTest3.root
Warning in <TClass::TClass>: no dictionary for class XPosition is available
Warning in <TClass::TClass>: no dictionary for class XTreeInfo is available
Warning in <TClass::TClass>: no dictionary for class XFolder is available
Warning in <TClass::TClass>: no dictionary for class XTreeSet is available
Warning in <TClass::TClass>: no dictionary for class XTreeHeader is available
Warning in <TClass::TClass>: no dictionary for class XResidual is available
Warning in <TClass::TClass>: no dictionary for class XQCExpression is available
Warning in <TClass::TClass>: no dictionary for class XExpression is available
Warning in <TClass::TClass>: no dictionary for class XExpressionTreeInfo is available
Warning in <TClass::TClass>: no dictionary for class XBorder is available
Warning in <TClass::TClass>: no dictionary for class XBordTreeInfo is available
Warning in <TClass::TClass>: no dictionary for class XGCProcesSet is available
Warning in <TClass::TClass>: no dictionary for class XPreProcesSet is available
Warning in <TClass::TClass>: no dictionary for class XProcesSet is available
Target path: tmp_move.root:/
Found subdirectory QCSet
Target path: tmp_move.root:/QCSet
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA1_raw.res entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA1_raw.rlm entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA2_raw.res entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA2_raw.rlm entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB1_raw.res entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB1_raw.rlm entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB2_raw.res entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB2_raw.rlm entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XExpressionTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA1_raw.brd entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestA2_raw.brd entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB1_raw.brd entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
tmp_QualTest3.root tree:QCSet/TestB2_raw.brd entries=1234567890
Error in <TBufferFile::ReadObject>: trying to read an emulated class (XBordTreeInfo) to store in a compiled pointer (TObject)
Error in <TBufferFile::CheckByteCount>: object of class TNamed read too few bytes: 18 instead of 1278
Error in <TBufferFile::ReadObject>: object tag too large, I/O buffer corrupted
Error in <TBufferFile::CheckByteCount>: object of class TFolder read too few bytes: 1289 instead of 1294
Cannot merge object type, name:  title: 
188-23-79-201:xps-1.x.x$ 

Best regards
Christian

Hi Christian,

You could improve the situation by allowing hadd to load your library, simply by creating the corresponding rootmap file (see the executable rlibmap).

Cheers,
Philippe.

Ah ok…

If your classes derive from TH1 or from TTree, it could be sufficient to compile “hadd” with your libraries linked. “hadd” will call TH1::Merge() or TTree::Merge(), and if your classes are not so different from their parent classes you will be ok… Otherwise you have to write your own Merge() method and, again, recompile “hadd” with your libraries linked.

If your classes do not derive from them, you have to modify hadd itself or probably live with the problem…

Cheers,
Matteo

Philippe’s solution is probably more adequate…

Dear Matteo, dear Philippe,

Thank you for your suggestions, the rootmap file may be a good solution.

Best regards
Christian