Hi,
I have 100 files that are each about 300 MB. They were written with (I think) default options under ROOT 5.26. Watching my read-back performance, I can’t help but wonder if there is some file optimization I can do.
For example, I could re-copy them and optimize the basket size, and concatenate them together into 2 GB files (or larger; I’m not limited to the 2 GB size for what I’m doing).
Is there a simple way to do this? To optimize the basket layout/sizes in the files?
Thanks! This information may be available in other places, so I apologize if that is the case - just point me to where I can search. As you can tell, I’m experimenting with things that will make my analysis as fast as possible.
Yes, you could rewrite the file with v5.30 (or at least v5.28), which will write files that work better with the TTreeCache thanks to the new concept of basket clustering (which means that, for a given range of entries, the TTreeCache can read everything it needs in exactly one read). To do so, simply use hadd -O:
hadd -O output.v530.root input.root
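To show the read-back side, here is a minimal sketch of enabling the TTreeCache when reading the rewritten file, assuming TTree::SetCacheSize and TTree::AddBranchToCache as in recent ROOT versions (the file and tree names are placeholders for your own):
[code]
// Sketch: read back with a TTreeCache so that the clustered
// baskets for a range of entries are fetched in large reads.
TFile *f = TFile::Open("output.v530.root");
TTree *tree = (TTree*)f->Get("CollectionTree");
tree->SetCacheSize(30*1024*1024);    // 30 MB cache
tree->AddBranchToCache("*", kTRUE);  // cache all branches
for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
   tree->GetEntry(i);
   // ... analysis ...
}
[/code]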
This failed, but not for a reason I can understand. I’m running with v5.30. Here is the command line stuff:
-bash-3.2$ hadd -O -f merged.root user.Gordon.000217.AANT._00522.root
Target file: merged.root
Source file 1: user.Gordon.000217.AANT._00522.root
Warning in <TClass::TClass>: no dictionary for class AttributeListLayout is available
Warning in <TClass::TClass>: no dictionary for class pair<string,string> is available
Target path: merged.root:/
Target path: merged.root:/HVMSVxToolMonitor
Warning in <TFileMerger::MergeRecursive>: cannot merge object type (n:'TObject', t:'Basic ROOT object') - Merge(TCollection *) not implemented
-bash-3.2$ ls -l user.Gordon.000217.AANT._00522.root
-rw-r--r-- 1 gwatts he_exp 13724 Jul 22 09:59 user.Gordon.000217.AANT._00522.root
-bash-3.2$ ls -l merged.root
-rw-r--r-- 1 gwatts he_exp 13265 Aug 4 08:32 merged.root
-bash-3.2$
So, I don’t see an error message there that tells me why the TTree merge didn’t work. Here is a .ls of the input file:
-bash-3.2$ root -l user.Gordon.000217.AANT._00522.root
root [0]
Attaching file user.Gordon.000217.AANT._00522.root as _file0...
Warning in <TClass::TClass>: no dictionary for class AttributeListLayout is available
Warning in <TClass::TClass>: no dictionary for class pair<string,string> is available
root [1] .ls
TFile** user.Gordon.000217.AANT._00522.root AANT
TFile* user.Gordon.000217.AANT._00522.root AANT
KEY: TDirectoryFile HVMSVxToolMonitor;1 HVMSVxToolMonitor
KEY: AttributeListLayout Schema;1
KEY: TTree LumiMetaData;1 LumiMetaData
KEY: TTree CollectionTree;1 CollectionTree
root [2] .q
The merge mostly succeeded except for:[quote]Warning in TFileMerger::MergeRecursive: cannot merge object type (n:‘TObject’, t:‘Basic ROOT object’) - Merge(TCollection *) not implemented[/quote]which says that one of the objects (maybe the AttributeListLayout object) is not mergeable (hadd assumes more than one input file and wants to know how to merge all the inputs).
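For reference, TFileMerger considers an object mergeable if its class implements a Merge(TCollection*) method; here is a minimal sketch of that interface (the class name and payload are hypothetical, made up for illustration):
[code]
// Sketch: the interface TFileMerger looks for.
// "MyCounter" and its payload are hypothetical.
class MyCounter : public TNamed {
public:
   Long64_t fCount;
   MyCounter() : fCount(0) {}
   // Called with the matching objects from the other input files.
   Long64_t Merge(TCollection *inputs) {
      TIter next(inputs);
      while (TObject *obj = next()) {
         MyCounter *other = dynamic_cast<MyCounter*>(obj);
         if (other) fCount += other->fCount;
      }
      return fCount;
   }
};
[/code]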
Ok… my bad. Turns out that in a directory full of large ROOT files, I chose one that had no events in it!
I ran it on a 300 MB file, and the resulting file is only 179 kB. So I still think there is something wrong, but I need to look at it a bit more carefully…
Ok. I’m not getting the compression factor yet. Is there a way I can tell how much data in each file a TKey references? I want to do an “ls -l” on the TFile. I know how to do this for a TTree (with Print). But how about a directory? I want to know where that 300 MB is being lost so we can fix the original writing code to not waste so much disk space!
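(For anyone searching later: here is a minimal sketch of one way to do this, using the standard TKey size accessors; TFile::Map() will also print every record and gap in the file.)
[code]
// Sketch: an "ls -l" for a TFile/TDirectory, printing each key's
// on-disk (compressed) and in-memory (uncompressed) size.
void lsl(TDirectory *dir) {
   TIter next(dir->GetListOfKeys());
   while (TKey *key = (TKey*)next()) {
      printf("%10d %10d  %s;%d (%s)\n",
             key->GetNbytes(),  // bytes on disk, compressed
             key->GetObjlen(),  // bytes in memory, uncompressed
             key->GetName(), key->GetCycle(), key->GetClassName());
   }
}
[/code]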
Whew. Ok. So the output merge file is corrupt. Man, I thought I was losing my mind. So there must be that attribute list somewhere in my ntuple that is causing the failure (a std::map-style entry). So, I guess this means I can’t optimize this?
In most cases, if the input file is valid, you should be able to re-optimize with hadd. If it fails for a TTree (with almost any content) then it is an (unknown) deficiency and I would need your input file to reproduce it. The rare cases where it fails for a TTree are usually due to event models where ROOT cannot properly guess the ownership of the user objects; in those cases, rather than using hadd, you can use TFileMerger directly and load your libraries.
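A minimal sketch of that TFileMerger route (the library and file names are placeholders for your own):
[code]
// Sketch: merge with TFileMerger after loading the event-model
// dictionaries, so ROOT knows about the user classes.
gSystem->Load("libMyEventModel");  // hypothetical user library
TFileMerger merger;
merger.OutputFile("merged.root");
merger.AddFile("input1.root");
merger.AddFile("input2.root");
merger.Merge();
[/code]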
Okaaayyy… So, first of all, hadd shouldn’t have failed for the TTrees due to that error, right? The TTrees should have merged just fine. But the output is corrupt.
In the output file you can see the corruption by (in 5.28) double-clicking on the “Track_pt” link; it causes ROOT to access-violate.
BTW, I ran hadd out of 5.30, but I’m reading it back in 5.28… could that be a problem? The original root file was written by 5.26 if I remember correctly.
We typed past each other. As far as I know this ntuple contains nothing unique. It does have some 3rd-order arrays, etc. But you can see the warnings that ROOT and hadd print out when they load the file… so…
Indeed, in v5.30/00’s hadd you cannot properly merge just one file, and the -O option was inadvertently disabled.
This is fixed in the trunk and in the v5.30 patch branch.
To work around the problem, you would need to hadd at least two files and force the re-optimization by requesting a different compression level than the original file’s.
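For example (assuming hadd’s -f[0-9] option, where the digit sets the output compression level):
hadd -f6 merged.root input1.root input2.root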
[quote=“pcanal”]Hi Gordon,
To work around the problem, you would need to hadd at least two files and force the re-optimization by requesting a different compression level than the original file’s.
[/quote]
Thanks. It now seems to be making a real effort to run. I’ll continue the thread if I have any problems with the results! Thanks for the fix!
I’ve been experiencing the same problem while merging, through hadd, the 3 ROOT files in the dataset:
mc11_7TeV.144084.Herwigpp_GGM_gl_neut_500_150_susy.merge.NTUP_SUSY.e1004_s1372_s1370_r3043_r2993_p832.root
There were no errors while downloading the dataset with dq2-get.
The sum of the sizes of the 3 ROOT files I want to merge matches my dataset size (as given by a dq2-ls -f query).
[quote]- My output rootfile is 3,134.516 MB instead of 3,137.434 MB[/quote]By itself, this is not necessarily a problem (the difference could be accounted for by better compression and/or fewer gaps in the file).
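One quick sanity check is to compare the total number of entries in the inputs with the merged file (the tree and file names below are placeholders; adjust for your dataset):
[code]
// Sketch: the merged tree's entry count should equal the sum of
// the input trees' entry counts. Names are placeholders.
const char *inputs[] = {"input1.root", "input2.root", "input3.root"};
Long64_t sum = 0;
for (int i = 0; i < 3; ++i) {
   TFile *f = TFile::Open(inputs[i]);
   sum += ((TTree*)f->Get("CollectionTree"))->GetEntries();
   f->Close();
}
TFile *out = TFile::Open("merged.root");
Long64_t merged = ((TTree*)out->Get("CollectionTree"))->GetEntries();
printf("sum of inputs: %lld   merged: %lld\n", sum, merged);
[/code]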