Hi,
I have 100 files that are each about 300 MB. They were written with (I think) default options under ROOT 5.26. Watching my read-back performance, I can’t help but wonder if there is some file optimization I can do.
For example, I could re-copy them and optimize the basket size, and concatenate them together into 2 GB files (or larger; I’m not limited to the 2 GB size for what I’m doing).
Is there a simple way to do this? To optimize the basket layout/sizes in the files?
Thanks! This information may be available in other places, so I apologize if that is the case - just point me to where I can search. As you can tell, I’m experimenting with things that will make my analysis as fast as possible.
Yes, you could rewrite the file with v5.30 (or at least v5.28), which will write files that work better with the TTreeCache thanks to the new concept of basket clustering (which means that, for a given range of entries, the TTreeCache can read everything it needs in exactly one read). To do so, simply use hadd -O:
hadd -O output.v530.root input.root
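To show the read-back side, here is a minimal sketch of enabling the TTreeCache when reading the rewritten file, assuming TTree::SetCacheSize and TTree::AddBranchToCache as in recent ROOT versions (the file and tree names are placeholders for your own):
[code]
// Sketch: read back with a TTreeCache so that the clustered
// baskets for a range of entries are fetched in large reads.
TFile *f = TFile::Open("output.v530.root");
TTree *tree = (TTree*)f->Get("CollectionTree");
tree->SetCacheSize(30*1024*1024);    // 30 MB cache
tree->AddBranchToCache("*", kTRUE);  // cache all branches
for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
   tree->GetEntry(i);
   // ... analysis ...
}
[/code]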
This failed, but not for a reason I can understand. I’m running with v5.30. Here is the command line stuff:
-bash-3.2$ hadd -O -f merged.root user.Gordon.000217.AANT._00522.root
Target file: merged.root
Source file 1: user.Gordon.000217.AANT._00522.root
Warning in <TClass::TClass>: no dictionary for class AttributeListLayout is available
Warning in <TClass::TClass>: no dictionary for class pair<string,string> is available
Target path: merged.root:/
Target path: merged.root:/HVMSVxToolMonitor
Warning in <TFileMerger::MergeRecursive>: cannot merge object type (n:'TObject', t:'Basic ROOT object') - Merge(TCollection *) not implemented
-bash-3.2$ ls -l user.Gordon.000217.AANT._00522.root
-rw-r--r-- 1 gwatts he_exp 13724 Jul 22 09:59 user.Gordon.000217.AANT._00522.root
-bash-3.2$ ls -l merged.root
-rw-r--r-- 1 gwatts he_exp 13265 Aug 4 08:32 merged.root
-bash-3.2$
So, I don’t see an error message there that tells me why the TTree merge didn’t work. Here is a .ls of the input file:
-bash-3.2$ root -l user.Gordon.000217.AANT._00522.root
root [0]
Attaching file user.Gordon.000217.AANT._00522.root as _file0...
Warning in <TClass::TClass>: no dictionary for class AttributeListLayout is available
Warning in <TClass::TClass>: no dictionary for class pair<string,string> is available
root [1] .ls
TFile** user.Gordon.000217.AANT._00522.root AANT
TFile* user.Gordon.000217.AANT._00522.root AANT
KEY: TDirectoryFile HVMSVxToolMonitor;1 HVMSVxToolMonitor
KEY: AttributeListLayout Schema;1
KEY: TTree LumiMetaData;1 LumiMetaData
KEY: TTree CollectionTree;1 CollectionTree
root [2] .q
The merge mostly succeeded except for:[quote]Warning in TFileMerger::MergeRecursive: cannot merge object type (n:‘TObject’, t:‘Basic ROOT object’) - Merge(TCollection *) not implemented[/quote]which says that one of the objects (maybe the AttributeListLayout object) is not mergeable (hadd assumes more than one input file and wants to know how to merge all the inputs).
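For reference, TFileMerger considers an object mergeable if its class implements a Merge(TCollection*) method; here is a minimal sketch of that interface (the class name and payload are hypothetical, made up for illustration):
[code]
// Sketch: the interface TFileMerger looks for.
// "MyCounter" and its payload are hypothetical.
class MyCounter : public TNamed {
public:
   Long64_t fCount;
   MyCounter() : fCount(0) {}
   // Called with the matching objects from the other input files.
   Long64_t Merge(TCollection *inputs) {
      TIter next(inputs);
      while (TObject *obj = next()) {
         MyCounter *other = dynamic_cast<MyCounter*>(obj);
         if (other) fCount += other->fCount;
      }
      return fCount;
   }
};
[/code]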
Ok… my bad. Turns out that in a directory full of large ROOT files, I chose one that had no events in it!
I ran it on a 300 MB file, and the resulting file is only 179 kB. So I still think there is something wrong, but I need to look at it a bit more carefully…
Ok. I’m not getting the compression factor yet. Is there a way I can tell how much data in each file a TKey references? I want to do an “ls -l” on the TFile. I know how to do this for a TTree (with Print). But how about a directory? I want to know where that 300 MB is being lost so we can fix the original writing code to not waste so much disk space!
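(For anyone searching later: here is a minimal sketch of one way to do this, using the standard TKey size accessors; TFile::Map() will also print every record and gap in the file.)
[code]
// Sketch: an "ls -l" for a TFile/TDirectory, printing each key's
// on-disk (compressed) and in-memory (uncompressed) size.
void lsl(TDirectory *dir) {
   TIter next(dir->GetListOfKeys());
   while (TKey *key = (TKey*)next()) {
      printf("%10d %10d  %s;%d (%s)\n",
             key->GetNbytes(),  // bytes on disk, compressed
             key->GetObjlen(),  // bytes in memory, uncompressed
             key->GetName(), key->GetCycle(), key->GetClassName());
   }
}
[/code]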
Whew. Ok. So the output merge file is corrupt. Man, I thought I was losing my mind. So there must be that attribute list somewhere in my ntuple that is causing the failure (a std::map-style entry). So, I guess this means I can’t optimize this?
In most cases, if the input file is valid, you should be able to re-optimize with hadd. If it fails for a TTree (with almost any content) then it is an (unknown) deficiency and I would need your input file to reproduce it. The rare cases where it fails for a TTree are usually due to event models where ROOT cannot properly guess the ownership of the user objects; in those cases, rather than using hadd, you can use TFileMerger directly and load your libraries.
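A minimal sketch of that TFileMerger route (the library and file names are placeholders for your own):
[code]
// Sketch: merge with TFileMerger after loading the event-model
// dictionaries, so ROOT knows about the user classes.
gSystem->Load("libMyEventModel");  // hypothetical user library
TFileMerger merger;
merger.OutputFile("merged.root");
merger.AddFile("input1.root");
merger.AddFile("input2.root");
merger.Merge();
[/code]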
Okaaayyy… So, first of all, hadd shouldn’t have failed for the TTrees due to that error, right? The TTrees should have merged just fine. But the output is corrupt.
In the output file you can see the corruption by (in 5.28) double-clicking on the “Track_pt” link; it causes ROOT to access-violate.
BTW, I ran hadd out of 5.30, but I’m reading it back in 5.28… could that be a problem? The original root file was written by 5.26 if I remember correctly.
We typed past each other. As far as I know this ntuple contains nothing unique. It does have some 3rd-order arrays, etc. But you can see the warnings that ROOT and hadd print out when they load the file… so…
Indeed, in v5.30/00’s hadd you cannot properly merge just one file, and the -O option was inadvertently disabled.
This is fixed in the trunk and in the v5.30 patch branch.
To work around the problem, you would need to hadd at least two files and force the re-optimization by requesting a different compression level than the original file’s.
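For example (assuming hadd’s -f[0-9] option, where the digit sets the output compression level):
hadd -f6 merged.root input1.root input2.root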
[quote=“pcanal”]Hi Gordon,
To work around the problem, you would need to hadd at least two files and force the re-optimization by requesting a different compression level than the original file’s.
[/quote]
Thanks. It now seems to be making a real effort to run. I’ll continue the thread if I have any problems with the results! Thanks for the fix!
I’ve been experiencing the same problem while merging, through hadd, the 3 ROOT files in the dataset:
mc11_7TeV.144084.Herwigpp_GGM_gl_neut_500_150_susy.merge.NTUP_SUSY.e1004_s1372_s1370_r3043_r2993_p832.root
There were no errors while downloading the dataset with dq2-get.
The sum of the sizes of the 3 ROOT files I want to merge matches my dataset size (as given by a dq2-ls -f query).
[quote]- My output rootfile is 3,134.516 MB instead of 3,137.434 MB[/quote]By itself, this is not necessarily a problem (the difference could be accounted for by better compression and/or fewer gaps in the file).
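One quick sanity check is to compare the total number of entries in the inputs with the merged file (the tree and file names below are placeholders; adjust for your dataset):
[code]
// Sketch: the merged tree's entry count should equal the sum of
// the input trees' entry counts. Names are placeholders.
const char *inputs[] = {"input1.root", "input2.root", "input3.root"};
Long64_t sum = 0;
for (int i = 0; i < 3; ++i) {
   TFile *f = TFile::Open(inputs[i]);
   sum += ((TTree*)f->Get("CollectionTree"))->GetEntries();
   f->Close();
}
TFile *out = TFile::Open("merged.root");
Long64_t merged = ((TTree*)out->Get("CollectionTree"))->GetEntries();
printf("sum of inputs: %lld   merged: %lld\n", sum, merged);
[/code]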