TFile compression - how good is it?

I have a PAW ntuple with a lot of small integers (for example 0 < adc < 2000) that could easily fit into a Short_t (which goes up to 32767). But when I converted the ntuple into a ROOT file, I ended up with a bunch of integers.

Here is my question:

is TFile compression smart enough to reduce the amount of disk space needed for those integers, or

should I hack h2root to manually convert the type of the appropriate variables from integer to Short_t?

I have tried setting the compression level to 9, but the type did not change.

I would imagine that h2root should have an option to downgrade the types of variables if their range allows.

Of course, if one has many ntuples (from the same data set, though) to convert, one wouldn't want the same variable to have different types in different parts of the data. But that could also be taken care of by specifying somehow which variables to downgrade and which not.

Again, all of this is irrelevant if the current compression algorithms are smart enough. Are they?

If your variables are integers, the compression algorithm will compress them efficiently. No need to convert them to short.
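The point above — that 4-byte integers holding small values compress well — can be illustrated with a minimal sketch. Values in 0..2000 use only the low 11 bits, so at least half the bytes in the raw buffer are zero, which is exactly the redundancy a deflate-style compressor exploits. This helper (an illustration, not part of h2root) measures that redundancy:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Fraction of bytes that are zero in a buffer of 4-byte ints.
// For values in 0..2000 only the low 11 bits are ever set, so at
// least two of every four bytes are zero -- long zero runs that a
// deflate-style compressor squeezes out almost for free.
double zero_byte_fraction(const std::vector<int32_t>& values) {
    if (values.empty()) return 0.0;
    std::size_t zeros = 0;
    for (int32_t v : values) {
        unsigned char bytes[4];
        std::memcpy(bytes, &v, sizeof bytes);
        for (unsigned char b : bytes)
            if (b == 0) ++zeros;
    }
    return static_cast<double>(zeros) / (values.size() * 4);
}
```

For a buffer of ADC-like values the function reports a zero-byte fraction of at least 0.5, which is why compressed file sizes end up far below 4 bytes per value even without changing the branch type.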


Would conversion from integer to short improve speed noticeably?

Has anyone thought about conversion of variable types in h2root?

I could try hacking h2root and compare performance but good advice beforehand would be appreciated.

Thanks for the reply, Rene.

Going from int to short will not improve the I/O performance.
It may improve memory management and execution speed
when you navigate large data structures of shorts instead of ints.

What is the problem with h2root? Are the variable declarations in the
original Hbook ntuple wrong?


The original ntuples contain integers because hbook does not support
a short type at all; the smallest integer is 4 bytes long. For integers with a small range (0-2000) I can specify the number of bits to use when packing into the file, but that's it — I cannot specify the type to be short. h2root converts integers into integers irrespective of how many bits are used in the original ntuple.

I wish h2root would look at how many bits are actually used for a particular integer and, if it is smaller than 16, use short instead of the integer type.
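The rule being asked for can be sketched in a few lines. This is a hypothetical helper (the names `bits_needed` and `narrowest_type` are illustrative, not actual h2root code) mapping a variable's known maximum value to the narrowest ROOT type:

```cpp
#include <string>

// How many bits are needed to represent max_value (e.g. 2000 -> 11).
int bits_needed(unsigned int max_value) {
    int bits = 1;
    while ((max_value >>= 1) != 0) ++bits;
    return bits;
}

// Pick the narrowest ROOT branch type for that bit count:
// the downgrade rule the poster wishes h2root applied.
std::string narrowest_type(int bits) {
    if (bits <= 8)  return "Char_t";   // fits in 1 byte
    if (bits <= 16) return "Short_t";  // fits in 2 bytes
    return "Int_t";                    // keep the full 4-byte integer
}
```

With an ADC range of 0..2000 this yields 11 bits and hence `Short_t`, which matches the conversion eventually implemented later in this thread.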

I don't know if this information can be extracted from the hbook ntuple without scanning the variables in the whole file. If it is not possible for h2root to extract the packing information from the hbook ntuple, then I would like to be able to specify a
list of variables for which the type override (conversion) has to be done.

I am trying to improve the performance of my application. I have events consisting mostly of integers in 0-2000, and if I could get those integers replaced with shorts, I would expect some speedup. Maybe negligible, but I don't know.

One could imagine modifying h2root to convert the integers declared to fit in 8 bits to a char type, and those fitting in 16 bits
to a short type. However, this would not be backward compatible;
an option to h2root could be added. The gain in space will be NULL,
because the compression algorithm is far more clever than the old packing technique in Hbook/PAW, which consisted of packing small integers
into a few bits.
As I said in my previous mail, you can gain some performance in memory
if most of your arrays fit in a short: you will optimize the use of the cache.
If you want to add the option to h2root, I will add your code to it.
I have no time now to help implement this feature.
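Rene's cache argument can be made concrete with a tiny sketch. Assuming a typical 64-byte cache line (an assumption — the size is architecture-dependent), halving the element width doubles how many array elements each memory fetch brings in:

```cpp
#include <cstddef>

// Assumed cache-line size; 64 bytes is typical on x86, but this
// is architecture-dependent and only illustrative.
constexpr std::size_t kCacheLine = 64;

// Elements of a given size that fit in one cache line.
// short arrays pack twice as many elements per line as int arrays,
// which is the cache benefit Rene describes.
constexpr std::size_t elems_per_line(std::size_t elem_size) {
    return kCacheLine / elem_size;
}
```

With 4-byte ints each line holds 16 elements; with 2-byte shorts, 32. When scanning large arrays sequentially, that halves the number of cache misses, independent of any disk-space gain.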


I have modified h2root to put integers of <= 16 bits into shorts
for CWN ntuples. hntvar2.f needed one more argument for that.

A test on my data ntuple gave me 10% savings in disk space, and therefore in speed as well, since I need to do less I/O now.

Attached are h2root.cxx and hntvar.f.

Replace those in your ROOT installation, recompile, and use the new option
[optcwn] to turn on the savings.

A little puzzling, though: why is the compression of integers not as good as Rene expected in his previous reply?

For interested folks : there is also ntuple that I played with.
optcwn.tar.gz (1.36 MB)

Thanks for implementing this optimisation. I have put your code in CVS.
Note that I had to modify the class THbookFile that calls hntvar2.
The result of the exercise is the following with your file:

file.hbook 3284992 bytes
file.root 1977073 bytes (before your change)
file.root 1823861 bytes (with your changes).

so a gain of 8.4 per cent


Note that if you increase the buffer size, you can still gain a bit,
e.g. with buffers of 32000 bytes:
h2root file.hbook file.root 1 1 0 32000
you generate
file.root 1736726 bytes instead of 1823861


Hi, it is not as simple as I thought.

The conversion is wrong because the branch thinks that the elements of bigbuf[] are integers, not shorts; I am working on a fix. I am also adding support for int -> char conversion. Rene, please make optcwn = 0 the default for now.

People might get bitten, since it is really not backward compatible and is wrong right now.

I guess I will make temporary buffers of chars and shorts specifically for the converted variables and fill them with the type-converted values before feeding them into TBranch.
I will check the data carefully (I should have done that initially) before posting the next fix.

Sorry for the mishap, Khamit

I am a little puzzled here.
It seems I don't see any error at all, but the data is corrupted.
bigbuf is not to blame, since it is already an array of chars.
Perhaps TBranch does not handle shorts and chars properly.

OK, it is bigbuf that needs to be tweaked:
hbook writes 4-byte integers into the arrays, but TBranch reads 2-byte shorts or 1-byte chars. So TBranch makes 2 shorts out of 1 integer.

I will make temporary buffers of shorts or chars, and then it should be fine.
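The fix described here — copying the 4-byte HBOOK values into a separate, correctly typed buffer before handing them to the branch — can be sketched as follows (the function name is illustrative, not the actual h2root code):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// HBOOK hands back 4-byte integers, but the branch now expects
// 2-byte shorts.  Reading shorts straight out of the int buffer
// yields 2 shorts per integer (the bug above); instead, copy each
// value into a buffer of the right element size first.
std::vector<int16_t> to_short_buffer(const int32_t* src, std::size_t n) {
    std::vector<int16_t> dst(n);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<int16_t>(src[i]);  // value known to fit in 16 bits
    return dst;
}
```

The branch address would then point at the temporary buffer rather than at the raw HBOOK array, so the element stride seen by the branch matches the declared short type.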

Don't use optcwn in the h2root currently in CVS!

Fixed h2root.cxx - it works fine now.

Integers fitting in 1 byte go into char,
those fitting in 2 bytes into short.

The branch buffer size is set to 64K - it really makes the files smaller.

All in all, for my data (mostly small integers), a 15% saving in disk space and I/O.

Does anybody know how to easily salvage bits as well, not just bytes?
There must be a way to reduce the file size even more, since I know the range of the variables, like in the ntuples.
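ROOT branches have no sub-byte types, so "salvaging bits" would mean packing values manually, much like HBOOK's own bit packing did. A minimal sketch of the idea, assuming a known bit width per value (11 bits suffices for 0..2000) — purely an illustration, not something h2root or TBranch provides:

```cpp
#include <cstdint>
#include <vector>

// Pack unsigned values of a fixed bit width back to back into a
// byte stream, least-significant bits first -- the same idea as
// HBOOK's packing of small integers into a few bits.
std::vector<uint8_t> pack_bits(const std::vector<uint32_t>& vals, int width) {
    std::vector<uint8_t> out;
    uint64_t acc = 0;   // bit accumulator
    int nbits = 0;      // bits currently held in acc
    for (uint32_t v : vals) {
        acc |= static_cast<uint64_t>(v) << nbits;  // append above existing bits
        nbits += width;
        while (nbits >= 8) {                       // flush complete bytes
            out.push_back(static_cast<uint8_t>(acc & 0xFF));
            acc >>= 8;
            nbits -= 8;
        }
    }
    if (nbits > 0)                                 // flush the partial tail byte
        out.push_back(static_cast<uint8_t>(acc & 0xFF));
    return out;
}
```

Eight 11-bit values fit in exactly 11 bytes instead of 32 as ints or 16 as shorts. The trade-off is that the generic compressor can often recover much of this gain on its own, and bit-packed branches lose the ability to be read back directly as typed arrays.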
h2root.cxx (24.6 KB)

I have processed more data sets and found that
the new h2root is, first of all, correct (it gives the same results) and also makes the files
considerably smaller. One old file went from 1.8 GB to 1.4 GB. Incredible.
Of course, these savings only come into play for data with a lot of small-integer arrays.

One should really use branches of shorts and chars instead of integers to save disk space. Compression of integers (even at compression level 9) does not seem to be as good as it could be.

It seems that one could revisit the compression algorithms and make ROOT files even smaller.

Enjoy. Khamit