Compression in memory?

Hi,

I have a data class, myData, which inherits from TObject. I write my data to a tree and store it in a ROOT file, which takes up ~250MB of disk space.

I need to access the data in a very random pattern, so I can’t use a TTree directly, since TTree::GetEntry is very slow at random access. Instead I use a map, i.e. std::map<uint64_t, myData>, to hold the data in memory - this also gives very fast random access.
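Roughly, the pattern is something like this (the key member, file, tree, and branch names here are just illustrative):

```cpp
#include <cstdint>
#include <map>
#include "TFile.h"
#include "TTree.h"

std::map<uint64_t, myData> fillCache() {
   TFile f("data.root");                 // illustrative file name
   TTree *tree = (TTree*)f.Get("T");     // illustrative tree name
   myData *d = nullptr;
   tree->SetBranchAddress("data", &d);   // illustrative branch name

   // One sequential pass (fast for a TTree); afterwards all random
   // access goes through the returned map.
   std::map<uint64_t, myData> cache;
   for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
      tree->GetEntry(i);
      cache[d->key] = *d;                // assumes myData is copyable and has a key
   }
   return cache;
}
```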

The problem is that when I load up my map I use about 10GB of memory. So we can safely say that the built-in compression for TTrees is doing a fantastic job :slight_smile:

So finally my question: does ROOT provide a method I could use to take advantage of those nice compression routines for my data in memory? I would instead have a map<uint64_t, compressedData>. I assume that each time I access a compressed object I would have to decompress it before I could do anything with it.

thanks for any suggestions

Peter

Hi Peter,

[quote]Does ROOT provide a method I could use to take advantage of those nice compression routines for my data in memory?[/quote]ROOT only provides the lower level parts that would be needed (and the higher level uses of them (TTree/TFile)). To see how to use R__zip and R__unzip, take a look at the implementation of TBasket. However, I do not recommend this, as you would end up duplicating most of that functionality.
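For illustration, an untested sketch of that approach: serialize each object with TBufferFile, compress the bytes with R__zip, and reverse the process on access. The fallback for incompressible data and the chunking that TBasket does for large buffers are glossed over here:

```cpp
#include <vector>
#include "TBufferFile.h"
#include "RZip.h"   // R__zip, R__unzip, R__unzip_header

// Serialize one object and compress its bytes (as TBasket does internally).
std::vector<char> compress(myData &obj) {
   TBufferFile wbuf(TBuffer::kWrite);
   wbuf.WriteObject(&obj);                // stream the object into memory

   int srcsize = wbuf.Length();
   int tgtsize = srcsize;                 // R__zip reports failure if it cannot shrink
   std::vector<char> zipped(tgtsize);
   int nout = 0;
   R__zip(1 /*compression level*/, &srcsize, wbuf.Buffer(),
          &tgtsize, zipped.data(), &nout);
   if (nout == 0) { /* incompressible: store the raw bytes instead (not shown) */ }
   zipped.resize(nout);
   return zipped;
}

// Decompress and unstream on access.
myData *decompress(std::vector<char> &zipped) {
   int srcsize = (int)zipped.size();
   int tgtsize = 0;
   unsigned char *src = (unsigned char*)zipped.data();
   R__unzip_header(&srcsize, src, &tgtsize);   // recover the uncompressed size
   std::vector<char> raw(tgtsize);
   int nout = 0;
   R__unzip(&srcsize, src, &tgtsize, (unsigned char*)raw.data(), &nout);
   TBufferFile rbuf(TBuffer::kRead, nout, raw.data(), kFALSE /*do not adopt*/);
   return (myData*)rbuf.ReadObject(myData::Class());
}
```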

Another way to get this behavior from a TTree is to use a combination of SetAutoFlush(1) [at write time] and LoadBaskets.

Calling SetAutoFlush(1) on the TTree when writing it ensures that each basket contains only one entry (and thus that you decompress only one entry at a time) [this comes at an increased cost in size: less compression and more baskets in total].

Calling LoadBaskets on the TTree when reading will make sure that all the compressed data is loaded into memory, and is decompressed (and unstreamed) only when GetEntry(id) is called.
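A rough sketch of both sides of this recipe (file, tree, and branch names are made up):

```cpp
#include "TFile.h"
#include "TTree.h"

void writeOneEntryPerBasket() {
   TFile fout("data.root", "RECREATE");
   TTree tree("T", "one entry per basket");
   myData *d = new myData;
   tree.Branch("data", &d);
   tree.SetAutoFlush(1);     // flush after every entry: one entry per basket
   // ... fill loop: update *d, then tree.Fill(); ...
   tree.Write();
}

void readRandomly(Long64_t id) {
   TFile fin("data.root");
   TTree *t = (TTree*)fin.Get("T");
   myData *rd = nullptr;
   t->SetBranchAddress("data", &rd);
   t->LoadBaskets();         // read every (compressed) basket into memory up front
   t->GetEntry(id);          // decompresses and unstreams only this one entry
}
```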

Cheers,
Philippe.

Hi Philippe,

thanks for the reply - but in this solution I am still using a TTree, right, so random access performance will be poor? I understand TTrees are optimized for sequential access…

thanks
Peter

Hi,

The ‘reasons’ that TTree has poor random access performance are removed by the combination of SetAutoFlush(1) and LoadBaskets (but at other costs).

The main reason for ‘poor’ random access performance in a TTree is that a random read in the usual case implies:
a) dropping the in-memory basket (assuming it is not the basket we need)
b) reading the basket from disk
c) uncompressing the basket that contains the requested entry (and many more entries)
d) unstreaming the data/object.

In the usual case there are many entries in each basket, and thus in a random read pattern you would read each basket from disk and decompress it many times.

By reducing the number of entries in a basket to one (SetAutoFlush(1)), you reduce the amount of duplicated work to zero. However, the cost is a (possibly) dramatic increase in file size, due to the increase in the number of baskets and the reduction in compression factor. You should really only use this if the size of each entry (in most branches) is quite large (i.e. you may also have to decrease the split level), for example as shown below.
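For example, a lower split level can be requested when creating the branch (a hedged illustration; 32000 is the default buffer size):

```cpp
// splitlevel 0 stores the whole object unsplit in a single branch,
// keeping the per-entry payload of that branch large.
tree.Branch("data", &d, 32000 /*bufsize*/, 0 /*splitlevel*/);
```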

LoadBaskets makes sure that every single basket is loaded into memory at the beginning, bunching all the I/O at the same time (the start) and implementing what you were describing [it is also needed to avoid duplicate I/O when using the TTreeCache, which reads many baskets at once in order to improve I/O performance in the sequential case].

Cheers,
Philippe.

Hi Philippe,

ok thanks for the explanation - I am going to experiment to see what difference it makes.

cheers
Peter

Hmm - ok, I tried it, but after an hour it was still writing out the tree and had used several times more disk space. I might have a look at the other unzip idea.

thanks again