Compression of empty std::vec in TTree

fjonasALICE · February 7, 2023, 3:17pm

Dear experts,

I use a TTree to store skimmed collision events for my physics analysis, where each entry of the tree corresponds to one event. Two types of branches are used. The first type is a flat branch that contains only one float per event, e.g.:

fAnalysisTree->Branch("Event_Rho", &fBuffer_EventRho,"Event_Rho/F");

In addition, there are branches that contain std::vectors, where each entry of the vector corresponds to one particle from the event, e.g. for the energy:

fAnalysisTree->Branch("Cluster_E","std::vector<Float_t>",&fBuffer_ClusterE);

My code works as intended; however, I have a question regarding the compression. Since I do quite a bit of pre-selection, most of the events will be empty, i.e. there will always be something stored for the event (e.g. the aforementioned rho variable), but usually all the std::vector<Float_t> will be of size 0. To give you an example, running over 160k tree entries, only 64 events will have vectors with size() != 0.

The problem I encounter is this: The size of the tree gets quite large. While everything is as expected for the flat Float_t branch, I think even for the cases where the vectors have size()=0, the tree has to store the empty std::vector structure, which takes up more space than a simple float.

Do you see any way to make the storage more efficient, i.e. to deal with the vast number of entries where all the std::vector branches will be empty? Is there a way to “tell” the TTree to compress empty vectors in some way? Can I maybe simply store a nullptr if the vectors are empty and deal with this later when reading the tree? Should I get rid of std::vectors in my tree entirely?

Sorry for my ignorant questions, but maybe some of you can offer some hints for optimization. Currently the way I store things seems highly inefficient to me. I am happy to provide further examples if needed.

Best wishes,
Florian

ROOT Version: 6.26/06
Platform: manjaro
Compiler: gcc

pcanal · February 7, 2023, 3:39pm

If the vector is empty, it should just store a zero and some control bits, and these should compress well. Can you quantify what you are seeing?

fjonasALICE · February 7, 2023, 4:20pm

Hi!

Interesting, this is what I would have expected. Yes, I can quantify. For example, for a tree with 161926 entries, I get the following for a flat float that is stored once per entry:

*Br 0 :Event_Rho : Event_Rho/F *
*Entries : 161926 : Total Size= 650347 bytes File Size = 56899 *
*Baskets : 20 : Basket Size= 32000 bytes Compression= 11.25 *
…

It compresses quite well because this quantity can also sometimes be 0. For reference, here is another quantity, stored as a double, that is 0 less often:

*Br 4 :Event_ZVertex : Event_ZVertex/D *
*Entries : 161926 : Total Size= 1300383 bytes File Size = 496869 *
*Baskets : 40 : Basket Size= 32000 bytes Compression= 2.58 *

Take the ZVertex branch, with a file size of 496869, as the reference. This is the size of the branch that contains the std::vector:

*Br 7 :Cluster_E : vector<float> *
*Entries : 161926 : Total Size= 2277992 bytes File Size = 252636 *
*Baskets : 91 : Basket Size= 32000 bytes Compression= 8.97 *

If I look at this branch in TBrowser, it has only about 13 entries! So in total only 13 particles are found in all events. However, even though most of the vectors are of size 0, the branch still takes 252636 bytes in the file, which is about half of the amount stored for the Double_t branch that has 161k entries. I would have expected a much smaller file size overall. But you are correct, some compression seems to be happening (according to the compression column).
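As a quick cross-check of the numbers quoted above, here is a minimal sketch in plain C++ (no ROOT needed). The helper name `sizeRatio` and the constant names are made up for illustration; the values are copied from the Print() output in this post.

```cpp
#include <cassert>

// Hypothetical helper: ratio of two byte counts (or bytes per entry).
double sizeRatio(long long numerator, long long denominator) {
    assert(denominator != 0);
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Values copied from the TTree::Print() output quoted in this post.
constexpr long long kEntries          = 161926;
constexpr long long kZVertexFileSize  = 496869;  // Event_ZVertex/D
constexpr long long kClusterEFileSize = 252636;  // Cluster_E (std::vector)

// sizeRatio(kClusterEFileSize, kZVertexFileSize) is about 0.51 ("about half")
// sizeRatio(kClusterEFileSize, kEntries) is about 1.56 bytes on disk per entry
```

So the mostly empty vector branch still costs roughly one and a half bytes on disk per entry after compression.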

Best regards,
Florian

pcanal · February 7, 2023, 5:24pm

This is roughly the number I expect, and it is indeed not the most efficient. Note, however, that the size takes 4 bytes, the same as a float, so the bare minimum would be 650347 bytes uncompressed, plus the data for the non-empty vectors.
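To put a number on the gap between the vector branch and that bare minimum, here is a small sketch in plain C++. The struct and function names are hypothetical; the inputs are the uncompressed "Total Size" values from the Print() output quoted earlier in the thread.

```cpp
#include <cassert>

// Hypothetical result type: extra uncompressed bytes, in total and per entry.
struct Overhead {
    long long totalBytes;
    double bytesPerEntry;
};

// Compare a branch's uncompressed size with a reference minimum.
Overhead overheadAboveMinimum(long long branchTotal, long long minimumTotal,
                              long long entries) {
    assert(entries > 0);
    const long long extra = branchTotal - minimumTotal;
    return {extra, static_cast<double>(extra) / static_cast<double>(entries)};
}

// overheadAboveMinimum(2277992, 650347, 161926)
//   totalBytes    = 1627645
//   bytesPerEntry is about 10.05
```

In other words, before compression the vector branch carries roughly 10 bytes per entry on top of the 4-byte size.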

To reduce the size further, you can either try switching to the new RNTuple or use a C-array instead, i.e.:

// This removes some redundant information from the on-file representation of the TTree and TBranch.
ROOT::TIOFeatures features;
features.Set(ROOT::Experimental::EIOFeatures::kGenerateOffsetMap);
fAnalysisTree->SetIOFeatures(features);

Int_t Cluster_N = 0;
// Avoid reallocating this buffer during processing; if you do, you need to inform the branch.
Float_t *Cluster_E = new Float_t[max_number_of_element];
fAnalysisTree->Branch("Cluster_N", &Cluster_N, "Cluster_N/I");
// With a leaflist, pass the address of the first element, i.e. the pointer itself and not &Cluster_E.
fAnalysisTree->Branch("Cluster_E", Cluster_E, "Cluster_E[Cluster_N]/F");

This is particularly helpful if you have several collections of the same size (Cluster_N), because this technique avoids replicating the size inside the branch of each vector.
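If the per-event values are still collected in a std::vector, they have to be copied into the fixed buffer before each Fill(). A minimal sketch of such a copy step in plain C++ follows; the helper name `fillFixedBuffer` is hypothetical, and silently dropping elements beyond the buffer capacity is an assumption made here for brevity (an analysis may prefer to warn or abort instead).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy one event's values into the preallocated buffer
// that the branch reads from, and return the number of elements copied.
// Elements beyond `capacity` are dropped so the buffer is never overrun.
int fillFixedBuffer(const std::vector<float>& values, float* buffer,
                    std::size_t capacity) {
    assert(buffer != nullptr);
    const std::size_t n = std::min(values.size(), capacity);
    std::copy_n(values.begin(), n, buffer);
    return static_cast<int>(n);
}
```

Per event, this would be used along the lines of Cluster_N = fillFixedBuffer(fBuffer_ClusterE, Cluster_E, max_number_of_element); followed by fAnalysisTree->Fill();, assuming fBuffer_ClusterE is the std::vector<Float_t> from the original post.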

fjonasALICE · February 7, 2023, 8:42pm

Thanks for the prompt reply! I will use your suggestion in the future; it is indeed quite straightforward. For me the question is solved. Just a few small questions for my own understanding: I do understand the savings using your approach, where the size of the collection only needs to be stored once in one branch of the tree and not duplicated. For the cases where Cluster_N is zero, will this also lead to additional savings because the tree does not need to “worry” about the metadata and will just put an empty array in all the following branches for which it already knows the size is 0? In particular, will the empty array take up less space than the empty vector?

Second question: Why are the experimental IO features needed?

Thanks so much for your help
Florian

pcanal · February 7, 2023, 9:04pm

For historical reasons, without the experimental IO feature the information (the size of the array) is actually duplicated in the basket’s meta-data.

"will this also lead to additional savings because the tree does not need to “worry” about the metadata and will just put an empty array in all the following branches for which it already knows the size is 0"

Actually, both savings apply whether the array is empty or not: avoiding the duplication of the array size between basket and data (experimental mode), and avoiding the additional meta-data woven into the data (in essence, the vector is being treated as an object rather than as raw numbers).
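The effect of storing the size once instead of once per collection can be illustrated with a deliberately simplified size model in plain C++. This is only a sketch: it counts 4 bytes per size field and per Float_t, and it ignores the per-object meta-data and the basket meta-data mentioned above, so the real difference is larger. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: uncompressed payload in bytes for the two layouts.
struct LayoutSizes {
    std::uint64_t perVectorSizes;  // every collection stores its own size per entry
    std::uint64_t sharedCounter;   // one counter branch shared by all collections
};

// Simplified model: `collections` same-sized collections over `entries` entries,
// each holding `elementsPerCollection` floats in total across the whole tree.
LayoutSizes estimatePayload(std::uint64_t entries, std::uint64_t collections,
                            std::uint64_t elementsPerCollection) {
    assert(collections > 0);
    const std::uint64_t kWord = 4;  // bytes in a 32-bit size field or a Float_t
    const std::uint64_t data = collections * elementsPerCollection * kWord;
    return {collections * entries * kWord + data, entries * kWord + data};
}
```

For example, with the 161926 entries and about 13 stored particles from this thread, and an assumed 5 collections (this count is made up for illustration), estimatePayload(161926, 5, 13) gives 3238780 bytes with per-vector sizes versus 647964 bytes with a shared counter.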
