TTree basket size inflating and exhausting memory

LucaC · April 22, 2021, 11:04pm

Dear experts,

I am running code that loops over a TTree, selects entries and builds a second TTree with new branches.
I noticed that, after about 10 minutes of running, the memory consumption abruptly increases and makes the execution fail on batch systems where memory resources are limited.
The attached plots show the profiled memory usage, and how this behaviour goes away if I comment out the tree->Fill() instruction in the code.

It seems to me that this is due to an abrupt increase in the basket size of each of the ~200 branches of the output TTree. I checked this with an example printout using tree->GetBranch("Run")->GetBasketSize(); the result is pasted below [*].
So somehow the basket size changes from 32000 to 7185920 (x200!), filling up the available memory.

I can try to work out a minimal example to try to reproduce the effect, but I wanted to ask you if you could point me to possible causes that can trigger such an effect - having an idea of what causes the basket resize, and making sure that this impacts the memory usage, would greatly help debugging this issue.

Thanks for your help!
Cheers,
Luca

[*]

... processing event 3380000
... tree basket size (branch Run) : 32000
... processing event 3390000
... tree basket size (branch Run) : 32000
... processing event 3400000
... tree basket size (branch Run) : 32000
... processing event 3410000
... tree basket size (branch Run) : 7185920
... processing event 3420000
... tree basket size (branch Run) : 7185920
... processing event 3430000
... tree basket size (branch Run) : 7185920

mem_usage_withTreeFill.pdf (14.4 KB)
mem_usage_noTreeFill.pdf (15.7 KB)


ROOT Version: 6.12/07
Platform: Scientific Linux 7.9 (Nitrogen)
Compiler: gcc 7.3.1

pcanal · April 22, 2021, 11:39pm

The increase is expected. The scale is not (2GB !?)

After the TTree has filled a good amount of data (32 MB of compressed data), it resizes the baskets so that the data needed to produce 32 MB of compressed output fits into one basket per TBranch. This essentially results in a memory allocation of 32 MB times the compression ratio.

If the numbers quoted above are correct, this would indicate a compression factor of nearly 43 (i.e. the data is almost all repeats).
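As a rough check of this arithmetic, here is a minimal C++ sketch. It assumes exactly 200 branches (the original post says "~200"), that every branch ends up with the same basket size as the one printed for "Run", and that "32 MB" means 32 × 1024 × 1024 bytes. The function names are hypothetical helpers, not part of ROOT.

```cpp
#include <cassert>
#include <cstdint>

// Total bytes held by in-memory baskets, assuming every branch has the
// same basket size (only the "Run" branch was actually printed).
constexpr std::int64_t totalBasketBytes(std::int64_t basketSizeBytes,
                                        std::int64_t nBranches) {
    return basketSizeBytes * nBranches;
}

// Compression factor implied by "memory = compressed target x compression ratio".
inline double impliedCompressionFactor(std::int64_t totalBytes,
                                       std::int64_t compressedTargetBytes) {
    assert(compressedTargetBytes > 0);
    return static_cast<double>(totalBytes) /
           static_cast<double>(compressedTargetBytes);
}
```

With the numbers from the printout, totalBasketBytes(7185920, 200) gives 1437184000 bytes (about 1.4 GB), and dividing that by 32 × 1024 × 1024 gives a factor of roughly 42.8, consistent with the "nearly 43" quoted above.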

You can easily verify this by running your example locally and calling TTree::Print() after 3420000 entries/events.

The size at which the basket size optimization happens can be controlled via a call to TTree::SetAutoFlush (a negative value expresses the limit in bytes of compressed data, a positive value expresses the limit in number of entries).
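To illustrate how one might pick a value for TTree::SetAutoFlush, here is a minimal sketch based only on the rule of thumb stated above (memory used by the baskets is roughly the compressed limit times the compression ratio). The helper name and the memory budget are hypothetical, the compression factor has to be estimated from a test run (for example from TTree::Print()), and to my understanding of the ROOT documentation a negative argument is interpreted as a number of bytes. The real memory footprint will differ from this simple model.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a negative value suitable for
// TTree::SetAutoFlush so that (compressed limit x compression factor)
// stays at or below the given memory budget, under the simple model
// "basket memory = compressed limit x compression factor".
inline std::int64_t autoFlushArgForBudget(std::int64_t memoryBudgetBytes,
                                          double compressionFactor) {
    assert(memoryBudgetBytes > 0 && compressionFactor >= 1.0);
    const auto compressedBytes =
        static_cast<std::int64_t>(memoryBudgetBytes / compressionFactor);
    return -compressedBytes;  // negative: limit expressed in compressed bytes
}

// Intended use in a ROOT program (requires ROOT, shown as a comment only):
//   tree->SetAutoFlush(autoFlushArgForBudget(500000000, 43.0));
// called once after creating the output TTree and before the first Fill().
```

For a budget of 500000000 bytes and a compression factor of 43, the helper returns -11627906, i.e. a limit of about 11.6 MB of compressed data.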

Cheers,
Philippe.

LucaC · April 23, 2021, 9:06am

Hi Philippe,

thanks a lot for the explanation. Indeed, the compression factors of individual branches vary a lot, from 1.2 (for float branches that take a different value per event) up to 145 (for branches that are not filled in this case and always take the same default value).
Overall the Tree compression factor that is reported by TTree::Print() is 16.77.

I am not sure whether this is just a consequence of the large (but is it so large?) number of branches, so that I have to control the memory used by setting something below 32 MB in TTree::SetAutoFlush().
What would be the effect in terms of performance in this case?

Or could it be a side effect of some property of the Tree (for example the high compression factor of some branches that all take the same value)?

Cheers,
Luca

eguiraud · April 23, 2021, 9:53am

(Do you have to read the branches with compression factor 145? If you don’t read them, you won’t have high memory usage due to exceedingly well-compressed data being uncompressed into memory buffers)

(Also I realize that would complicate the code with some branching, but a possibly cleaner design would be to not have those branches at all if they are not filled with anything meaningful)

LucaC · April 23, 2021, 10:15am

Hi, indeed I guess that the best solution is just to skip the creation of those branches for the cases where they are not going to be filled. I did not suspect that this could cause high memory usage, thanks a lot for the feedback!
Cheers,
Luca

eguiraud · April 23, 2021, 11:00am

As a quicker workaround you could also just not read them back. As a less quick workaround you could write them out with a very small basket size (but it’s useless work as you are writing meaningless data anyway).

LucaC · April 23, 2021, 2:06pm

Would you have an example of how to avoid reading them back in this case? I am not reading the TTree in this code, just writing it to disk (of course I imagine that the same issue would show up when I read the same tree in the downstream code).

eguiraud · April 23, 2021, 2:09pm

Ah my mistake, if the problem is when writing data, of course “don’t read it back” is unrelated – when you do read the TTree back, you can e.g. deactivate the reading of all branches with tree.SetBranchStatus("*", 0) and then only reactivate those that you need (high-level interfaces like RDataFrame do that for you automatically).
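For the reading side, a minimal sketch of this pattern could look as follows. It needs a ROOT installation to compile or run as a macro. The function name, file name and tree name are placeholders, and the type of the "Run" branch (Int_t) is an assumption, since the thread does not state it.

```cpp
#include "TFile.h"
#include "TTree.h"
#include <memory>

// Hypothetical example: read only the "Run" branch of a tree.
// Returns the number of entries read, or -1 if the file or tree is missing.
Long64_t readRunOnly(const char *fileName, const char *treeName) {
    std::unique_ptr<TFile> file(TFile::Open(fileName));
    if (!file || file->IsZombie()) return -1;

    TTree *tree = nullptr;
    file->GetObject(treeName, tree);
    if (!tree) return -1;

    tree->SetBranchStatus("*", 0);    // deactivate every branch
    tree->SetBranchStatus("Run", 1);  // reactivate only what is needed

    Int_t run = 0;  // assumed type of the "Run" branch
    tree->SetBranchAddress("Run", &run);

    const Long64_t nEntries = tree->GetEntries();
    for (Long64_t i = 0; i < nEntries; ++i) {
        tree->GetEntry(i);  // only active branches are read and decompressed
        // ... use `run` here ...
    }
    return nEntries;
}
```

Deactivated branches are skipped by GetEntry, so their baskets should not need to be read or uncompressed during the loop.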
