Zipping data of more than 16MB

I’ve discovered recently (the hard way) that the zipping methods provided by the ROOT framework are limited to data sizes below 16 MB. I’m referring to the methods R__zip and R__unzip declared in root/RZip.h at master · root-project/root · GitHub and their practical implementations in e.g. root/ZipZSTD.cxx at master · root-project/root · GitHub or root/ZipLZMA.c at master · root-project/root · GitHub.

There are essentially two issues here. The first was a missing check that the maximum size is respected in the ZSTD case; that one is handled via Lack of size validation in ZSTD compression · Issue #9334 · root-project/root · GitHub.

I would like to discuss the other issue here: why such a limitation, and can we remove it? The size is passed to these methods as an int, so in principle we could handle up to 2 GB (4 GB if it were unsigned). But the limitation comes from the format of the generated byte stream: the header is 9 bytes, namely 3 magic bytes (ZS\1) and the two sizes (original and zipped) on 3 bytes each, so each size is capped at 2^24 - 1 bytes, i.e. just under 16 MB.
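To make the layout concrete, here is how I understand the packing of the two 3-byte sizes (a sketch of my reading of the ROOT sources; the exact byte order may differ in details):

```cpp
// Current 9-byte header, as I understand it:
//   bytes 0-2 : magic, e.g. 'Z' 'S' '\1' for ZSTD
//   bytes 3-5 : zipped size on 3 bytes   -> at most 0xffffff, i.e. just under 16 MB
//   bytes 6-8 : original size on 3 bytes
void writeSizes3(unsigned char *hdr, unsigned int zipSize, unsigned int origSize)
{
   hdr[3] = zipSize & 0xff;  hdr[4] = (zipSize >> 8) & 0xff;  hdr[5] = (zipSize >> 16) & 0xff;
   hdr[6] = origSize & 0xff; hdr[7] = (origSize >> 8) & 0xff; hdr[8] = (origSize >> 16) & 0xff;
}
```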

What about having a new magic sequence (ZS\2) and an 11-byte header with the sizes on 4 bytes each? Does that sound feasible?
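In code, the proposed ZS\2 variant would just widen the two size fields (purely a sketch of the proposal; the function name is mine):

```cpp
// Hypothetical 'ZS\2' header: same 3 magic bytes, then the two sizes on 4 bytes each
// (11 bytes total), lifting the per-buffer limit from 16 MB to 4 GB.
void writeSizes4(unsigned char *hdr, unsigned int zipSize, unsigned int origSize)
{
   hdr[0] = 'Z'; hdr[1] = 'S'; hdr[2] = 2;         // new magic sequence
   for (int i = 0; i < 4; ++i) {
      hdr[3 + i] = (zipSize >> (8 * i)) & 0xff;    // zipped size on 4 bytes
      hdr[7 + i] = (origSize >> (8 * i)) & 0xff;   // original size on 4 bytes
   }
}
```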

Some potential alternatives:

This limitation comes from a time when saving 6 vs 8 bytes mattered and when having more than 16 MB in a buffer seemed like a very exceptional case (especially since computers had around 64 MB of RAM at the time :slight_smile: ).

Changing the magic sequence and the header size is an issue for "forward compatibility" (can an older version of ROOT read a file produced with a newer version of ROOT?) and would probably need to be an opt-in feature. [Somewhat related: we are planning to soon extend the maximum size of the buffer (the one that is compressed in 16 MB chunks) to more than 2 GB, so this improvement might be part of that work.]

Important question (to judge the necessity of this change): besides the bug in the ZSTD support code, have you encountered a significant downside of this chunking?

Thanks,
Philippe

To be honest, for the moment we’ve only discovered the limitation and have not tried to work around it. I doubt that doing the chunking ourselves would lead to really bad performance (in either size or execution time); it’s just cumbersome and feels strange. Clearly, forward compatibility would be broken, but at this stage machines with 64 MB of memory are mostly gone, no? And it would only mean that reading back a file written by ROOT > vx needs ROOT > vx, which looks both reasonable and unavoidable at some stage.
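For concreteness, this is roughly what "doing the chunking ourselves" would look like on the compression side (a minimal sketch, assuming the R__zip signature from RZip.h; the chunk size, the output head-room and the error handling are my assumptions, and the decompression path is left out):

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>
#include "RZip.h"   // R__zip

// Compress a buffer of arbitrary size by feeding R__zip pieces below the 16 MB limit.
std::vector<char> zipInChunks(const char *src, std::size_t srcSize, int cxlevel = 1)
{
   constexpr std::size_t kChunk = 0xffffff;        // largest value a 3-byte size field can hold
   std::vector<char> out;
   std::size_t offset = 0;
   while (offset < srcSize) {
      int inSize  = static_cast<int>(std::min(kChunk, srcSize - offset));
      int tgtSize = inSize + 64;                   // header + slack for incompressible data (guess)
      std::vector<char> tgt(tgtSize);
      int irep = 0;                                // bytes written, 0 on failure
      R__zip(cxlevel, &inSize, const_cast<char *>(src + offset), &tgtSize, tgt.data(), &irep);
      if (irep <= 0)
         throw std::runtime_error("R__zip failed on a chunk");
      out.insert(out.end(), tgt.data(), tgt.data() + irep);
      offset += static_cast<std::size_t>(inSize);
   }
   return out;
}
```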

So (besides the zstd bug) it seems that the only downside of chunking is lower compression performance, and thus I don’t see a great incentive to break forward compatibility just for that. On the other hand, when we introduce support for buffers larger than 2 GB, that will in itself be a forward-incompatible change (when used) and thus a good place to introduce the larger chunking.

Hi @pcanal, I do not think we are talking about breaking backward/forward compatibility. If the new compression code uses ZS\1 for buffers smaller than 16 MB and ZS\2 (with the extended header) for larger buffers, old versions of ROOT will have trouble only with files that actually use the ZS\2 version, and ROOT files (at the moment) only use small chunks, so there is no problem. LHCb will use the zipping/unzipping functions with larger buffers (not always) in custom files (not ROOT files), so we will end up with files that we may not be able to read with older versions of ROOT, but that’s not a problem for us.
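To illustrate, the reading side could simply look at the version byte after the 'Z' 'S' magic to pick the header width, something like this (a sketch; all names are mine, not ROOT’s):

```cpp
// Decode the header of a ZSTD chunk, accepting both the existing 'ZS\1' layout
// (3-byte sizes, 9-byte header) and the proposed 'ZS\2' layout (4-byte sizes, 11 bytes).
struct ChunkInfo {
   unsigned int zipSize;     // size of the compressed payload
   unsigned int origSize;    // size of the original data
   unsigned int headerBytes; // where the payload starts
};

ChunkInfo readZstdHeader(const unsigned char *h)
{
   auto get = [&](int first, int nbytes) {
      unsigned int v = 0;
      for (int i = 0; i < nbytes; ++i)
         v |= static_cast<unsigned int>(h[first + i]) << (8 * i);
      return v;
   };
   if (h[2] == 1)
      return {get(3, 3), get(6, 3), 9};   // old format: everything below 16 MB
   return {get(3, 4), get(7, 4), 11};     // new format: up to 4 GB per buffer
}
```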