Reduce TTree key size

ROOT Version: 6.30.08
Platform: linuxx8664gcc
Compiler: g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Hi,

I am trying to optimize the compressed size of a .root file and I’m confused about how TTree compression and key sizes interact.
Sometimes I see a large “key” for the same kind of data, but it varies across files. Below is the macro I use to list file keys and report the compressed tree size:

#include <iomanip>
#include <iostream>

#include "TFile.h"
#include "TKey.h"
#include "TTree.h"

int size(const char* filename = "out.root") {
  const double MB = 1.0e6;

  TFile f(filename, "READ");
  auto* t = (TTree*)f.Get("cbmsim");

  std::cout << std::fixed << std::setprecision(3);

  // List file keys and sum their sizes
  double sum = 0.;
  if (auto* keys = f.GetListOfKeys()) {
    TIter it(keys);
    while (auto* obj = it.Next()) {
      auto* k = (TKey*)obj;
      const double mb = k->GetNbytes() / MB;
      std::cout << k->GetName() << ' ' << k->GetClassName() << ' ' << mb << " MB\n";
      sum += k->GetNbytes();
    }
  }
  std::cout << " - - - - - - - - \n";

  std::cout << "Sum of key sizes: " << (sum / MB) << " MB\n";
  std::cout << "Tree size (compressed): " << static_cast<double>(t->GetZipBytes()) / MB << " MB\n";
  std::cout << "File size       : " << (static_cast<double>(f.GetSize()) / MB) << " MB\n";

  return 0;
}

A typical output is this one

cbmout                    TFolder               0.000 MB
BranchList                TList                 0.001 MB
TimeBasedBranchList       TList                 0.000 MB
FileHeader                FairFileHeader        0.000 MB
cbmsim                    TTree                 5.522 MB
cbmsim                    TTree                 0.033 MB
- - - - - - - - - -
Sum of key sizes:          5.557 MB
Tree size (compressed):    61.341 MB
File size :                66.916 MB

Here, one cbmsim key is ~5.5 MB. (In other files of the same type it can be < 1 MB or > 10MB; I don’t see a clear pattern.)

I tried to get rid of the auto save

tree->SetAutoSave(0);

This removes the second (small) cbmsim key, as expected, but the large one remains.

Then I tried to set only a single cluster

tree->SetAutoFlush(0);

output:

cbmout                    TFolder               0.000 MB
BranchList                TList                 0.001 MB
TimeBasedBranchList       TList                 0.000 MB
FileHeader                FairFileHeader        0.000 MB
cbmsim                    TTree                 0.768 MB
 - - - - - - - - 
Sum of key sizes:       0.769 MB
Tree size (compressed): 66.807 MB
File size :             67.585 MB

Now the key is small (good), but the compressed tree size increased from ~61 MB to ~67 MB. I expected that with no auto flush (baskets written only when full) compression would stay the same or slightly improve, not get worse.

I am going wrong somewhere but can’t figure out where. To summarize my problem, I would ask:

  • Why does the TTree key size vary so much between files of the same structure?
  • Why can SetAutoFlush(0) increase the compressed size of the tree?
  • Is there a recommended way/tooling to reduce key size without hurting the compressed tree?

Thanks,
Clement

Hi @Clement_Devanne,

I believe @pcanal can help you here.

Cheers,

Marta

Use

std::cout << k->GetName() << " (" << k->GetCycle() << ") " << k->GetClassName() << ' ' << mb << " MB\n";

to distinguish the current/last key from a backup. For example in:

cbmsim                    TTree                 5.522 MB
cbmsim                    TTree                 0.033 MB

The first line is the current key while the second one is the backup.

Here, one cbmsim key is ~5.5 MB. (In other files of the same type it can be < 1 MB or > 10MB; I don’t see a clear pattern.)

You must compare this against the number of entries in the TTree, which will (of course) greatly affect the size.
So also use:

std::cout << "Number of TTree entries: " << t->GetEntries() << "\n";

I also recommend you look at the uncompressed size of the key itself:

      auto* k = (TKey*)obj;
      const double mb = k->GetNbytes() / MB;
      const double uncompressed_size = k->GetObjlen() / MB;
      std::cout << k->GetName() << " (" << k->GetCycle() << ") " << k->GetClassName() << " compressed size: " << mb << " MB, full size: " << uncompressed_size << " MB\n";
      sum += k->GetNbytes();

Why does the TTree key size vary so much between files of the same structure?

The (uncompressed) size of the TTree is strictly linearly correlated to the number of baskets used to store the data.

Then I tried to set only a single cluster

This is actually likely to increase the number of baskets, and thus the size of the TTree key. Disabling AutoFlush prevents the mechanism that resizes the baskets to aim towards using one basket per cluster. That mechanism usually (unless you have a lot of branches) results in larger baskets (the default is 32KB per basket).
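To get a feel for why the basket size matters here, the following is a rough back-of-the-envelope sketch in plain C++ (no ROOT needed). The function name `approxBasketCount` is made up for illustration, and the model is a simplification: it assumes a single branch whose baskets are all completely filled before being written, which real trees will not follow exactly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a ROOT function): approximate number of baskets
// needed by one branch, computed as ceil(uncompressedBytes / basketBytes).
// Assumption: every basket is filled completely before it is written.
std::int64_t approxBasketCount(std::int64_t uncompressedBytes,
                               std::int64_t basketBytes) {
  assert(uncompressedBytes >= 0);
  assert(basketBytes > 0);
  // Integer ceiling division.
  return (uncompressedBytes + basketBytes - 1) / basketBytes;
}
```

Under these assumptions, 64 MB of uncompressed data in one branch needs about 2000 baskets at a 32000-byte basket size, but only 64 baskets at 1 MB per basket. Since the key grows with the number of baskets, larger baskets should translate into a smaller key.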

Is there a recommended way/tooling to reduce key size without hurting the compressed tree?

The recommendation is to set AutoFlush as high as your available memory allows. You can set it either to a number of entries (a positive number) or to a target compressed size for the whole cluster (a negative number). In the latter case, the memory required will be that number plus that same number multiplied by the average compression ratio of the data itself.
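As a quick worked example of that memory rule of thumb, here is a small sketch in plain C++. The helper names `estimateFlushMemoryBytes` and `autoFlushArgument` are hypothetical, and the compression ratio (uncompressed size divided by compressed size) is something you would have to measure on your own data.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: rough memory needed for a byte-based AutoFlush target,
// following the rule of thumb above: target + target * compressionRatio.
// compressionRatio is assumed to be uncompressed size / compressed size.
std::int64_t estimateFlushMemoryBytes(std::int64_t targetCompressedBytes,
                                      double compressionRatio) {
  assert(targetCompressedBytes > 0);
  assert(compressionRatio > 0.0);
  return targetCompressedBytes +
         static_cast<std::int64_t>(targetCompressedBytes * compressionRatio);
}

// Hypothetical helper: a byte-based target is passed to SetAutoFlush
// as a negative number, so this just flips the sign.
std::int64_t autoFlushArgument(std::int64_t targetCompressedBytes) {
  assert(targetCompressedBytes > 0);
  return -targetCompressedBytes;
}
```

For instance, with a 30 MB compressed target and an average compression ratio of 4, the estimate is 30 MB + 120 MB = 150 MB, and the value to hand over would be something like tree->SetAutoFlush(autoFlushArgument(30000000)), i.e. -30000000.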