Estimating the size of THnSparse

Hello!
To get some estimates, I am trying to compute the memory size of different THnSparse histograms depending on the number of dimensions, bins, and entries. I read in the documentation that:

Bin data (content and coordinates) are allocated in chunks of size fChunkSize;

So I try doing the following:

sparse->GetNChunks() * sparse->GetChunkSize()

This, however, does not give plausible results. For example, I see 2 chunks of size 16k for a sparse integer histogram with 8 dimensions, 32k bins per dimension, and 32k randomly distributed entries (they are random, I’ve checked). On the other hand, I would expect it to use at least 32k (entries) * (4 B (integer) + 16 * 8 (space for the coordinates)).

Would you have an idea what I am wrongly assuming? Is there a correct way to compute the absolute memory size of a sparse histogram?

This is how I create the histogram; maybe I am doing it incorrectly:

  const size_t bins = 32768;
  const size_t dim = 8;
  const size_t entries = 32768;

  const Double_t min = 0.0;
  const Double_t max = 1000000.0;
  const std::vector<Int_t> binsDims(dim, bins);
  const std::vector<Double_t> mins(dim, min);
  const std::vector<Double_t> maxs(dim, max);
  auto* h = new THnSparseI("test", "test", dim, binsDims.data(), mins.data(), maxs.data());

I will be grateful if you could give me some hints!
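For reference, here is a minimal sketch of the filling part I left out (TRandom3 is a stand-in generator here; my actual test uses MixMax):

```cpp
#include <vector>
#include "THnSparse.h"
#include "TRandom3.h"

// Fill the sparse histogram with uniformly distributed random entries
// and print the chunk bookkeeping. Requires ROOT.
void fillAndInspect(THnSparse* h, size_t entries, Double_t min, Double_t max) {
  TRandom3 rng(0);  // stand-in for the MixMax generator used in my test
  std::vector<Double_t> x(h->GetNdimensions());
  for (size_t i = 0; i < entries; ++i) {
    for (auto& xi : x) xi = rng.Uniform(min, max);
    h->Fill(x.data());
  }
  printf("chunks: %lld, chunk size: %lld, filled bins: %lld\n",
         (long long)h->GetNChunks(), (long long)h->GetChunkSize(),
         (long long)h->GetNbins());
}
```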

ROOT Version: v6.20.02
Platform: CC7
Compiler: GCC v7.3.0

I cannot follow your numbers, but note that sparse histograms do not allocate space for each bin. For each event that ends up in a unique bin, they have to store 9 ints (or 10 if you use weighting): the bin count, the coordinates, and possibly the sum of weights.
So I estimate the maximal size in your case as 10 (int) * 4b = 40b.
Now, many events will fall into the same bin. This increments the count but doesn’t need any additional memory. You could compute the average bin occupancy, and then compute:
40b * 32k Evts / averageBinOccupancy.

Sorry, I can’t follow your numbers either :wink: But if I take it as true, then:
In my case, I am pretty sure the entries (events) don’t fall into the same bins twice: the entries are generated with the MixMax generator, and there are 32k bins in each of the 8 dimensions, so the chance of hitting the same bin twice is pretty small.

So I assume averageBinOccupancy == 1.
Then it is 40b * 32k = 1280kb = 160 kB (b was a bit, right?)
This still doesn’t match the size estimated from sparse->GetNChunks() * sparse->GetChunkSize(), nor my understanding.

Let me elaborate on my numbers:
I use THnSparseI, each of 8 dimensions has 32k bins. I fill the histogram with 32k entries, which should not fall into the same bins twice, with some marginal exceptions.

Then, for each bin we need:
bin count: 4 B, because I use THnSparseI
coordinates: 8 * 16 bits = 8 * 2 B = 16 B, because there are 8 dimensions and each axis has 32k bins, so the minimal bit size for each bin address is 15 or 16 bits (2^16 = 64k). See the documentation: “The coordinates are compacted to use as few bits as possible”
So with 32k entries, it becomes 32k * (4 B + 16 B) = 640 kB

Now, this is a quite plausible number to me, because after I serialize such a histogram, the buffer has 1,056,832 B.
My problem is that I would like to get the number directly from ROOT, so I can trust it to be more precise than my estimates. However, the method using the chunk count and size does not seem to provide reliable results.

You didn’t post a runnable example, so we cannot know. You would have to measure it yourself, e.g. by asking the histogram for its maximum, or by using

The number of filled bins is returned by THnSparse::GetNbins();
from the documentation.

No, an integer is 4 bytes, so I meant bytes, B.

I didn’t know they compact the coordinates, nice. So yes, you essentially did the calculation that I proposed, only the coordinates can be much smaller than I thought. 640 kB looks reasonable for 32k entries.

Note that the chunk size is not in bytes but in entries. The size is roughly

GetNChunks * ChunkSize * EntrySize = 2 * 16k * (4 B + 16 B) = 640 kB

Perfect fit I would say.

OK, so my mistake was to assume that GetChunkSize() returns the size in bytes. As far as I can see, there is no method to get the entry (bin) size, so I guess I won’t get any closer than estimating it.

Thanks for the help!
