Performance of tree partial I/O

linev · December 3, 2003, 5:16pm

Hello

I have a question about partial I/O in trees.

If I understand correctly, the idea of partial I/O is to reduce the amount of data read from a tree and, as a result, to speed up program execution. I found no performance benchmarks for partial I/O in the ROOT tutorials (I am particularly interested in partial reads of a tree), so I wrote a simple script to test this feature of ROOT trees.

What does it do? It measures the time needed to read a tree from a file when one or several branches are activated.

The tree is generated by the script. It has 10 branches, and each branch contains a fixed-size array of doubles.

To exclude file caching, I wrote a function PurgeMemory(), which tries to allocate all physical memory and fill it. This seems to force the system to release all file cache buffers.
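The idea behind this can be sketched roughly as follows. This is only an illustration, not the code from the attached script: the name TouchMemory and its megabyte argument are made up here, and whether the system really drops its file cache in response depends on the operating system.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for PurgeMemory() (not the code from the attachment).
// Allocates `megabytes` blocks of 1 MB each and writes one byte per 4 KB page,
// so the system has to back every page with physical memory.
// Returns the number of bytes that were allocated and touched.
std::size_t TouchMemory(std::size_t megabytes)
{
   const std::size_t chunk = 1024 * 1024;   // 1 MB per block
   const std::size_t page  = 4096;          // assumed page size
   std::vector<std::vector<char> > blocks;
   std::size_t total = 0;
   for (std::size_t i = 0; i < megabytes; ++i) {
      blocks.push_back(std::vector<char>(chunk));
      std::vector<char> &block = blocks.back();
      for (std::size_t offset = 0; offset < chunk; offset += page)
         block[offset] = 1;                 // force the page to be resident
      total += chunk;
   }
   return total;                            // blocks are freed on return
}
```

In the test the argument would be the RAMSIZE value, so that the allocation covers the whole physical memory.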

After the first tests I found that the reading speed depends drastically on the branch buffer size (the so-called basket size). The script therefore repeats the test with different buffer sizes. The first time it generates a tree whose buffer size equals the branch data size (not a very good idea, but it works), the next time the buffer size is doubled, and so on, up to the limit where the buffers are 512 times bigger than the data in a branch.
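To make the sweep concrete, here is a minimal sketch of how such a list of buffer sizes could be produced. The helper BasketSizeSweep is hypothetical and is not taken from the attached script; it only reproduces the doubling rule described above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from PerfomanceTest.C): buffer sizes for a sweep
// that starts at the branch data size and doubles until it reaches
// maxFactor times the data size.
std::vector<std::size_t> BasketSizeSweep(std::size_t dataBytes,
                                         std::size_t maxFactor = 512)
{
   std::vector<std::size_t> sizes;
   for (std::size_t factor = 1; factor <= maxFactor; factor *= 2)
      sizes.push_back(dataBytes * factor);
   return sizes;
}
```

For 80 bytes of data per branch this gives 10 steps, from 80 up to 40960 bytes. In ROOT each value would presumably be passed as the bufsize argument of TTree::Branch when the tree is created.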

The script measures the real time and CPU time needed to read such trees and writes the results to another small tree. The ShowTestResults() function creates a 2D histogram from this tree, which shows how the time depends on the number of activated branches and on the relative size of the branch buffer (on a logarithmic scale).

The script performs two tests.
The first time it generates trees with 1000000 events; each event has 10 branches, with 10 doubles (80 bytes) in each branch. The results can be seen in the Tree_small_real.gif file.
The second time it generates trees with 10000 events; each event has 10 branches, with 1000 doubles (8000 bytes) in each branch. The results can be seen in the Tree_large_real.gif file.
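As a quick check of the data volumes involved, the payload of both configurations can be computed as below. The function name is made up for this illustration, and the estimate ignores compression and any ROOT bookkeeping overhead.

```cpp
#include <cstdint>

// Hypothetical helper: raw payload of a tree in bytes, ignoring compression
// and file overhead. Assumes all branches hold the same number of doubles.
std::uint64_t UncompressedTreeBytes(std::uint64_t events,
                                    std::uint64_t branches,
                                    std::uint64_t doublesPerBranch)
{
   return events * branches * doublesPerBranch * sizeof(double);
}
```

With 8-byte doubles, both tests come to the same 800000000 bytes of payload (about 800 MB), and a single branch accounts for one tenth of that, about 80 MB.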

I used ROOT version 3.10/01, compiled under Debian with gcc 2.95.4. The script runs in compiled mode, using ACLiC. My computer is an Athlon 1800+ MX with 512 MB RAM. On my computer the script runs for about 2 hours, mostly because of the long tree generation time and the delays between tests.

The results I see confuse me. When the branch data size is small (only 80 bytes), I can use a buffer (basket) size that is 100 times bigger than my data. But in this situation it makes no difference whether I read only 1 branch or all 10 branches.
On the other hand, when the branch data size is fairly big (8000 bytes), I am not always able to use a buffer (basket) size 10 times bigger. And again, in that situation I gain practically nothing by reading only 1 branch instead of all 10.

I also looked at the CPU time. It gives nice results: in all cases you can gain a factor of 10. But for me the real time is much more interesting, since that is what I wait for when I sit in front of the display until the program has finished its job.
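The difference between the two clocks can be shown with a small sketch in plain C++. This is not how the attached script measures time (in ROOT one would more likely use TStopwatch, with its RealTime() and CpuTime() methods); the names Timing and MeasureTimes are invented for this example.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical result type: wall-clock and CPU seconds spent in one call.
struct Timing {
   double realSeconds;
   double cpuSeconds;
};

// Runs `work` once and reports both elapsed wall-clock time and the
// processor time consumed by the process during the call.
template <typename Work>
Timing MeasureTimes(Work work)
{
   const std::chrono::steady_clock::time_point wallStart =
      std::chrono::steady_clock::now();
   const std::clock_t cpuStart = std::clock();
   work();
   const std::clock_t cpuStop = std::clock();
   const std::chrono::steady_clock::time_point wallStop =
      std::chrono::steady_clock::now();

   Timing t;
   t.realSeconds = std::chrono::duration<double>(wallStop - wallStart).count();
   t.cpuSeconds  = static_cast<double>(cpuStop - cpuStart) / CLOCKS_PER_SEC;
   return t;
}
```

If the work mostly waits (for the disk, or in a test simply a call to std::this_thread::sleep_for), the real time grows while the CPU time stays small on systems where std::clock counts processor time only. So a factor of 10 in CPU time does not have to show up in real time.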

Can somebody explain what I am doing wrong, or maybe what I am missing?

P.S. Anybody who wants to run this script should set the RAMSIZE constant to the actual size of the computer's physical memory in MB; otherwise file caching will play a significant role in all tests. The computer should also have about 1 GB of free disk space.

PerfomanceTest.C (6.38 KB)

brun · December 4, 2003, 8:05am

It does not make sense to create branches with a buffer size of less than 1000 bytes. The default buffer size of 32000 bytes is in general a good starting value.

Rene

linev · December 4, 2003, 8:26am

Yes, it is always good to have big buffers.
But if I have a small data size in a branch (for me it is 80 bytes) and relatively big buffers (more than 10000 bytes),
I see no difference at all between reading 1 branch and reading all 10 branches. This can be seen in the Tree_small_Real.gif picture.

brun · December 5, 2003, 2:30am

This is not what I see! In your example you are still using very small
buffer sizes. Please send a small test if you want me to have a chance to
investigate.

Rene

linev · December 5, 2003, 9:28am

The plot shows the base-2 logarithm of the ratio between buffer size and data size. Here 0 means that buffer size = data size, and 9 means buffer size = 512 * data size. In that case the buffer is about 40000 bytes, which is near the default 32K.

I attach a script which runs only these two extreme situations: a very small buffer size (basket size = data size = 80 bytes) and a large buffer size (basket size = 1000 * data size = 80000 bytes). You will see how the execution time depends on the number of active branches in these two cases.

It takes about 10 minutes. Most of the time is spent generating the tree with the small basket size. I only want to stress that RAMSIZE should be the real size of memory, otherwise file caching will play a significant role. I run these tests with ACLiC.
ShortTest.C (3.75 KB)