I haven’t found a consistent answer, so I will ask the question again. For our analysis we make intermediate ntuples (flat TTrees). Adding systematics can give ~2000 branches in a tree. We moved to this model from having one tree per systematic, as it significantly reduced the storage needed.
But now I observe quite slow looping over the trees. Note that we merge ntuples into relatively large files (up to 90 GB) so that the number of events per file is more consistent.
The questions are:
Does the number of events in a tree matter? We can keep the individual file size smaller.
How does the number of branches affect the performance? We do not load all branches at once when processing the file (only ~30 branches).
Should we use non-default settings when creating a tree with such a large number of branches?
Thanks in advance for all answers (and sorry if some have already been answered before).
Hi,
I’m not the expert but maybe I can provide some insight while we wait for the authoritative replies.
If you use TTree directly, you might want to SetBranchStatus("*", 0) and then SetBranchStatus("...", 1) for only the branches you need to read.
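A minimal sketch of this pattern (the file name, tree name, and branch names here are placeholders, not from the original post):

```cpp
#include "TFile.h"
#include "TTree.h"

void readFewBranches() {
   TFile f("mytree.root");                      // hypothetical input file
   TTree *tree = (TTree *)f.Get("mytree");      // hypothetical tree name

   tree->SetBranchStatus("*", 0);               // disable all ~2000 branches
   tree->SetBranchStatus("pt", 1);              // re-enable only the branches
   tree->SetBranchStatus("eta", 1);             // you actually read

   float pt, eta;
   tree->SetBranchAddress("pt", &pt);
   tree->SetBranchAddress("eta", &eta);

   const Long64_t n = tree->GetEntries();
   for (Long64_t i = 0; i < n; ++i)
      tree->GetEntry(i);                        // only enabled branches are read
}
```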
The number of events in a TTree only matters insofar as the time to process a TTree increases roughly linearly with the number of events, as one would expect.
The number of branches actually read affects performance (TTree might read branches that you don’t need if their branch status is 1); the total number of branches in a TTree should not matter much, afaik.
What I have tested now is making the files smaller, and I think they do run faster. So I guess that seeking to some event N takes longer if the file is larger.
seeking to some event N takes longer if the file is larger.
Yes, that’s the case: TTrees are not optimized for random access, but rather for sequential reading.
EDIT: in other words, if you go from event N to event N+1, that should be fast, no matter how large N is. if you skip a large amount of events, or perform random access, that will typically be slower.
If you use SetBranchStatus and TTree::GetEntry, the GetEntry function has to loop over all the branches to check which ones are enabled or not. With a larger number of branches, you are better off using LoadTree and TBranch::GetEntry. [Note that if you are indeed accessing the entries in random order, the effect I describe will be minor compared to the cost of reading and decompressing some/most of the data multiple times.]
No, if you only have one TTree you can skip the call to LoadTree. (On the other hand, adding it (even if you call it on a TTree) will make your code ‘TChain’ ready … albeit reading out-of-order in a TChain would be even worse )
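A sketch of the per-branch reading pattern described above; the branch names are placeholders, and `tree` is assumed to be an already-opened TTree:

```cpp
#include <vector>
#include "TBranch.h"
#include "TTree.h"

void loopWithBranchGetEntry(TTree *tree) {
   float pt, eta;
   tree->SetBranchAddress("pt", &pt);           // hypothetical branch names
   tree->SetBranchAddress("eta", &eta);

   std::vector<TBranch *> branches;             // only the branches you read
   branches.push_back(tree->GetBranch("pt"));
   branches.push_back(tree->GetBranch("eta"));

   const Long64_t n = tree->GetEntries();
   for (Long64_t i = 0; i < n; ++i) {
      const Long64_t local = tree->LoadTree(i); // a no-op for a plain TTree,
                                                // but makes the loop TChain-ready
      for (TBranch *b : branches)
         b->GetEntry(local);                    // no per-entry scan over all
   }                                            // ~2000 branches
}
```

Note that for a real TChain the TBranch pointers would need to be refreshed whenever LoadTree crosses a file boundary.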
Thank you all for your suggestions. I could improve the performance a lot.
But now the main problem is that I chose an unfortunate ordering of the branches.
The average read transaction is 26.590588 Kbytes. Is there a way to force larger transactions in TTreeCache without remaking the tree?
The cache size after the first GetEntry call is 6606867.
This is the final report after 100k events (with all the default settings):
Number of branches in the cache ...: 91
Cache Efficiency ..................: 0.963868
Cache Efficiency Rel...............: 1.000000
Secondary Efficiency ..............: 0.000000
Secondary Efficiency Rel ..........: 0.000000
Learn entries......................: 100
Cached Reading.....................: 96357395 bytes in 3393 transactions
Reading............................: 0 bytes in 0 uncached transactions
Readahead..........................: 256000 bytes with overhead = 472156423 bytes
Average transaction................: 28.398879 Kbytes
Number of blocks in current cache..: 3312, total size: 6577808
And these are the TFile statistics: Read 0.151589 GB in 3398 transactions.
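For reference, the TTreeCache can be tuned from user code without remaking the tree; a minimal sketch of the relevant knobs (the values here are illustrative, not recommendations, and `tree` is assumed to be an already-opened TTree):

```cpp
#include "TTree.h"

void tuneCache(TTree *tree) {
   tree->SetCacheSize(100 * 1024 * 1024);  // e.g. 100 MB instead of the default
   tree->SetCacheLearnEntries(100);        // entries used to learn which baskets
                                           // to prefetch (default shown above)
   tree->AddBranchToCache("*", kTRUE);     // or add only the branches you read
}
```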
Something weird happened now. I processed the same file via xrootd (from EOS) and the result looks much more reasonable:
Number of branches in the cache ...: 91
Cache Efficiency ..................: 0.924730
Cache Efficiency Rel...............: 1.000000
Secondary Efficiency ..............: 0.000000
Secondary Efficiency Rel ..........: 0.000000
Learn entries......................: 100
Cached Reading.....................: 100882823 bytes in 16 transactions
Reading............................: 0 bytes in 0 uncached transactions
Readahead..........................: 256000 bytes with overhead = 0 bytes
Average transaction................: 6305.176438 Kbytes
Number of blocks in current cache..: 3220, total size: 6424412
[00:56:55] Read 0.163272 GB in 21 transactions.
If I run it locally (ceph filesystem), I still get a lot of transactions. I think it might be a bit hard to reproduce then. But why should the underlying filesystem matter?
In any case the file is here and shared with you: /eos/user/t/tadej/shared/MultiLeptonAnalysis/test/v1/diboson_tree.root
The branches I read are here: branches.txt (2.3 KB) (not sure why less than 91)
I will try to prepare a minimal example to reproduce tomorrow (our framework is quite big).
Outline of the code:
Get the branches and put them in a vector (tree->GetBranch(name.c_str()))
The local file read (i.e. POSIX) does not have a “vector read” instruction (i.e. one system call to read multiple non-consecutive areas of the file), so sparse reads (likely the case if you read only a few branches) issue one actual read/transaction per basket. There is an effort made to coalesce reads that are “close by”, controlled by the setting called “Readahead” in the printout, and indeed it was activated quite a bit in your case.
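If the readahead buffer is the bottleneck, its size can be enlarged before opening the file; a sketch, assuming the static TFile::SetReadaheadSize setter (the 4 MB value is an arbitrary example):

```cpp
#include "TFile.h"

void enlargeReadahead() {
   // Default readahead buffer is 256000 bytes (as seen in the printout);
   // enlarging it may coalesce more nearby basket reads into one transaction.
   TFile::SetReadaheadSize(4 * 1024 * 1024);
   TFile f("diboson_tree.root");   // open the file after changing the setting
}
```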