I have a list of ROOT files, each of which contains 3 TTrees. I’m currently using 3 TChain objects to chain the TTrees. When I do so, the files are read 3 times in a row. Is there a better way to extract the TTrees, reading each file only once?
TChain *ch1 = new TChain("mytree1");  // chains every "mytree1" across the files
ch1->Add("./*.root");
TChain *ch2 = new TChain("mytree2");  // same file list, different tree
ch2->Add("./*.root");
TChain *ch3 = new TChain("mytree3");
ch3->Add("./*.root");
Yes, it works, but I’m afraid it is not what I need. mytree1, mytree2 and mytree3 do not have the same nature and should not be chained together. My question was about having 3 distinct TChain objects associated with a single list of files. From your answer, I understand that is not possible.
Do you know any other way to optimize this kind of data access?
Hi,
consider that TTrees are not read from disk in bulk. Each time you read an entry, that entry and only that entry is read from disk (and, ideally, only the branches you need). I’m simplifying: ROOT does some smart things like pre-fetching and reading clusters of entries at a time, but those do not concern your case.
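To illustrate the “only the branches you need” part, here is a minimal sketch (the chain name follows your first post; the branch name "energy" and its type are hypothetical, substitute your own). Disabling unused branches reduces the bytes actually fetched per entry:

```cpp
// Sketch, assuming a branch "energy" of type float exists in mytree1.
TChain ch("mytree1");
ch.Add("./*.root");
ch.SetBranchStatus("*", 0);        // disable all branches
ch.SetBranchStatus("energy", 1);   // re-enable only what you use
float energy = 0;
ch.SetBranchAddress("energy", &energy);
for (Long64_t i = 0; i < ch.GetEntries(); ++i) {
   ch.GetEntry(i);  // reads only the enabled branch for this entry
}
```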
So in your situation you have to loop over your three files three times.
What is the bottleneck exactly, and how did you measure it?
Note that your operating system should put the files into its cache the first time you read them. On Linux the free command gives you information about system cache usage; on my workstation it’s 13 GB: if your files are smaller than that, they will be hot in cache the second time you read them.
EDIT: the real question is: are you sure there is duplicated work in your workflow, and what exactly is that duplicated work? Opening the same file via three TFile objects is not the same as reading the whole file three times!
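To make that distinction concrete, here is a sketch (the file name "myfile.root" is a placeholder): each physical file is opened once, and all three trees are retrieved from the same TFile. Opening the file reads only the header and tree metadata; the bulk of the entry data is read later, on demand.

```cpp
// Sketch: one TFile per physical file, three trees obtained from it.
TFile *f = TFile::Open("myfile.root");  // placeholder file name
TTree *t1 = nullptr, *t2 = nullptr, *t3 = nullptr;
f->GetObject("mytree1", t1);
f->GetObject("mytree2", t2);
f->GetObject("mytree3", t3);
```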
I have to work with a big data set. I’m currently studying the best way to optimize the data access. My question was only based on a “feeling” that defining the same list of files 3 times is probably not optimal.
My ROOT files are time-sorted. When reading the data set, they are loaded sequentially. Do you think the way I access the data in the 3 TTrees (see my first post) is optimal in that case?
Hi,
defining the same list of files three times, per se, is not a problem.
I/O performance is a function of the actual disk access pattern, i.e. of how you use those lists.
The best possible pattern (making the simplifying assumption that you eventually read all bytes of all files) is starting from the beginning of the first file and reading bytes in order, without ever reading the same byte twice.
The second best possible pattern takes advantage of your file system cache:
you read no more than your cache size, then re-read it as much as you need while it’s hot in cache, then you switch to reading and re-reading the next chunk, etc.
ROOT helps you by not making too many system calls (pre-fetching, clustered reading and unzipping…), but as far as I know that works per tree.
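That per-tree machinery can be tuned explicitly via the tree cache. A sketch, again using the chain name from the first post; the 30 MB cache size is an illustrative value, not a recommendation:

```cpp
// Sketch: enable ROOT's read cache on a chain, so baskets of the
// branches you use are pre-fetched in fewer, larger reads.
TChain ch("mytree1");
ch.Add("./*.root");
ch.SetCacheSize(30 * 1024 * 1024);  // 30 MB read cache (illustrative)
ch.AddBranchToCache("*", true);     // cache all branches
```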
Also, make sure you actually have a bottleneck, and that the bottleneck is where you think it is, before optimizing too much.