I have a list of ROOT files, each of which contains 3 TTrees. I’m currently using 3 TChain objects to chain the TTrees. When I do so, the files are read 3 times in a row. Is there a better way to extract the TTrees, reading each file only once?
TChain *ch1 = new TChain("mytree1");  // chains every "mytree1" across the files
ch1->Add("./*.root");
TChain *ch2 = new TChain("mytree2");  // same file list, different tree
ch2->Add("./*.root");
TChain *ch3 = new TChain("mytree3");
ch3->Add("./*.root");
Yes, it works, but I’m afraid it is not what I need. mytree1, mytree2 and mytree3 do not have the same nature and should not be chained together. My question was about having 3 distinct TChain objects associated with a single list of files. From your answer, I understand that is not possible.
Do you know any other way to optimize this kind of data access?
Hi,
consider that TTrees are not read from disk in bulk. Each time you read an entry, that entry and only that entry is read from disk (and, ideally, only the branches you need). I’m simplifying: ROOT does some smart things like pre-fetching and reading clusters of entries at a time, but those do not concern your case.
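To illustrate the “only the branches you need” part, here is a minimal sketch (the chain name follows your first post; the branch name "energy" and its type are hypothetical, substitute your own). Disabling unused branches reduces the bytes actually fetched per entry:

```cpp
// Sketch, assuming a branch "energy" of type float exists in mytree1.
TChain ch("mytree1");
ch.Add("./*.root");
ch.SetBranchStatus("*", 0);        // disable all branches
ch.SetBranchStatus("energy", 1);   // re-enable only what you use
float energy = 0;
ch.SetBranchAddress("energy", &energy);
for (Long64_t i = 0; i < ch.GetEntries(); ++i) {
   ch.GetEntry(i);  // reads only the enabled branch for this entry
}
```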
So in your situation you have to loop over your three files three times.
What is the bottleneck exactly, and how did you measure it?
Note that your operating system should put the files into its cache the first time you read them. On Linux the free command gives you information about system cache usage; on my workstation it’s 13 GB: if your files are smaller than that, they will be hot in cache the second time you read them.
EDIT: the real question is: are you sure there is duplicated work in your workflow, and what exactly is that duplicated work? Opening the same file via three TFile objects is not the same as reading the whole file three times!
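To make that distinction concrete, here is a sketch (the file name "myfile.root" is a placeholder): each physical file is opened once, and all three trees are retrieved from the same TFile. Opening the file reads only the header and tree metadata; the bulk of the entry data is read later, on demand.

```cpp
// Sketch: one TFile per physical file, three trees obtained from it.
TFile *f = TFile::Open("myfile.root");  // placeholder file name
TTree *t1 = nullptr, *t2 = nullptr, *t3 = nullptr;
f->GetObject("mytree1", t1);
f->GetObject("mytree2", t2);
f->GetObject("mytree3", t3);
```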
I have to work with a big data set. I’m currently studying the best way to optimize the data access. My question was only based on a “feeling” that defining the same list of files 3 times is probably not optimal.
My ROOT files are time-sorted. When reading the data set, they are loaded sequentially. Do you think the way I access the data in the 3 TTrees (see my first post) is optimal in that case?
Hi,
defining the same list of files three times, per se, is not a problem.
I/O performance is a function of the actual disk access pattern, i.e. of how you use those lists.
The best possible pattern (making the simplifying assumption that you eventually read all bytes of all files) is starting from the beginning of the first file and reading bytes in order, without ever reading the same byte twice.
The second best possible pattern takes advantage of your file system cache:
you read no more than your cache size, then re-read it as much as you need while it’s hot in cache, then you switch to reading and re-reading the next chunk, etc.
ROOT helps you by not making too many system calls (pre-fetching, clustered reading and unzipping…), but as far as I know that works per tree.
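That per-tree machinery can be tuned explicitly via the tree cache. A sketch, again using the chain name from the first post; the 30 MB cache size is an illustrative value, not a recommendation:

```cpp
// Sketch: enable ROOT's read cache on a chain, so baskets of the
// branches you use are pre-fetched in fewer, larger reads.
TChain ch("mytree1");
ch.Add("./*.root");
ch.SetCacheSize(30 * 1024 * 1024);  // 30 MB read cache (illustrative)
ch.AddBranchToCache("*", true);     // cache all branches
```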
Also, make sure you actually have a bottleneck, and that the bottleneck is where you think it is, before optimizing too much.