Fastest way of reading all events of a TTree

John · April 19, 2004, 3:41pm

Rene,

You are right that there are many files I have to deal with. The way we are dealing with the situation is to read one tag file at a time. Let me know if this is horribly inefficient. One major reason for doing this is that the tag files will be generated one at a time, and we are planning to run a reader that is capable of keeping up with the generation of the tag files. My understanding has been that this does not cause major efficiency concerns.

Regarding the number of tag files, here is a brief note. A tag file is not a “main” data file in this experiment; it is a file containing high-level “summary”/“physics” attributes about the collision events. There is basically one tag file per “main” data file. There are on the order of 100,000 tag files. The experiment probably produces no more than 100,000 tag files per year.

Given that we can deal with tag files one at a time, the simplest thing to do is to write a program that reads one tag file at a time. A large experiment is producing these tag files, so my only option is to find the best way to read them. There is no way for me to tell them to write the files in a particular way or to restrict the content in any way.

To recap, a large experiment is going to generate a large number of tag files (each of which is relatively small). I plan to read the content of the files and write it out to another series of files, where each output file contains one variable. The tag files may have different numbers of leaves (variables), and the number of events in each tag file is not fixed.

Hopefully, my description is clear enough for you to formulate your own solution to the problem. If you thought of something different from what I have done in the test code, please let me know. I would be very happy to brag that I have got the best option from the best source.

John · April 20, 2004, 4:48pm

Rene,

Since ROOT files are actually “column-oriented”, is there any chance that you will provide “column-oriented” access functions, for example a function to read all values of one leaf? Such a function would make my job of reading the files a lot easier. Wouldn’t you agree?

John

[quote=“Rene Brun”]John,

ROOT Tree files are “column-oriented”, not “row-oriented” !
At the Tree creation time, you can optimize the Tree/Branch storage
in view of future queries:

- by allocating large buffer sizes for the branch(es) the most used in queries
- by storing these branches directly to separate files (see TBranch::SetFile)

It is totally unrealistic to assume that you can fit one branch in memory.
The TAG files might be small, but you will have zillions of these files.
You need an automatic disk overflow mechanism in writing and reading.

Rene[/quote]

brun · April 20, 2004, 5:17pm

Simply call TBranch::GetEntry instead of TTree::GetEntry

Rene

John · April 20, 2004, 6:59pm

Rene, thanks for the information.

TBranch::GetEntry retrieves all leaves of one branch. I wanted something at a higher level. Let me first describe my use of the terms horizontal and vertical partition. It should make things a little clearer, I hope.

Instead of taking the object-oriented view of the data, let us go back to the relational table view. This view is OK for tags most of the time, provided the tags of the same design are placed in one table. In this case, a row of the table corresponds to one event, and a column of the table corresponds to one atomic value in a row (the only exception is a null-terminated string). Using this terminology, a leaf with 15 integer values would be translated into 15 columns. When I say the table is partitioned vertically, I mean that all rows of one column are stored together in a file. In my program, I need to read the content of a tag file and place all data in vertical partitions so that they can be easily exported to a system that works on vertical partitions.

Based on my understanding of the ROOT reference guide, the branches can pretty much be accessed independently. Therefore, it is efficient to read one branch at a time. Within one branch, my only option appears to be to read one entry at a time with GetEntry and GetValuePointer. The column-oriented reading function I was asking about should basically perform the following loop and return the values read. (It should ask the user to provide a pointer to a piece of memory for the output; this will automatically let the user know NOT to do this on large files with millions of events.)

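// branch is a TBranch*, leaf is the TLeaf* it holds
for (Long64_t i = 0; i < branch->GetEntries(); ++i) {
    branch->GetEntry(i);                     // load entry i of this branch only
    void *values = leaf->GetValuePointer();  // values of the leaf for entry i
    // copy the values of entry i from 'values' into the output memory
}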
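To make the "15 values become 15 columns" idea concrete, here is a minimal, ROOT-free sketch of the kind of function I have in mind. Everything in it is an assumption made for illustration: `LeafModel` is a hypothetical stand-in for a leaf whose entries all hold the same number of values, and `readLeafAsColumns` is a made-up name, not a ROOT function. In real code the inner copy would take its values from GetEntry and GetValuePointer as in the loop above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one leaf of a tag file: every entry (event)
// holds the same number of values, `len`.
struct LeafModel {
    std::size_t len;                        // values per entry
    std::vector<std::vector<int>> entries;  // entries[i] = values of event i
};

// Copy the leaf into caller-provided, column-major storage:
// out[c * nEntries + i] receives value c of entry i, so each of the
// `len` columns ends up contiguous in memory.
// Returns the number of values written.
std::size_t readLeafAsColumns(const LeafModel& leaf, int* out,
                              std::size_t capacity) {
    const std::size_t nEntries = leaf.entries.size();
    // The caller owns the memory, so the caller must size it.
    assert(capacity >= nEntries * leaf.len && "output buffer is too small");
    for (std::size_t i = 0; i < nEntries; ++i) {  // one entry per pass
        assert(leaf.entries[i].size() == leaf.len);
        for (std::size_t c = 0; c < leaf.len; ++c) {
            out[c * nEntries + i] = leaf.entries[i][c];
        }
    }
    return nEntries * leaf.len;
}
```

For a leaf with 2 values per entry and the three entries {1, 2}, {3, 4}, {5, 6}, the buffer ends up holding column 0 followed by column 1. Because the caller has to allocate the whole buffer up front, the cost of calling this on a very large file is visible before the call is made.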

In the sample code that I have uploaded, there are two read functions, readEvent and readBranch. The first one attempts to read one event at a time. After reading the properties of the leaves, it builds a set of arrays to hold one event. In this case, the branch addresses are set only once for each file. As far as I can tell, the code generated with MakeClass does the same thing. After it has read one event, it copies the content of that event to the vertically partitioned structure for future output.

The second read function, readBranch, reads one branch at a time. For each branch (actually a leaf), it reads the values using a combination of GetEntry and GetValuePointer. This function never sets branch addresses.

Given that I want the code to handle the different types of tag files produced by the experiment, and that I cannot expect to have all tag files ready on disk before running my program, these two options are the best I could come up with. Since they appear to perform the same after increasing the memory buffer size, I have decided to use something close to readBranch. If you have any other options, please let me know.
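For readers without the uploaded code, the difference between the two strategies can be sketched without ROOT. This is only a model under stated assumptions: `TagFile`, `Columns`, `appendByEvent` and `appendByBranch` are hypothetical names, every variable is a single double per event, and a tag file is represented as an in-memory map from branch name to values. It is not the uploaded readEvent/readBranch code, only the same two traversal orders.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one tag file: branch name -> one value per event.
using TagFile = std::map<std::string, std::vector<double>>;
// Vertical partitions: variable name -> all values appended so far.
using Columns = std::map<std::string, std::vector<double>>;

// Event-at-a-time order: stage one complete event, then scatter its
// values into the per-variable columns.
void appendByEvent(const TagFile& file, Columns& out) {
    const std::size_t nEvents = file.empty() ? 0 : file.begin()->second.size();
    for (const auto& branch : file) out[branch.first];  // create every column
    for (std::size_t i = 0; i < nEvents; ++i) {
        std::vector<double> event;  // staging area for event i
        for (const auto& branch : file) {
            assert(branch.second.size() == nEvents);  // same length everywhere
            event.push_back(branch.second[i]);
        }
        std::size_t k = 0;
        for (const auto& branch : file) {
            out[branch.first].push_back(event[k++]);
        }
    }
}

// Branch-at-a-time order: copy each branch straight into its column.
void appendByBranch(const TagFile& file, Columns& out) {
    for (const auto& branch : file) {
        std::vector<double>& column = out[branch.first];
        column.insert(column.end(), branch.second.begin(), branch.second.end());
    }
}
```

Both functions leave the same columns behind; only the traversal order differs, and the branch-at-a-time version needs no per-event staging area. Note that when tag files carry different sets of variables, the columns grow to different lengths, so keeping values aligned by event across files is something this sketch deliberately does not handle.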