Data Analysis Strategy using multiple TTrees/Multiple files

Hello Rooters,

I have a problem I'm working on where I need to analyse a large amount of data. The analysis will require that the data be organized by time and by geographic location (among other factors). I am considering two strategies and was hoping to get some guidance from the ROOT community as to how I might best proceed. The first option is breaking the dataset into multiple TFiles/TTrees, where each file contains data for a particular interval of time and a particular location. This will likely result in many small files. The second option is to put all the data into a single file/tree. My question is: what are the trade-offs between having the data pre-filtered and organized into chunks (multiple files) versus using TSelector and/or cuts in the tree-searching mechanisms within a single file? I expect there will be a significant performance penalty in accessing multiple files and so forth. On the other hand, I assume there is a considerable cost in filtering values from a large tree.

Thank you for any advice.

If the number of small files is manageable, that approach is preferred in terms of speed.

You can always chain small files together with TChain if need be.
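To illustrate, here is a minimal sketch of chaining per-interval files back into one logical tree. The tree name "events" and the file names are hypothetical; substitute whatever naming scheme your files actually use.

```cpp
#include "TChain.h"

void chain_example()
{
   // All files must contain a TTree with the same name and structure.
   TChain chain("events");                  // hypothetical tree name

   chain.Add("data_2004_siteA.root");       // hypothetical file names
   chain.Add("data_2004_siteB.root");
   chain.Add("data_2005_*.root");           // wildcards are accepted

   // The chain can now be used like a single TTree, e.g. with cuts:
   chain.Draw("energy", "location == 3 && time > 1000");
}
```

Note that a TChain only opens each file as its entries are actually needed, so if your cut selects files by name (one file per time/location chunk), you can often Add() only the relevant files and skip the rest entirely.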

Merging all data into one big file is certainly simpler to manage, but then you need a database which would allow you to jump to the correct piece of data inside that file based on event number, OR have one branch "time" and another "location", which would be the only ones you pre-read with selectors.
Once the selector returns true, you read the whole event.
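A sketch of that two-stage read, using TTree::SetBranchStatus to enable only the cheap "time" and "location" branches during the scan. The file name, tree name, branch names, and cut values are all hypothetical placeholders for your own schema.

```cpp
#include "TFile.h"
#include "TTree.h"

void scan_example()
{
   TFile f("alldata.root");                      // hypothetical single big file
   TTree *tree = (TTree*)f.Get("events");        // hypothetical tree name

   Double_t time;
   Int_t    location;
   tree->SetBranchStatus("*", 0);                // disable all branches
   tree->SetBranchStatus("time", 1);             // enable only the cheap ones
   tree->SetBranchStatus("location", 1);
   tree->SetBranchAddress("time", &time);
   tree->SetBranchAddress("location", &location);

   Long64_t n = tree->GetEntries();
   for (Long64_t i = 0; i < n; ++i) {
      tree->GetEntry(i);                         // reads only time and location
      if (time > 1000 && location == 3) {        // hypothetical selection
         tree->SetBranchStatus("*", 1);          // re-enable everything
         tree->GetEntry(i);                      // now read the full event
         // ... process the selected event here ...
         tree->SetBranchStatus("*", 0);          // back to cheap scan mode
         tree->SetBranchStatus("time", 1);
         tree->SetBranchStatus("location", 1);
      }
   }
}
```

The same idea underlies TSelector-based processing: the selection branches are read for every entry, but the expensive payload branches are only decompressed for entries that pass the cut.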

Remember, though, that if your small files are intermingled inside the big one, then ROOT, the OS, and possibly hardware optimizations such as read-ahead and buffered reads will make your program waste time reading unnecessary events.