File caching and reading large TTrees

eoltman · September 27, 2007, 3:14pm

I am using ROOT 5.06/00 on Windows XP. One type of TTree I use often has many entries (>10**8) and is made up of < 20 branches, each with a primitive data type (UShort_t, Float_t, Int_t, etc.). Sometimes it is desirable to sample a few branches from these large TTrees - e.g. read every 200th entry for 3-4 branches and nothing else. What I’ve observed for a 1 GB file is that the first time I sample, it takes about 60 seconds of elapsed time, but only 13 seconds of CPU time. If I run the sampling program a second time, it takes about 13 seconds (down from 60), the same as the CPU time. Obviously, parts of the file are getting cached during the first read and the second read benefits from that. This is fine if I plan to spin through the file multiple times (assuming the file cache is large enough), but sometimes I only want to read it once and move on to the next file.

What I found is that if I first read the ROOT file (fread the file’s bytes with a 10 kB buffer and don’t do anything with them) and then run the sampling code described above, I get a significant improvement: the first read (to populate the cache - I don’t look at the data) takes about 22 seconds while the second read still takes only 13 seconds, for a total of 35 seconds of actual time, as compared with 60 seconds.
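For readers who want to try the pre-read described above, here is a minimal sketch in standard C++. It is only an illustration, not the original test program: the function name `prewarm_file` is made up for this example, and the 10 kB default simply mirrors the buffer size mentioned in the post. It reads the file sequentially and throws the bytes away, relying on the operating system to keep them in its file cache.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper (not part of ROOT): read a file front to back and
// discard the bytes, so that the OS file cache may hold its contents
// before a later, sparser read. Returns the number of bytes read, or -1
// if the file cannot be opened.
long long prewarm_file(const char* path, std::size_t buffer_bytes = 10 * 1024) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;

    std::vector<char> buffer(buffer_bytes);
    long long total = 0;
    std::size_t n = 0;
    // fread returns 0 at end of file (or on error), which ends the loop.
    while ((n = std::fread(buffer.data(), 1, buffer.size(), f)) > 0) {
        total += static_cast<long long>(n);
    }
    std::fclose(f);
    return total;
}
```

Calling `prewarm_file("mytree.root")` (the file name is a placeholder) just before opening the file in ROOT would reproduce the two-step approach. Whether it helps depends on the file fitting in the OS cache and on how the OS schedules sequential versus scattered reads.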

The file I have been testing with is 965 MB (I have 2 copies that I switch between to avoid cache effects between tests). I doubt the entire ROOT file fits in cache (plus I have ROOT files several times larger than this), so there may be an opportunity to optimize, e.g. periodically freshen up the cache with buffered “pre-reads” during normal ROOT access of the TTree.

My desktop machine is a Xeon 3GHz with 2GB RAM with an SATA hard drive. The files are local.

Comments? I know I’m a bit out of date with my ROOT version… should I expect similar issues with later versions of ROOT? What about 64-bit - presumably Vista can have a larger file cache?

Ed Oltman

brun · September 29, 2007, 8:14am

When reading only a small percentage of the data (e.g. 1%, and your buffer size is 10K), all basket buffers still have to be read and unzipped (when using fread you do not unzip). With this usage pattern, it is in your interest to decrease the branch basket size and use a version of ROOT with the TTreeCache (5.14 or newer).
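The point about basket size can be illustrated with a little arithmetic. The sketch below is not ROOT code; the function names are invented for this example, and it assumes fixed-size entries and ignores compression. It counts how many baskets of a branch contain at least one sampled entry when reading every `stride`-th entry.

```cpp
#include <cstdint>

// Total number of baskets needed to hold n_entries (rounded up).
std::int64_t total_baskets(std::int64_t n_entries,
                           std::int64_t entries_per_basket) {
    if (n_entries <= 0 || entries_per_basket <= 0) return 0;
    return (n_entries + entries_per_basket - 1) / entries_per_basket;
}

// Number of baskets that contain at least one sampled entry when reading
// entries 0, stride, 2*stride, ... Sampled entries are visited in
// increasing order, so a new basket is counted whenever the index changes.
std::int64_t baskets_touched(std::int64_t n_entries,
                             std::int64_t entries_per_basket,
                             std::int64_t stride) {
    if (n_entries <= 0 || entries_per_basket <= 0 || stride <= 0) return 0;
    std::int64_t touched = 0;
    std::int64_t last_basket = -1;
    for (std::int64_t e = 0; e < n_entries; e += stride) {
        const std::int64_t basket = e / entries_per_basket;
        if (basket != last_basket) {
            ++touched;
            last_basket = basket;
        }
    }
    return touched;
}
```

As an assumed example, take 1,000,000 entries of a 4-byte Float_t branch. With a 32000-byte basket (8000 entries per basket) there are 125 baskets, and sampling every 200th entry touches all 125 of them. With a 400-byte basket (100 entries per basket) there are 10000 baskets, and the same sampling touches 5000, i.e. half. Smaller baskets carry their own overhead, so this only shows the direction of the effect.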

Rene

eoltman · December 6, 2007, 10:19pm

Rene,
Do you expect any performance gain to come from TTreeCache if reading trees directly over a LAN? We have a CIFS share running Windows Server 2003. Or is it necessary to have an xrootd server/platform to realize any gain? Thanks
Ed

brun · December 7, 2007, 6:24am

TTreeCache always helps, even on a LAN (when you have concurrent access from different applications to the same disk).
When combined with the most recent version of xrootd, it will help even more, because xrootd:
- can read ahead the blocks communicated by the TTreeCache
- can make asynchronous transfers
- can make multiple transfers in parallel (mainly useful on high-latency networks)

Rene