Plotting from several TTrees of size ~GB

mbanderson · May 25, 2010, 9:24am

I’ve noticed that opening (via TFile()) a handful of large (~3 GB) TTrees at the same time greatly slows down my scripts - even before using them. So I’m trying to think of a faster way to write my code.

These trees contain events that have different cross sections, and also I’d like to apply different cuts to each.

I’m guessing that needing to apply different cuts to each TTree means I can’t use a TChain?

So then, is this the fastest strategy: open one file, make histograms from it, close file, open next file, make histograms … then in the end scale & add the histograms?

Will that make for the fastest code?

wlav · May 25, 2010, 6:33pm

Hi,

I can’t answer the ROOT/TTree-specific bit for you (and another sub-forum may be better for that), but as far as Python is concerned, the only thing I can think of is: don’t do “from ROOT import *”.

Cheers,
Wim

P.S. I’ll be offline until June 3rd due to family matters.

pcanal · May 25, 2010, 6:42pm

[quote]I’ve noticed that opening (via TFile()) a handful of large (~3 GB) TTrees at the same time greatly slows down my scripts - even before using them.[/quote]This is unusual. There may be a problem in the way you open those files and load the TTree. Another possible problem is that your TTree was a ‘memory resident’ TTree before being copied to the file, so that whenever you load the TTree you end up loading ALL the data into memory at once (however, this would be true only if the file was written with an older version of ROOT).

[quote]So then, is this the fastest strategy: open one file, make histograms from it, close file, open next file, make histograms … then in the end scale & add the histograms?[/quote]Anyway, this is indeed the most efficient way to go, unless you need to directly compare/use data from 2 of the TTrees.

Cheers,
Philippe.
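
For reference, a minimal PyROOT sketch of the one-file-at-a-time strategy discussed above. It is an illustration under stated assumptions, not code from the thread: the function name `histogram_samples`, the sample dictionary keys (`path`, `tree`, `cuts`, `weight`) and the `open_file` hook are all hypothetical, and the weight is assumed to already contain whatever cross-section normalisation is wanted.

```python
def histogram_samples(samples, varexp, binning, open_file=None):
    """Fill one histogram per sample, one file at a time, then scale and add.

    samples   -- list of dicts with hypothetical keys: path, tree, cuts, weight
    varexp    -- expression to draw, e.g. "et[0]"
    binning   -- binning string appended to the histogram name, e.g. "(50,0,100)"
    open_file -- hypothetical hook; defaults to ROOT.TFile.Open
    """
    if open_file is None:
        import ROOT  # lazy import: only needed when really reading ROOT files
        open_file = ROOT.TFile.Open

    total = None
    for index, sample in enumerate(samples):
        root_file = open_file(sample["path"], "READ")
        if not root_file or root_file.IsZombie():
            raise IOError("cannot open %s" % sample["path"])

        tree = root_file.Get(sample["tree"])
        if not tree:
            root_file.Close()
            raise KeyError("no tree %r in %s" % (sample["tree"], sample["path"]))

        # Make this file the current directory so the histogram created by
        # Draw() is registered there and can be fetched back by name.
        root_file.cd()
        name = "h_%d" % index
        tree.Draw("%s >> %s%s" % (varexp, name, binning), sample["cuts"], "goff")
        hist = root_file.Get(name)

        hist.SetDirectory(0)  # detach so the histogram survives Close()
        hist.Sumw2()          # keep per-bin errors meaningful after scaling
        hist.Scale(sample["weight"])

        if total is None:
            total = hist      # the first scaled histogram becomes the sum
        else:
            total.Add(hist)

        root_file.Close()     # done with this file before touching the next
    return total
```

Every plot needed from a given file would be drawn inside the same loop iteration, so that each file is opened and read only once.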

mbanderson · May 25, 2010, 8:59pm

Sorry, I should have been more exact.

The slow-down occurred after grabbing a couple GB-sized TTrees from different files this way:

file = TFile("file.root", "read")
ttree = TTree()
file.GetObject("treeName", ttree)

Does that try to load the whole tree into memory or do something that would slow the code down?

pcanal · May 26, 2010, 12:03pm

Hi,

This is the correct way of retrieving the TTree. Can you send me the ROOT file(s) that cause you troubles?

Philippe.

brun · May 26, 2010, 12:18pm

or post the result of
tree.Print()
file.ls()

Rene

mbanderson · May 27, 2010, 4:01pm

Sorry, I didn’t mean to draw so much attention. I was mostly trying to understand at what point a TTree is loaded from a hard drive into RAM.

After further investigation on trying to improve the speed of a script which was trying to Draw from several large TTrees into scaled & added histograms, I made an over-simplified example for myself and confirmed what Philippe was saying.

That is, code like this is slower:

for i in range(5):
    for tree in list_of_trees:
        tree.Draw("et[0] >> temp", all_cuts, "goff")

and code like this is faster:

for tree in list_of_trees:
    for i in range(5):
        tree.Draw("et[0] >> temp", all_cuts, "goff")

So, one should Draw all the plots they need from one TTree at a time, and avoid switching between TTrees, especially if they are GB in size. Does that sound right? And given enough RAM, both ways should be equally fast?

pcanal · May 27, 2010, 4:38pm

Hi,

There should be no difference whatsoever in the RAM used directly by ROOT in the two cases. If there is, then your TTree is unusual, and having access to the file or to the output of t->Print() would help us understand the situation.

However, there is a strong run-time difference between the two, which is due to the disk caching effect. When you do:

for tree in list_of_trees:
    for i in range(5):
        tree.Draw("et[0] >> temp", all_cuts, "goff")

the operating system can load each file into its own disk cache exactly once. Whereas when you do:

for i in range(5):
    for tree in list_of_trees:
        tree.Draw("et[0] >> temp", all_cuts, "goff")

you force the operating system to read each file 5 times from disk (unless the sum of the file sizes is much smaller than the size of RAM).

[quote]So, one should Draw all the plots they need from one TTree at a time, and avoid switching between TTrees, especially if they are GB in size. Does that sound right? And given enough RAM, both ways should be equally fast?[/quote]So yes, because of the disk caching effect. Note also that upgrading to v5.26 and enabling the TTreeCache will reduce the time it takes to load each file from disk.

Cheers,
Philippe.
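
The disk caching argument above can be illustrated with a toy model in plain Python. This is only a sketch under loud assumptions: it treats the operating system's cache as a whole-file LRU cache (real caches work on pages, and Draw may read only some branches), and the function name `count_disk_reads` and the sizes used are made up for the illustration.

```python
from collections import OrderedDict


def count_disk_reads(access_order, file_sizes, cache_size):
    """Count reads from disk for a sequence of file accesses.

    Toy model: the cache holds whole files and evicts the least
    recently used one when a new file does not fit.
    """
    cache = OrderedDict()  # file name -> size, least recently used first
    used = 0
    reads = 0
    for name in access_order:
        if name in cache:
            cache.move_to_end(name)  # cache hit: no disk access needed
            continue
        reads += 1                   # cache miss: file comes from disk
        size = file_sizes[name]
        while cache and used + size > cache_size:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= cache_size:
            cache[name] = size
            used += size
    return reads


# Three 3 GB files, each used for 5 Draw calls (sizes in GB, assumed values).
sizes = {"a": 3, "b": 3, "c": 3}
tree_outer = [name for name in sizes for _ in range(5)]  # a a a a a b b ...
pass_outer = [name for _ in range(5) for name in sizes]  # a b c a b c ...

print(count_disk_reads(tree_outer, sizes, cache_size=4))   # 3
print(count_disk_reads(pass_outer, sizes, cache_size=4))   # 15
print(count_disk_reads(pass_outer, sizes, cache_size=10))  # 3
```

With a cache smaller than the combined file size, looping over trees on the outside reads each file once, while looping over passes on the outside reads every file on every pass. Once everything fits in the cache, the two orders cost the same number of reads.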

mbanderson · June 3, 2010, 8:48pm

Attached here is the Print() of one of the TTrees I’m using to make plots from. What factors should one look for to see if a TTree is unusual?

I haven’t heard of the TTreeCache feature. I’ll have to look that up, thanks.
mpa_tree.txt (70.2 KB)
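
For anyone else looking it up, a minimal sketch of enabling the TTreeCache from PyROOT, assuming ROOT v5.26 or later. `SetCacheSize` and `AddBranchToCache` are TTree methods; the helper name `enable_tree_cache`, the 30 MB default and the choice to cache all branches are assumptions made for illustration, not recommendations from this thread.

```python
def enable_tree_cache(tree, cache_bytes=30 * 1024 * 1024, branches=("*",)):
    """Turn on the TTreeCache for `tree` and register branches with it.

    cache_bytes -- cache size in bytes (30 MB here is an arbitrary assumption)
    branches    -- branch name patterns to cache; "*" means every branch
    """
    tree.SetCacheSize(cache_bytes)
    for pattern in branches:
        # Second argument True also registers the sub-branches.
        tree.AddBranchToCache(pattern, True)
    return tree
```

A typical call would be `enable_tree_cache(tree)` right after retrieving the tree from its file and before the first Draw. If only a few branches are used in the plots and cuts, passing just those names (for example `branches=("et",)`) keeps the cache focused on the data that is actually read.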

pcanal · June 3, 2010, 9:47pm

Hi,

Your file looks normal, but since it is 5 GB you are certainly reaching the limit of the disk cache, and thus you will gain by doing all actions on a single file at once and then moving on to the next file.

Cheers,
Philippe.