Faster version of GetEntry()?

aregjan · January 4, 2011, 10:28pm

Hi,

I have a TTree object with 15mln+ entries. I have a script which loops through consecutive entries in the tree, and does stuff. To do this, I do something like this:

void script(TTree *tree){
 const Long64_t n=tree->GetEntries(); // read the entry count once, not on every iteration
 for(Long64_t i=0;i<n;i++){
 tree->GetEntry(i);
 DoStuff();
 }
}

Trouble is, after about the 1mln-th entry things REALLY slow down. I am guessing that on every call GetEntry starts at position zero and seeks to position i in the file… which, sure enough, becomes slower with increasing i.

So my question is – is there a faster equivalent to this? I would like something along these lines (in pseudocode):

int i=0;
tree->GetEntry(0);
while(i<tree->GetEntries()){
 tree->MoveToNextEntryFAST(); //supposedly a very quick operation
 DoStuff();
 ++i;
}

brun · January 5, 2011, 7:35am

The time to access a Tree entry is independent of the entry number.
You are probably hitting a memory leak problem. We need more info from your side.

Rene
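
One way a lookup can cost about the same for entry 10 and entry 15,000,000 is an index plus a binary search. The sketch below is a toy model, not ROOT's code: it assumes (hypothetically) that a branch keeps a sorted table holding the first entry number of every basket. The name findBasket and the table layout are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: firstEntry[k] is the number of the first entry stored in basket k,
// sorted ascending with firstEntry[0] == 0. Returns the index of the basket
// that holds `entry` (entry must be >= 0). The number of comparisons grows
// with log2(number of baskets), not with the value of `entry`.
std::size_t findBasket(const std::vector<std::int64_t>& firstEntry, std::int64_t entry) {
    // upper_bound finds the first basket that starts after `entry`;
    // the basket just before it is the one that contains `entry`.
    const auto it = std::upper_bound(firstEntry.begin(), firstEntry.end(), entry);
    return static_cast<std::size_t>(it - firstEntry.begin()) - 1;
}
```

With a table of about a million baskets this takes roughly 20 comparisons for any entry, so in this model the position of the entry in the file does not matter.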

aregjan · January 5, 2011, 1:10pm

Rene, here are the two test cases involving TTree::GetEntry() that I run directly from ROOT’s command line:

for(int i=0;i<1e+6;i++) { tree->GetEntry(i); if(i%100==0) cout<<i<<"\r";}

and

int events=tree->GetEntries();     
for(int i=(events-1e+6);i<events;i++) { tree->GetEntry(i); if(i%100==0) cout<<i<<"\r";}

The first one took 30 sec to complete.
The second one – 250 sec.

I don’t see where a memory leak could cause this.

p.s. events~=15e+6.

tpochep · January 5, 2011, 2:23pm

What’s inside your tree object? Can you reproduce this behavior if you have, say, a tree with integers?

aregjan · January 5, 2011, 2:44pm

For the content of ptree, see below:

root [3] ptree->Print()
******************************************************************************
*Tree    :ptree     : event data                                             *
*Entries : 15414423 : Total =     32715547411 bytes  File  Size = 2249737335 *
*        :          : Tree compression factor =  14.58                       *
******************************************************************************
*Br    0 :pulse     : adcoffset[1024]/s                                      *
*Entries :15414423 : Total  Size=31664308115 bytes  File Size  = 1550563390 *
*Baskets :  1027629 : Basket Size=      32000 bytes  Compression=  20.41     *
*............................................................................*
*Br    1 :event     : samples/i:rate/f:time/l:cn/i:pattern/i:counter/i:      *
*         | min_i/i:min/i:E/F:Ec/F:Ecp/f:Ped/F:PSD/F:PSDsm/F:Chisquare/F:id/i*
*Entries : 15414423 : Total  Size= 1051238902 bytes  File Size  =  692073074 *
*Baskets :    32867 : Basket Size=      32000 bytes  Compression=   1.52     *
*............................................................................*
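
The basket counts in this printout can be cross-checked with a little arithmetic. The sketch below is only an estimate: it assumes the leaf type code 's' means a 16-bit unsigned integer and it ignores any per-basket header overhead. The constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Numbers copied from the ptree->Print() output above.
constexpr std::int64_t kEntries      = 15414423;
constexpr std::int64_t kPulseBaskets = 1027629;
constexpr std::int64_t kEventBaskets = 32867;
constexpr std::int64_t kBasketBytes  = 32000;

// Assumption: 's' is a 16-bit unsigned integer, so adcoffset[1024]/s
// occupies 1024 * 2 bytes per entry before compression.
constexpr std::int64_t kPulseEntryBytes = 1024 * 2;

// Whole entries that fit into one basket buffer (header overhead ignored).
constexpr std::int64_t entriesPerBasket(std::int64_t basketBytes, std::int64_t entryBytes) {
    return basketBytes / entryBytes;
}

// Baskets needed to hold `entries` entries at `perBasket` entries each, rounded up.
constexpr std::int64_t basketsNeeded(std::int64_t entries, std::int64_t perBasket) {
    return (entries + perBasket - 1) / perBasket;
}
```

Here entriesPerBasket(kBasketBytes, kPulseEntryBytes) gives 15, and basketsNeeded(kEntries, 15) gives 1027629, which matches the basket count printed for the pulse branch. Taking 469 entries per basket for the event branch (the rounded ratio 15414423 / 32867) reproduces its 32867 baskets. Under this estimate, any window of 10^6 consecutive entries covers about 66667 pulse baskets, whether it starts at entry 0 or near the end of the tree.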

tpochep · January 5, 2011, 3:00pm

In my case I have a simple tree with integers (arrays in fact); the file size is 1.7 GB, with 15000000 entries. It’s two times faster to read the first 10^6 entries than the last 10^6. I’m not sure how this is implemented in TTree; it is probably due to the large file size.

aregjan · January 5, 2011, 4:10pm

My .root file is ~2.2GB…however in my case the difference between the first and last 1mln events
is dramatic – see above.

tpochep · January 5, 2011, 5:04pm

[quote=“aregjan”]My .root file is ~2.2GB…however in my case the difference between the first and last 1mln events
is dramatic – see above.[/quote]

I’ve added more branches, and now the file size is >= 3GB. It’s a bit slower now, but the difference between the first 10^6 and the last 10^6 has even decreased: it’s ~30% now.

Can you show minimal code that reproduces your timings, and give your machine specs?

aregjan · January 5, 2011, 5:41pm

Ok, I reran a smaller version of the code; the results are below:

root [0] TFile *_file0 = TFile::Open("/home/aregjan/data_vx511/root_pnpf1/run646647.root")
root [1] int events=ptree->GetEntries();  
root [2] cout<<events<<endl; 
15414423
root [3] system("date");for(int i=0;i<1e+6;i++) { ptree->GetEntry(i);}; system("date")              
Wed Jan  5 12:24:31 EST 2011
Wed Jan  5 12:25:28 EST 2011
(const int)0
root [4] system("date");for(int i=(events-1e+6);i<events;i++) { ptree->GetEntry(i);}; system("date")
Wed Jan  5 12:26:02 EST 2011
Wed Jan  5 12:36:01 EST 2011

As you see, a difference of ~10x
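
The two elapsed times can be read off the `date` stamps in the session above. A minimal sketch of that arithmetic, with made-up helper names and assuming both stamps fall on the same day:

```cpp
#include <string>

// Convert an "HH:MM:SS" time-of-day string to seconds since midnight.
int secondsOfDay(const std::string& hms) {
    const int h = std::stoi(hms.substr(0, 2));
    const int m = std::stoi(hms.substr(3, 2));
    const int s = std::stoi(hms.substr(6, 2));
    return h * 3600 + m * 60 + s;
}

// Elapsed seconds between two timestamps taken on the same day.
int elapsedSeconds(const std::string& start, const std::string& stop) {
    return secondsOfDay(stop) - secondsOfDay(start);
}
```

elapsedSeconds("12:24:31", "12:25:28") is 57 s for the first 10^6 entries and elapsedSeconds("12:26:02", "12:36:01") is 599 s for the last 10^6, a ratio of about 10.5.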

Machine info:
Core 2 duo 2.40GHz, Dell E4300 notebook
Bus speed: 1066Mhz
4GB of memory
Running 64bit Ubuntu 10.04

tpochep · January 5, 2011, 7:13pm

Hmmm, a very nice and useless “dump”. I can produce something like this by hand:

Does it prove anything?

If your code is top secret, try these two macros and check whether you still see the same difference
(fix them if needed; I just wrote them here without running them):

//tree_fill.C
void tree_fill()
{
    TFile f("tree.root", "recreate");
    TTree * t = new TTree("aaa", "aaa");
    int arr[1000] = {};
    t->Branch("arr", arr, "arr[1000]/I");
    for(int i = 0; i < 15000000; ++i)
       t->Fill();

    t->Write();
}

//tree_read.C

void tree_read()
{
    TFile f("tree.root");
    TTree * t = (TTree*)f.Get("aaa");
    if(!t)
    {
        std::cout<<"FUUUUUUUUUUUUUU!\n";
        return;
    }

    int arr[1000] = {};
    t->SetBranchAddress("arr", arr);
    TStopwatch timer;
    timer.Start();
   // for(int i = 13999999; i < 15000000; ++i)
    for(int i = 0; i < 1000000; ++i)
        t->GetEntry(i);
    timer.Stop();
    std::cout<<"Time is: "<<int(timer.RealTime())<<std::endl;
}
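
To compare the first and the last 10^6 entries without editing the loop bounds by hand, the timing can be wrapped in a small helper. This is a sketch using only the standard library: timeEntryRange and its readEntry parameter are hypothetical names, and it assumes a C++11 compiler, so it would have to be compiled rather than typed at the CINT prompt.

```cpp
#include <chrono>
#include <cstdint>

// Time a callable over the half-open entry range [first, last) and return
// the elapsed wall-clock time in seconds. `readEntry` is a hypothetical
// stand-in for whatever reads one entry, e.g. a lambda calling t->GetEntry(i).
template <typename ReadEntry>
double timeEntryRange(std::int64_t first, std::int64_t last, ReadEntry readEntry) {
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = first; i < last; ++i) {
        readEntry(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling it once with (0, 1000000) and once with (14000000, 15000000), each time passing a lambda that does t->GetEntry(i), gives two numbers that can be compared directly and keeps the fractional seconds.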

aregjan · January 5, 2011, 7:33pm

The output shows that the first 1mln entries were read in 57sec, and the last
1mln were read in 10min. How’s this not obvious?

tpochep · January 5, 2011, 7:36pm

[quote=“aregjan”]The output shows that the first 1mln entries were read in 57sec, and the last
1mln were read in 10min. How’s this not obvious?[/quote]

And my “output” shows that both take 10 s. Is it obvious?
Without any code reproducing the problem, it’s useless to discuss what you measure and how.
Did you try my macros?

aregjan · January 5, 2011, 7:53pm

What do you mean no code?? The root commands that I used are up there – replace my run646647.root with your lovely .root file, cut-and-paste the commands, and see what you get.

tpochep · January 5, 2011, 7:59pm

When I told you about my file sizes and time difference today, this was my lovely root file and my brilliant macros I gave you. So the problem you are talking about exists only with your code and is not reproducible with simple macros.

aregjan · January 5, 2011, 8:09pm

There is no “my code”. There are simply two CINT commands, which show that it takes CINT 10x longer to process the last 1mln entries than the first 1mln entries.

If you want to contribute something to this – rather than get hung up in ad hominem exchanges – then I would suggest that you run those very same commands (by cutting and pasting) on your own .root file, and post the output here.

tpochep · January 5, 2011, 9:06pm

[quote=“tpochep”][quote=“aregjan”]There is no “my code”. There are simple two CINT commands. Which show that it takes CINT 10x longer to process last 1mln entries than the first 1mln entries.

If you want to contribute something to this – rather than get hung up in ad hominem exchanges – then I would suggest that you run those very same commands (by cutting and pasting) on your own .root file, and post the output here.[/quote][/quote]

Listen, I do not have your file, I do not have your tree structure. And in your “code”, which I should copy and paste, you did not even set a branch address. Are you familiar with TTree’s internals, and do you know exactly what happens?

That’s my output with your “two CINT commands” (I repeated them several times):

root [0] TFile f("tree.root")
root [1] TTree * t = (TTree *)f.Get("a")
root [2] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:04 CET 2011
Wed Jan 5 22:03:18 CET 2011
(const int)0
root [3] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:24 CET 2011
Wed Jan 5 22:03:39 CET 2011
(const int)0
root [4] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:41 CET 2011
Wed Jan 5 22:03:56 CET 2011
(const int)0
root [5] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:06 CET 2011
Wed Jan 5 22:04:14 CET 2011
(const int)0
root [6] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:16 CET 2011
Wed Jan 5 22:04:24 CET 2011
(const int)0
root [7] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:38 CET 2011
Wed Jan 5 22:04:47 CET 2011
(const int)0
root [8]

So, looks like the problem is in your tree.

aregjan · January 5, 2011, 10:41pm

Well, this is confirming my observation: that the last 1mln entries take longer to cycle through than the first 1mln. Sure, in your case the difference is 1.5x versus my 10x, but that can depend on a combination of the tree’s complexity, disk speed and particulars of TTree::GetEntry()'s implementation.

Axel · January 6, 2011, 8:32am

Hi,

can you give me access to your file, please? E.g. is it somewhere in AFS?

Cheers, Axel.

aregjan · January 6, 2011, 2:36pm

Axel, it’s a 2.2GB file, not on AFS. I am not sure how I could transfer it to you… does CERN have an FTP server where I could upload it?