Faster version of GetEntry()?

Hi,

I have a TTree object with 15mln+ entries. I have a script which loops through consecutive entries in the tree, and does stuff. To do this, I do something like this:

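void script(TTree *tree){
 // Read the entry count once; Long64_t matches the return type of GetEntries().
 const Long64_t n=tree->GetEntries();
 for(Long64_t i=0;i<n;i++){
 tree->GetEntry(i);
 DoStuff(); // placeholder for the per-entry processing
 }
}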

Trouble is, after about the 1mln-th entry things REALLY slow down. I am guessing that on every call GetEntry starts at position zero and seeks to position i in the file, which sure enough becomes slower with increasing i.

So my question is – is there a faster equivalent to this? I would like something along these lines (in pseudocode):

int i=0;
tree->GetEntry(0);
while(i<tree->GetEntries()){
 tree->MoveToNextEntryFAST(); //supposedly a very quick operation
 DoStuff();
 ++i;
}

The time to access a Tree entry is independent of the entry number.
You are probably hitting a memory leak problem. We need more info from your side.

Rene
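
The statement that access time does not depend on the entry number is consistent with an index-based lookup rather than a scan from entry zero. The sketch below is only an illustration of that idea, assuming entries are grouped into blocks and that the first entry number of each block is kept in a sorted in-memory array. It is not ROOT's actual implementation, and `findBasket` and `firstEntry` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical index: firstEntry[k] is the number of the first entry stored
// in block k. The vector must be sorted ascending and start at 0.
std::size_t findBasket(const std::vector<std::int64_t>& firstEntry,
                       std::int64_t entry) {
    assert(!firstEntry.empty() && firstEntry.front() == 0 && entry >= 0);
    // upper_bound returns the first block that starts after `entry`;
    // the block just before it is the one that contains `entry`.
    auto it = std::upper_bound(firstEntry.begin(), firstEntry.end(), entry);
    return static_cast<std::size_t>(it - firstEntry.begin()) - 1;
}
```

With such an index, the lookup cost grows with the logarithm of the number of blocks, not with the entry number itself, so entry 14,000,000 would be located about as quickly as entry 0.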

Rene, here are the two test cases involving TTree::GetEntry() that I ran directly from ROOT’s command line:

for(int i=0;i<1e+6;i++) { tree->GetEntry(i); if(i%100==0) cout<<i<<"\r";}  

and

int events=tree->GetEntries();     
for(int i=(events-1e+6);i<events;i++) { tree->GetEntry(i); if(i%100==0) cout<<i<<"\r";}

The first one took 30 sec to complete.
The second one – 250 sec.

I don’t see where a memory leak could cause this.

p.s. events~=15e+6.

What’s inside your tree object? Can you reproduce this behavior if you have, say, tree with integers?

For the content of ptree, see below:

root [3] ptree->Print()
******************************************************************************
*Tree    :ptree     : event data                                             *
*Entries : 15414423 : Total =     32715547411 bytes  File  Size = 2249737335 *
*        :          : Tree compression factor =  14.58                       *
******************************************************************************
*Br    0 :pulse     : adcoffset[1024]/s                                      *
*Entries :15414423 : Total  Size=31664308115 bytes  File Size  = 1550563390 *
*Baskets :  1027629 : Basket Size=      32000 bytes  Compression=  20.41     *
*............................................................................*
*Br    1 :event     : samples/i:rate/f:time/l:cn/i:pattern/i:counter/i:      *
*         | min_i/i:min/i:E/F:Ec/F:Ecp/f:Ped/F:PSD/F:PSDsm/F:Chisquare/F:id/i*
*Entries : 15414423 : Total  Size= 1051238902 bytes  File Size  =  692073074 *
*Baskets :    32867 : Basket Size=      32000 bytes  Compression=   1.52     *
*............................................................................*
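
The numbers in this Print() output can be cross-checked with a little arithmetic. The sketch below assumes that the `/s` type code denotes a 16-bit (2-byte) integer, so one `pulse` entry is 1024 * 2 bytes uncompressed; the helper names are hypothetical. It shows only how many entries share a basket, and does not by itself explain any slowdown.

```cpp
#include <cassert>
#include <cstdint>

// How many whole entries of a given size fit into one basket buffer.
std::int64_t entriesThatFit(std::int64_t basketBytes, std::int64_t bytesPerEntry) {
    assert(basketBytes > 0 && bytesPerEntry > 0);
    return basketBytes / bytesPerEntry;
}

// Average number of entries per basket, from the totals that Print() reports.
double averageEntriesPerBasket(std::int64_t entries, std::int64_t baskets) {
    assert(baskets > 0);
    return static_cast<double>(entries) / static_cast<double>(baskets);
}
```

For the `pulse` branch, `entriesThatFit(32000, 1024 * 2)` gives 15, which matches `averageEntriesPerBasket(15414423, 1027629)` of about 15.0. For the `event` branch the average is about 469 entries per basket. Reading 10^6 consecutive entries therefore touches roughly 67,000 `pulse` baskets but only about 2,100 `event` baskets.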

In my case I have a simple tree with integers (arrays, in fact); the file size is 1.7 GB, with 15000000 entries. Reading the first 10^6 entries is two times faster than reading the last 10^6. I’m not sure how this is implemented in TTree; probably this is due to the large file size.

My .root file is ~2.2GB…however in my case the difference between the first and last 1mln events
is dramatic – see above.

[quote=“aregjan”]My .root file is ~2.2GB…however in my case the difference between the first and last 1mln events
is dramatic – see above.[/quote]

I’ve added more branches, and now the file size is >= 3GB. It’s a bit slower now, but the difference between the first 10^6 and the last 10^6 has actually decreased; it’s ~30% now.

Can you show minimal code that reproduces your timings, and give your machine specs?

OK, I reran a smaller version of the code; the results are below:

root [0] TFile *_file0 = TFile::Open("/home/aregjan/data_vx511/root_pnpf1/run646647.root")
root [1] int events=ptree->GetEntries();  
root [2] cout<<events<<endl; 
15414423
root [3] system("date");for(int i=0;i<1e+6;i++) { ptree->GetEntry(i);}; system("date")              
Wed Jan  5 12:24:31 EST 2011
Wed Jan  5 12:25:28 EST 2011
(const int)0
root [4] system("date");for(int i=(events-1e+6);i<events;i++) { ptree->GetEntry(i);}; system("date")
Wed Jan  5 12:26:02 EST 2011
Wed Jan  5 12:36:01 EST 2011

As you see, a difference of ~10x
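
The ~10x figure can be checked directly from the `date` timestamps in the session above. This is a small sketch with hypothetical helper names; it assumes both timestamps of a pair fall on the same day, which is the case here.

```cpp
#include <cassert>

// Seconds since midnight for a wall-clock time on a single day.
int secondsOfDay(int h, int m, int s) {
    assert(h >= 0 && h < 24 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return h * 3600 + m * 60 + s;
}

// Elapsed seconds between two times on the same day.
int elapsedSeconds(int h1, int m1, int s1, int h2, int m2, int s2) {
    return secondsOfDay(h2, m2, s2) - secondsOfDay(h1, m1, s1);
}
```

`elapsedSeconds(12, 24, 31, 12, 25, 28)` gives 57 s for the first 10^6 entries, and `elapsedSeconds(12, 26, 2, 12, 36, 1)` gives 599 s for the last 10^6, a ratio of about 10.5. Since `date` has a resolution of one second, each interval is uncertain by roughly a second.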

Machine info:
Core 2 duo 2.40GHz, Dell E4300 notebook
Bus speed: 1066Mhz
4GB of memory
Running 64bit Ubuntu 10.04

Hmmm, a very nice and useless “dump”. I can produce something like this by hand:

Does it prove anything?

If your code is top secret, try these two macros and check whether you still see the same difference
(fix them if needed; I am just writing them here without executing them):

//tree_fill.C
void tree_fill()
{
    TFile f("tree.root", "recreate");
    TTree * t = new TTree("aaa", "aaa");
    int arr[1000] = {};
    t->Branch("arr", arr, "arr[1000]/I");
    for(int i = 0; i < 15000000; ++i)
       t->Fill();

    t->Write();
}
//tree_read.C

void tree_read()
{
    TFile f("tree.root");
    TTree * t = (TTree*)f.Get("aaa");
    if(!t)
    {
        std::cout<<"Could not find tree \"aaa\" in tree.root\n";
        return;
    }

    int arr[1000] = {};
    t->SetBranchAddress("arr", arr);
    TStopwatch timer;
    timer.Start();
    // To time the last 10^6 entries instead, use: for(int i = 14000000; i < 15000000; ++i)
    for(int i = 0; i < 1000000; ++i)
        t->GetEntry(i);
    timer.Stop();
    std::cout<<"Time is: "<<int(timer.RealTime())<<std::endl;
}
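
The macro above times one range per run, so comparing the first and last ranges means editing the loop and running it again. A possible alternative, sketched here in plain C++11 with no ROOT dependency, is to pass the loop body as a callable so that both ranges can be timed back to back in the same process. `timeRange` is a hypothetical helper, not part of ROOT.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Calls body(i) for every i in [first, last) and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename Body>
double timeRange(std::int64_t first, std::int64_t last, Body body) {
    assert(first <= last);
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = first; i < last; ++i) {
        body(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In a compiled program the callable would wrap the read, for example a lambda that calls `t->GetEntry(i)`, first over [0, 1000000) and then over [14000000, 15000000). Because the operating system's file cache can affect the result, alternating the order of the two ranges over several runs gives a fairer comparison.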

The output shows that the first 1mln entries were read in 57sec, and the last
1mln were read in 10min. How’s this not obvious?

[quote=“aregjan”]The output shows that the first 1mln entries were read in 57sec, and the last
1mln were read in 10min. How’s this not obvious?[/quote]

And my “output” shows that both take 10 s. Is that obvious?
Without any code reproducing the problem, it’s useless to discuss what and how you measure.
Did you try my macros?

What do you mean no code?? The root commands that I used are up there – replace my run646647.root with your lovely .root file, cut-and-paste the commands, and see what you get.

When I told you about my file sizes and time difference today, that was my lovely root file, and those were my brilliant macros that I gave you. So the problem you are talking about exists only with your code and is not reproducible with simple macros.

There is no “my code”. There are just two simple CINT commands, which show that it takes CINT 10x longer to process the last 1mln entries than the first 1mln entries.

If you want to contribute something to this – rather than get hung up in ad hominem exchanges – then I would suggest that you run those very same commands (by cutting and pasting) on your own .root file, and post the output here.

[quote=“tpochep”][quote=“aregjan”]There is no “my code”. There are just two simple CINT commands, which show that it takes CINT 10x longer to process the last 1mln entries than the first 1mln entries.

If you want to contribute something to this – rather than get hung up in ad hominem exchanges – then I would suggest that you run those very same commands (by cutting and pasting) on your own .root file, and post the output here.[/quote][/quote]

Listen, I do not have your file, and I do not have your tree structure. And in your “code”, which I am supposed to copy and paste, you did not even set a branch address. Are you familiar with TTree’s internals, and do you know exactly what happens?

That’s my output with your “two CINT commands” (I repeated them several times):

root [0] TFile f("tree.root")
root [1] TTree * t = (TTree *)f.Get("a")
root [2] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:04 CET 2011
Wed Jan 5 22:03:18 CET 2011
(const int)0
root [3] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:24 CET 2011
Wed Jan 5 22:03:39 CET 2011
(const int)0
root [4] system("date");for(int i=13999999;i<15000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:03:41 CET 2011
Wed Jan 5 22:03:56 CET 2011
(const int)0
root [5] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:06 CET 2011
Wed Jan 5 22:04:14 CET 2011
(const int)0
root [6] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:16 CET 2011
Wed Jan 5 22:04:24 CET 2011
(const int)0
root [7] system("date");for(int i=0;i<1000000;i++) { t->GetEntry(i);} system("date")
Wed Jan 5 22:04:38 CET 2011
Wed Jan 5 22:04:47 CET 2011
(const int)0
root [8]

So it looks like the problem is in your tree.

Well, this confirms my observation that the last 1mln entries take longer to cycle through than the first 1mln. Sure, in your case the difference is roughly 1.8x versus my 10x, but that can depend on a combination of the tree’s complexity, disk speed and the particulars of TTree::GetEntry()'s implementation.
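
Because each range was timed three times in the session above, the slowdown can be estimated from the averages instead of a single run. The sketch below uses hypothetical helper names and takes the per-run durations in seconds as read off the `date` timestamps: 14, 15 and 15 s for the last 10^6 entries, and 8, 8 and 9 s for the first 10^6.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of run times, in seconds.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Ratio of the average time for the last range to that for the first range.
double slowdown(const std::vector<double>& lastRange,
                const std::vector<double>& firstRange) {
    return mean(lastRange) / mean(firstRange);
}
```

`slowdown({14, 15, 15}, {8, 8, 9})` evaluates to 44/25 = 1.76. With one-second timestamp resolution on runs this short, the estimate is rough and should be read as somewhere between about 1.5x and 2x.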

Hi,

can you give me access to your file, please? E.g. is it somewhere in AFS?

Cheers, Axel.

Axel, it’s a 2.2GB file, not on AFS. I am not sure how I could transfer it to you…does CERN have an ftp server where I could upload it to?