Random sample of entries

Dear root experts,

I am currently facing a performance issue which I wonder if there is a nice work around for it. I am trying to loop over a small sample (10000events) of a NTUP (~10mio events) since there is a structure within the NTUP I would like to pick a randomly 10000 events. Here the way I implemented it:

import random
r=random.sample(xrange(nEntries),1000)
#tree.LoadBaskets() # this is something I found in a forum but to just execute this line takes 30+ min
for i in r:
tree.GetEntry(i)

if I do this the time to loop over each event is about 100times higher as if I just loop in order (1,2,3,4,5, …) over the events

thanks for your help
Michael

Try to sort entry numbers of your sample (in an increasing order) before you enter your “for” loop.

suggestion: use the TBits class
-create a TBits bits(10000000)

  • get 10000 integer numbers between 0 and 10 millions
    -for each integer use bits.TestBitNumber to check if you have already selected this particular entry and call bits.SetBitNumber otherwise
    -then loop on the bits object using bits.FirstSetBit(startbit)
    -call tree.GetEntry with the value returned by FirstSetBit

Rene

Hi Rene,

I am not quite sure how to implement your idea
I can create the 10000 integer numbers and they are unique so I think I dont have to check but I dont quite understand the other parts
for each integer I call bits.SetBitNumber(integer)?
how do I loop afterwards? and what is startbit?
an example like
for i in bits…:
tree.GetEntry(bits.FirstSetBit(i))

would be nice
thanks Michael

see example below

Rene

void bits() {
   TFile *file = TFile::Open("keep_findall_123668.root");
   TTree *T = (TTree*)file->Get("T");
   Long64_t nentries = T->GetEntries();
   printf("tree has %d entries\n",nentries);
   TBits *bits = new TBits(nentries); //see http://root.cern.ch/root/html534/TBits.html
   //now assume you want to read N=1000 entries out of nentries
   Int_t N = 1000;
   Int_t i=0;
   TRandom3 rand(0);
   while(i<N) {
      Int_t j = rand.Uniform(0,nentries);
      if (bits->TestBitNumber(j)) continue;  //we have already seen this entry
      bits->SetBitNumber(j);
      i++;
   }
   //now loop on the N selected entries
   Int_t fbit = 0;
   for (i=0;i<N;i++) {
      j = bits->FirstSetBit(fbit);
      //read this Tree entry
      T->GetEntry(j);
      //process entry
      printf("reading entry %d\n",j);
      //.....
      fbit = j+1;
   }
}

OK danke das gibt die ziemlich gleiche performance wie das vorherige sortieren der random list of integres

–Michael

sry wrong language

the performance is about the same with sorting the random list first or the bits so I stick with the sort for now

thanks for your help
Michael

[quote=“mstoebe”]
if I do this the time to loop over each event is about 100times higher as if I just loop in order (1,2,3,4,5, …) over the events

thanks for your help
Michael[/quote]

Do you mean that looping over 10000 random events is much slower than looping over the first 10000 events?

This would not be too surprising given how disk access works (sequential is faster), and possibly the way ROOT compresses the files (contiguous events may be compressed together? Don’t know details), and maybe memory caching considerations.

Hi,

Yes, the content of ROOT TTree is optimized for sequential reading. Random reads is possible but lead to multiple decompression of the same events (events are bunched in basket for each branches). You will be much better sorting the entry reading them in increasing order. One of the best way to do so is to use a TEntryList.

Cheers,
Philippe.