Memory leak with BuildIndex method, for a big TChain


ROOT Version: 6-12-04
Platform: Ubuntu 18-04 (with virtual machine installed on Windows 10)
Compiler: Not Provided


Hi everybody !

I’m working on a virtual machine (Virtual Box) in which is installed Ubuntu and root 6-12-04. In order to sum up the work I want to realize, here are my different steps:

  • Open a root file to extract the big TChain inside (same branches for all TTrees) ;
  • Disable all branches / enable interesting branches, named : “Energy” , “Timestamp” , “Channel” ;
  • Select interesting entries, with a TCut on Energy ;
  • Realize a sortage on Timestamp (with BuildIndex() ? ) ;
  • Data processing

My script, with some comments :

void Energy_Thresholds()
{
   cout<<endl<<"-----------------------------------------------------------------------------" << endl;
   cout<<endl<<"----------------------- Energy Thresholds ------------------------" << endl;
   cout<<"-----------------------------------------------------------------------------"<<endl;
   
   clock_t time_1 = clock();

   // Open the root file with the TChain "run_6.root" 
   TFile* original_File = TFile::Open("run_6.root");
   TTree* original_Tree = (TTree*)original_File->Get("Filtered");

   cout << "number of entries :  " << original_Tree->GetEntries() << endl; // ~ 170m entries

   original_Tree->SetBranchStatus("*" , 0); // disable all branches
   original_Tree->SetBranchStatus("Channel" , 1); // I have to keep this information for later
   original_Tree->SetBranchStatus("Energy" , 1); // enabled for the energy thresholds
   original_Tree->SetBranchStatus("Timestamp" , 1); // enabled for the sortage, and the draw of an histogram 

   // new file, "run_6_1.root" , in order to copy the Tree inside, with some conditions on Energy
   TFile* output_File = TFile::Open("run_6_1.root" , "RECREATE");
   TTree* output_Tree = original_Tree->CopyTree("Energy>80 && Energy<140"); // between 80 and 140 -> ~55m entries
   
   // CopyTree juste create a line in the TFile as following : 
   // OBJ:     TTree     Filtered     TTree with filtered sorted events : 0 at: 0x5654b33e3360    ->   in RAM?
   output_File->Write(); // The new Tree will be written , and will stay in the output_File when closed.  -> In Disk
       
   clock_t time_2 = clock();
   cout << "process time for thresholds: " << ((float)(time_2-time_1) / CLOCKS_PER_SEC) << "seconds" << endl;
   // This function seems to be a little bit faster than function with TEventList method , or a Fill method with "if" statement inside a "for" loop 
   
   delete original_File; // or original_File->Close() and output_File->Close() ?
   delete output_File;  
}


TTreeIndex* Sort_TS()  // It can also be a void function. The TTreeIndex will be written, and can be read after
{
   cout<<endl<<"-----------------------------------------------------------------------------" << endl;
   cout<<endl<<"------------------------ Sortage by increasing Timestamps ------------------------" << endl;
   cout<<"-----------------------------------------------------------------------------"<<endl;
 
   clock_t time_1 = clock();

   TFile* original_File = TFile::Open("run_6_1.root" , "UPDATE");
   TTree* original_Tree = (TTree*)original_File->Get("Filtered");
 
   // HERE is my main problem ... I try to create a BuildIndex on Timestamp ...
   original_Tree->BuildIndex("Timestamp" , "Channel"); // A sort on Timestamp, but I will keep Channels in the TTreeIndex object
   original_Tree->Write(); // Save the TTree in current File, with its new index

   TTreeIndex* ind = (TTreeIndex*)original_Tree->GetTreeIndex(); // Check if the new Index is built, and if we can get it
   cout << ind->GetN() << " entries in the TTreeIndex" << endl;

   ind->Print("10"); // a little test to check if we have what we want 

   clock_t time_2 = clock();
   cout << "process time : " << ((float)(time_2 - time_1 ) /CLOCKS_PER_SEC) << " seconds" <<endl;

   // Here, I'm lost ... I can't "return ind" after the "delete original_File" because the TTreeIndex* pointer doesn't exist anymore ...
   // It's pretty bad, right ? I'm looking for an other way to do this  (void function for example)
   return ind ; // I want to return it, in order to call it as a parameter in the following process function. So, it stays in RAM all the time ? 
   delete original_File;
}

void Process(TTreeIndex* in_ts)
{
   // my process with TTreeIndex 
}


int main () {
   Energy_Thresholds();
   TTreeIndex* ind = Sort_TS();
   Process(ind);

   return 0;
}

Sometimes, the TChain have 170 millions entries, and the Energy TCut gives more than 55 millions entries remaining. For these cases (unfortunately I didn’t find a way to get the size of it …) , the call of BuildIndex leads to an exception : std::bad_alloc .
More precisely : " Error in TRint::HandleTermInput(): std::bad_alloc caught: std::bad_alloc "

Am I really bad with my pointers management ? Is it possible that my allocated RAM for the virtual machine (~3000 Mo) is not big enough ? Is there a way to make a kind of chunk sizing ?

Two examples in the prompt, one working (TCut E>135 and E<140) , and not the other … (TCut E>80 and E<140) :

Energy Thresholds
number of entries : 170253997
process time for thresholds: 68.9602seconds

Sortage by increasing Timestamps
2053011 entries in the TTreeIndex

serial : Timestamp : Channel

   0 :         578174060 :                6
   1 :         659930369 :                3
   2 :         1023391193 :                0
   3 :         1602030416 :                7
   4 :         1677231365 :                7
   5 :         1768918345 :                2
   6 :         2461758888 :                7
   7 :         2552985037 :                5
   8 :         2971680275 :                7
   9 :         3318579685 :                0

process time : 3.46273 seconds

Processing
Info in TCanvas::MakeDefCanvas: created default TCanvas with name c1
(int) 0

Energy Thresholds

number of entries : 170253997
process time for thresholds: 146.94seconds

Sortage by increasing Timestamps
Error in TRint::HandleTermInput(): std::bad_alloc caught: std::bad_alloc

I’ve already found topics about similar problems on this forum :
“Create a new index for a TTree”
“BuildIndex aborts for a large TTree”

but it didn’t help me a lot …

Somebody have an idea to help me go further ?
Thanks in advance for your help!

Erik

You have a TChain - but let’s assume that it would be a TTree for now. https://root.cern.ch/doc/master/TTreeIndex_8cxx_source.html#l00123 shows that BuildIndex needs five arrays of type Long64_t with a the size being the number of entries. For your case that’s 17025399758 bytes = 6810159880bytes = 6.8GB.

Can you maybe filter the tree instead? See tutorials/tree/copytree3.C.

Cheers, Axel

Axel,

Thanks for your reply !
Also, thanks for the way to calculate sizes. As you noticed, I’m not very familiar with this.

I just made a Filter function based on the copytree3.C process :

void Filter()
{
   TFile* old_file = new TFile ("run_6.root");
   TTree* old_tree = (TTree*)old_file->Get("Filtered");
   old_tree->Print();

   Long64_t n_entries = old_tree->GetEntries();

   old_tree->SetBranchStatus("*" , 0); 
   old_tree->SetBranchStatus("Channel" , 1); 
   old_tree->SetBranchStatus("Energy" , 1); 
   old_tree->SetBranchStatus("Timestamp" , 1);

   UShort_t En = 0;
   old_tree->SetBranchAddress("Energy" , &En);

   TFile* new_file = new TFile ("run_6_1.root" , "recreate");
   TTree* new_tree = old_tree->CloneTree(0);

   for (Long64_t i=0; i < n_entries; i++) {
      old_tree->GetEntry(i);
      if (En > 80 && En < 140 ) new_tree->Fill();
   }

   cout << endl << "new Tree : " << endl;
   new_tree->Print();
   new_tree->AutoSave();

   delete old_file;
   delete new_file;
}

I don’t really know how to do my Timestamp sortage then, but I have the feeling that it’s possible to do this inside the “for” loop. Is this what you meant ?
I’m thinking about a projection of all Timestamps in a TEventList with Draw() , and a sortage with TMath::Sort() (outside the loop of course), and then, a tree->Fill() inside the loop.

I noticed that tree->Fill() is unfortunately quite slow … The method with TEventList and TMath::Sort() works fine and quickly, but filling the final tree is very long …

Thanks in advance !

Regards,

Hi,

Sorry I failed to see the TChain entry sorting in your earlier example. Do you really really really need to sort the entries? This is fairly involved; reading that tree in a random order of tree entries (i.e. the result of the timestamp sort) is really inefficient.

Are the trees in each file sorted by timestamp? Then do the following, similar to copytree3.C: you open each file, extract each tree (how many do you have?) then you create an output tree, and fill the input entry into the output tree that has the lowest timestamp. With that you’re stepping through the input trees, sorting them, and you do this once and afterwards you only ever iterate over the sorted, combined output tree.

How does that sound? If the approach makes sense (and my assumption holds) then we can get the code figured out :slight_smile:

Cheers, Axel.

Hi !

It’s not necessary for me to sort the entries, but I need to know (or recreate) the exact increasing order. That’s why BuildIndex() sounded good for me. In fact, I have to calculate all differences in time between each neighbours Timestamps.
If there is a way to do this with a reading of the big tree in the right order (without sortage), I didn’t manage to find it.

Unfortunately, neighbours Timestamps are not always in the same TTree, and trees in each file are not sorted by Timestamp … (only some fractions of entries are sorted, but not the whole TTree).
For this Chain, I have 110 Files, 1 TTree by File (1548217 entries each).

Your solution sounds really good !! Thank you. I think I understand the principle.
Maybe it’s possible to sort all of the little trees, and then, apply your method ?

Cheers !

How often do you re-analyze the tree once you have determined the order?

Once the order is done, I just realize one calculation of all delta timestamps in order to draw an histogram. But I will need to keep the channel number, in order to make another filter on my data, in function of the current Delta Timestamp value and the Delta Channel value.

I’m sorry if it’s not clear : In fact, when I calculate a delta Timestamp, if it’s under a certain threshold, I have to verify another condition on corresponding delta channels. If all conditions are verified, I keep the delta Timestamp value for the histogram.

For other applications, with a similar big TTree, I need to create an algorithm which can give me a list of coincidence windows, and the multiplicity counting (with 2 differents methods). I’ve already coded this in Python, but this is not enough fast and I will do this in C++ :wink: I’m pretty sure it will be the easier part when I’m
able to organize my data.

Hi Erik,

That makes sense: create an index for each file separately - that should be fine, memory-wise. Store it e.g. with the tree.

Then open all files, and always pick the next element from the index with the lowest time stamp, writing into a new tree (that you could split across multiple file as TTree does it automatically.)

Let me know if you need help with creating the code that does this!

Cheers, Axel.

Hi Axel,

I was very busy these last days but I started to code the script which will do this. I will come back soon with my work, and I will not hesitate if I encounter some difficulties.

Thank you again !
Cheers,

Erik

Hi Axel,

I finished the script, but I still have a little problem …
My code is the following :

bool RootFileExists(TString FName)
{
ifstream file(FName); 

  if(file)
  {	  
   return true;
  }  
  else return false;
}




int Create_indexes()
{

   TString	base="run_";
   int 		numRun = 0;
   int 		count=1;
   TString	wildcard=".root";

   int 		number_of_files = 1;
   int  	n = 0 ; 



   //the first file always exists : "run_0.root" : 
   TString	first_fileName = base + Form("%d",numRun) + wildcard;
   
   TFile* first_f = new TFile (first_fileName , "update");
   TTree* first_t = (TTree*)first_f->Get("Filtered");
   n = first_t->GetEntries();
   first_t->SetEstimate(n);
   first_t->BuildIndex("0" , "Timestamp");
   cout << first_t->GetEstimate();
   first_t->Write();
   delete first_f;


   //other files in folder: 
   TString	fileName = base + Form("%d",numRun) + "_" + Form("%d",count) + wildcard;
   while(RootFileExists(fileName) == true) {
    
      TFile* f = new TFile(fileName , "update");
      TTree* t = (TTree*)f->Get("Filtered");
      n = t->GetEntries();
      t->SetEstimate(n);
      t->BuildIndex( "0" , "Timestamp");
      t->Write();

      //for the following iteration :
      count ++;
      number_of_files++;
      cout << " Index built for file " << fileName << endl;
      fileName = base + Form("%d",numRun) + "_" + Form("%d",count) + wildcard;
    
      delete f ;

   }

   return number_of_files; 

}




void Create_final_TTree(int number_of_files)
{
   TString	base="run_";
   int 		numRun = 0;
   int 		count=1;
   TString	wildcard=".root";


   int	current_index_number_for_all_files [number_of_files];
   for (int i = 0 ; i < number_of_files ; i++ ) {
      current_index_number_for_all_files[i] = 0; // [0,0,0,0, ... ,0]
   }

   TTreeIndex* 	list_of_Indexes[number_of_files]; 
   int		iterate_in_while = 1;



   //First index : 
   TString	first_fileName = base + Form("%d",numRun) + wildcard;
   TFile* first_f = new TFile (first_fileName , "update");
   TTree* first_t = (TTree*)first_f->Get("Filtered");
   TTreeIndex* first_I = (TTreeIndex*)first_t->GetTreeIndex();
   list_of_Indexes[0] = first_I ;

   first_t->SetBranchStatus("*" , 0);
   first_t->SetBranchStatus("Channel" , 1); 
   first_t->SetBranchStatus("Energy" , 1); 
   first_t->SetBranchStatus("Timestamp" , 1); 
   ULong64_t TS = 0;
   first_t->SetBranchAddress("Timestamp" , &TS);


   //copy the first tree in order to create an empty final tree to fill later. 
   TFile* final_File = new TFile ("final_tree.root" , "recreate");
   TTree* final_Tree = (TTree*)first_t->CloneTree(0);
   //final_Tree->Write();


   //other files : 
   TString	fileName = base + Form("%d",numRun) + "_" + Form("%d",count) + wildcard;  
   while(RootFileExists(fileName) == true) {  

      TFile* f = new TFile (fileName);
      TTree* t = (TTree*)f->Get("Filtered");
      //first_t->AddFriend(t); // apparently not needed
      TTreeIndex* I = (TTreeIndex*)t->GetTreeIndex();
      list_of_Indexes[iterate_in_while] = I ; 
   
      iterate_in_while ++ ;
      count ++ ;
      fileName = base + Form("%d",numRun) + "_" + Form("%d",count) + wildcard;  

   }


   ULong64_t	comparison_List[number_of_files];
   ULong64_t	index_min_value = 0;
   
   cout << "filling the final tree : " << endl;
   //for (int i = 0 ; i < total_entries ; i ++ ) {
   for (int i = 0 ; i < 450; i ++ ) { // less entries for tests

      for (int j = 0 ; j < number_of_files ; j++) {
         comparison_List[j] = list_of_Indexes[j]->GetIndexValuesMinor()[current_index_number_for_all_files[j]];
         cout << comparison_List[j] << endl;
      }

      index_min_value = TMath::LocMin (number_of_files , comparison_List);

      // test  (problem with Int_t , Long64_t ... )
      cout << comparison_List[index_min_value] << " = " << Int_t(comparison_List[index_min_value]) << " ? " << endl;

      first_t->GetEntryWithIndex( 0 , comparison_List[index_min_value]);
      final_Tree->Fill();
      cout << "good TS ? " << TS << endl;
      

     // PROBLEM ... GetEntryWithIndex() takes Int_t and not Long64_t as parameters.
     // For big values of Timestamps, comparison_List[index_min_value] becomes something else, and we can't get the entries anymore
     // "TS" stays a reference to the last timestamp which has not been transformed (Long64_t -> Int_t)
     // So, all following filled entries are the same .... 

      current_index_number_for_all_files[index_min_value] ++ ; 

   }

   final_File->cd();
   final_Tree->Write();   

   delete final_File;

}

It works, but for big values of TS, there is a problem with Long64_t and Int_t …

GetEntryWithIndex() takes Int_t as parameters and this type of number can’t read correctly my big values …
I have found this topic, talking about the same problem :
https://sft.its.cern.ch/jira/si/jira.issueviews:issue-html/ROOT-9028/ROOT-9028.html

I tried different parameters for BuildIndex, with Timestamp as major, or minor … do not work.

Is there another way to access my entries correctly, when Timestamps are big ?
I would be happy to get your feedback concerning my code :wink:

Thanks in advance !
Cheers,
Erik

Thanks to this topic :

I use :
Long64_t entry = first_t->GetEntryNumberWithIndex( 0 , comparison_List[index_min_value])
first_t->GetEntry(entry)

Instead of:
first_t->GetEntryWithIndex( 0 , comparison_List[index_min_value]);

and it works !
Now, I’m going to make some tests about the execution time !

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.