Implicit maximum on TTree size to be written to a TFile?

I have a program which, in quasi code, looks like this:

TChain* ch = new TChain("my_super_tree_name");
for (const auto& fname : input_filenames) {
  ch->Add(fname);
}
TFile* dummy = new TFile("/tmp/buffer.root","recreate");
TTree* buffertree = ch->CopyTree( some_cuts );
TTree* newtree = new TTree();
// add some branches to newtree
Long64_t entries = buffertree->GetEntries();
entries = std::min((Long64_t)2000000,entries); /// REMOVEME
for (Long64_t i = 0 ; i < entries ; ++i) {
  buffertree->GetEntry(i);
  // do some computations to have values for newtree's branches
  newtree->Fill();
  // draw nice progress bar
}
std::cout << "done looping" << std::endl;
TFile* of = new TFile(proper_outputfilename, "recreate");
of->WriteTObject(newtree);
of->Close();
std::cout << "done writing" << std::endl;

This runs fine over 2M entries in a few minutes (time says 2min34s). buffertree has about 9M entries, but when trying to loop over the entire tree (i.e. removing the REMOVEME line) I see “done looping” and 9 hours later the outputfile still has a size of less than 1kB and I see no progress (except that the CPU and memory usage of the program is still high).

Suggestions? (other than splitting into several subjobs and keeping in mind not to loop too far…)
Is there a -to me unknown- size limit of how much I can write to a TFile in one go? Should I trigger the writing to file by hand in the loop every 100k entries?

Thanks in advance,
Paul

PS: using root 6.06.00 from afs w/ gcc49

Hi Paul,

First, a set of meta comments which are probably moot and only due to the simplification:

TTree* newtree = new TTree();

For a TTree to be storable, one must give it a name and title. Also, the code does not add any branches to the output TTree.

TFile* of = new TFile(proper_outputfilename, "recreate");
of->WriteTObject(newtree);
of->Close();

Usually one associates the TTree with its output file from the get-go. This would ‘work’ as intended if and only if ‘newtree’ were set up as an in-memory TTree (i.e. not associated with any file). The consequence is that all the TTree data would have to fit in memory (and a large TTree might lead to a huge amount of swapping).

TFile* dummy = new TFile("/tmp/buffer.root","recreate");
....
TTree* newtree = new TTree("newtree","new title");
for (Long64_t i = 0 ; i < entries ; ++i) {
  ...
  newtree->Fill();
}
TFile* of = new TFile(proper_outputfilename, "recreate");
of->WriteTObject(newtree);

literally means that the data will be stored in buffer.root (as this is the TFile the TTree gets associated with) and the newtree metadata gets stored in proper_outputfilename.

Cheers,
Philippe.

Hi,

Now onto the likely source of inefficiency:

TFile* dummy = new TFile("/tmp/buffer.root","recreate");
TTree* buffertree = ch->CopyTree( some_cuts );
....
for (Long64_t i = 0 ; i < entries ; ++i) {
  buffertree->GetEntry(i);

means that you literally copy a large fraction of the data onto a temporary file ‘just’ to get the list of entries you want to select. This is a very expensive way to do it.

An approach to keep the selection separate from the actual copy:

TChain* ch = new TChain("my_super_tree_name");
for (const auto& fname : input_filenames) {
  ch->Add(fname);
}
TEntryList *entryList = new TEntryList("mysel");
ch->Draw(">>mysel",some_cuts,"entrylist");
ch->SetEntryList(entryList);

TFile* of = new TFile(proper_outputfilename, "recreate");
TTree* newtree = new TTree("newtree","new title");
// add some branches to newtree

for (Long64_t i = 0 ; ; ++i) {
  Long64_t entryNumber = ch->GetEntryNumber(i);
  if (entryNumber < 0) break;
  Long64_t localEntry = ch->LoadTree(entryNumber);
  if (localEntry < 0) break;
  ch_branch1->GetEntry(localEntry);
  ch_branch2->GetEntry(localEntry);
  // do some computations to have values for newtree's branches
  newtree->Fill();
  // draw nice progress bar
}
std::cout << "done looping" << std::endl;
of->Write();
of->Close();
std::cout << "done writing" << std::endl;

but even better is to do the selection inline:

[code]
TChain* ch = new TChain("my_super_tree_name");
for (const auto& fname : input_filenames) {
  ch->Add(fname);
}
TBranch* ch_branch1 = nullptr;
ch->SetBranchAddress(branch_name, &ptr_to_dataobject, &ch_branch1);
// likewise for ch_branch2
TFile* of = new TFile(proper_outputfilename, "recreate");
TTree* newtree = new TTree("newtree","new title");
// add some branches to newtree

for (Long64_t i = 0 ; ; ++i) {
  Long64_t localEntry = ch->LoadTree(i);
  if (localEntry < 0) break;
  ch_branch1->GetEntry(localEntry);
  ch_branch2->GetEntry(localEntry);

  if ( do_not_pass_cut) continue;
  // do some computations to have values for newtree's branches
  newtree->Fill();
  // draw nice progress bar
}
std::cout << "done looping" << std::endl;
of->Write();
of->Close();
std::cout << "done writing" << std::endl;
[/code]

Cheers,
Philippe.

Hi Philippe,

thanks a lot for the comprehensive answer! Will implement it right away!

Cheers,
Paul

Indeed, that’s why I called it quasi code. But good you corrected it anyways.