Write large TTreeIndex to file


ROOT Version: 6.18.04
Platform: CentOS7
Compiler: gcc 8.3.0


Hi, I have two large TChains, each of which needs a TTreeIndex, and I need to make one a friend of the other. I need to run the entire script several times, and the index building is clearly a major bottleneck. I am therefore saving the indices to file on the first pass and simply reading them back on later passes, using the following logic:

string index_file_name = "my_file_with_indices.root";
// Use the presence of the file to decide whether to read or to build the indices.
ifstream _index_file(index_file_name.c_str());
TFile* index_file;
TTreeIndex *first_index, *second_index;
if (_index_file.good()) {
  // Later passes: read the previously saved indices back.
  index_file   = new TFile(index_file_name.c_str(),"READ");
  first_index  = (TTreeIndex*)index_file->Get("first_index");
  second_index = (TTreeIndex*)index_file->Get("second_index");
}
else {
  // First pass: build the indices (slow) and write each one under its own key.
  index_file   = new TFile(index_file_name.c_str(),"RECREATE");
  first_index  = new TTreeIndex(first_chain,  "major", "minor");
  second_index = new TTreeIndex(second_chain, "major", "minor");
  first_index->Write("first_index");
  second_index->Write("second_index");
}

This runs before I set the TChain indices to first_index and second_index and befriend the TTrees. The second index covers a particularly large number of entries, so on writing it I get

Error in <TBufferFile::WriteByteCount>: bytecount too large (more than 1073741822)

I naïvely tried increasing the buffer size with

second_index->Write("second_index",0,1073741822*2);

but this raises

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

@pcanal Any suggestions?

Writing a single object is indeed capped at 1 GB (some internal pointers/references are 32 bits wide in the binary representation). Trying to allocate a buffer larger than 1 GB has ‘undefined’ behavior :).

Do your files have non-overlapping ranges of [first_index, second_index]?

Hi @pcanal, sorry, what do you mean by “non-overlapping range”? ([first_index, second_index] provides a unique identifier across all files of the TChain, if that is your question.)

I mean if you have 2 files you can have either:
(a)

file1:
run#1 event#2
run#3 event#1
file2:
run#2 event#3
run#4 event#1

or (b)

file1:
run#1 event#2
run#2 event#3
file2:
run#3 event#1
run#4 event#1

In the second case (b), you can generate a TTreeIndex per file and it will work.
In the first case (a), it will not, as the lookup would need to scan/open each file to find the correct entry.
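One way to check which of the two cases a set of files falls into is to compare the smallest and largest (run, event) key of each file. The following is only a minimal standalone sketch: `FileRange` and `ranges_are_disjoint` are hypothetical names introduced here, not ROOT API, and the per-file minimum and maximum keys are assumed to have been collected beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (run, event) key; std::pair compares lexicographically: run first, then event.
using Key = std::pair<std::int64_t, std::int64_t>;

// Hypothetical per-file summary: smallest and largest key found in that file.
struct FileRange {
    Key first;
    Key last;
};

// Returns true when the key ranges of the files do not interleave, i.e. case (b).
bool ranges_are_disjoint(std::vector<FileRange> files) {
    // Order the files by their smallest key so only neighbours need comparing.
    std::sort(files.begin(), files.end(),
              [](const FileRange& a, const FileRange& b) { return a.first < b.first; });
    for (std::size_t i = 0; i < files.size(); ++i) {
        assert(!(files[i].last < files[i].first));  // each range must be well formed
        // A file must start strictly after the previous file ends.
        if (i > 0 && !(files[i - 1].last < files[i].first)) return false;
    }
    return true;
}
```

With the two layouts above, case (a) gives the ranges [(1,2), (3,1)] and [(2,3), (4,1)], which interleave, while case (b) gives [(1,2), (2,3)] and [(3,1), (4,1)], which do not.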

Ah, I see what you mean :) My setup is your (b), so yes I could generate one TTreeIndex per file.

But in fact I can reproduce the problem if I consider even a single file with two trees. Specifically, one has 8546147 entries and I can save the corresponding TTreeIndex; whereas the second one has 79825000 entries and raises the <TBufferFile::WriteByteCount>: bytecount too large error above.
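These two entry counts are consistent with the cap quoted in the error message. As a rough back-of-the-envelope check, the sketch below assumes that the index stores at least two 8-byte integers per entry (a sorted key value and the matching entry number). That per-entry size is an assumption made for illustration, and the real streamed size may well be larger. The function names are hypothetical.

```cpp
#include <cstdint>

// Cap quoted in the TBufferFile::WriteByteCount error message.
constexpr std::int64_t kMaxByteCount = 1073741822;

// ASSUMPTION: at least two 8-byte integers per entry (sorted value + entry number).
constexpr std::int64_t kAssumedBytesPerEntry = 2 * sizeof(std::int64_t);

// Lower-bound estimate of the streamed size of an index with n_entries entries.
constexpr std::int64_t estimated_index_bytes(std::int64_t n_entries) {
    return n_entries * kAssumedBytesPerEntry;
}

// True when the estimate stays within the single-object cap.
constexpr bool fits_in_one_key(std::int64_t n_entries) {
    return estimated_index_bytes(n_entries) <= kMaxByteCount;
}
```

Under that assumption, 8546147 entries come to about 137 MB, well below the cap, while 79825000 entries come to at least 1277200000 bytes, which is above it.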

Is there a way I could split the writing of the TTreeIndex for a single tree in a single file? Alternatively, how can I write the corresponding information to, say, a csv file, and read it back to create my own TTreeIndex?

Yes.

You can save the result of mytree->GetTreeIndex() as an individual key in the file, and then call mytree->SetTreeIndex(nullptr) before writing the tree. At read time you then need to do the converse: read the index individually and re-attach it to the tree.

What we really need is a change in TTree::Streamer that makes this process automatic (and split it over several keys if needed).
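To illustrate what "split it over several keys" could mean in terms of bookkeeping, here is a minimal standalone sketch that cuts an entry range into pieces that each stay under the cap. It is not what TTree::Streamer does and it uses no ROOT API. `split_into_chunks` is a hypothetical helper, and the bytes-per-entry value is an assumption the caller has to supply. In practice some headroom below the cap would be needed for per-object overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Cap quoted in the TBufferFile::WriteByteCount error message.
constexpr std::int64_t kMaxByteCount = 1073741822;

// Half-open [begin, end) range of entries that would go into one key.
using EntryRange = std::pair<std::int64_t, std::int64_t>;

// Splits n_entries into consecutive ranges whose estimated size
// (entries * bytes_per_entry) does not exceed max_bytes.
std::vector<EntryRange> split_into_chunks(std::int64_t n_entries,
                                          std::int64_t bytes_per_entry,
                                          std::int64_t max_bytes = kMaxByteCount) {
    assert(n_entries >= 0 && bytes_per_entry > 0 && max_bytes >= bytes_per_entry);
    const std::int64_t max_entries_per_chunk = max_bytes / bytes_per_entry;
    std::vector<EntryRange> chunks;
    for (std::int64_t begin = 0; begin < n_entries; begin += max_entries_per_chunk) {
        chunks.emplace_back(begin, std::min(begin + max_entries_per_chunk, n_entries));
    }
    return chunks;
}
```

For example, `split_into_chunks(79825000, 16)` yields two ranges, where 16 bytes per entry is again only an assumed lower bound. Each range would then have to be written and read back under its own key and the pieces recombined, which is the part that the current streamer does not do automatically.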

You can save the result of mytree->GetTreeIndex() as an individual key in the file, and then call mytree->SetTreeIndex(nullptr) before writing the tree.

I’m probably misunderstanding something here, because that doesn’t seem to help…

What we really need is a change in TTree::Streamer that makes this process automatic

Is that something that could happen soonish? :)
