Storing a large object in a TFile

Dear ROOTers,

I have a large object (several GB) which essentially consists of a vector<map<TString, map<TString, double>>>. I cannot write this object to a TFile, no matter how. I have tried the following:

  1. Using ClassImp and ClassDef to implement the default Streamer and writing the object using TFile.WriteObject() (from PyROOT, if that matters);
  2. Putting the object in a TTree branch with maximum splitting.

Neither works, for different reasons. I opened a discussion topic about this problem a while ago, where I also posted a minimal reproducer for the bug and the corresponding error messages.

Meanwhile, I have kept working on this and tried a different approach. I have implemented a custom streamer method for my class by basically reimplementing serialization from scratch for all the relevant STL types. I convert my object to a huge std::string and I write it in “small” (< 1 GB) chunks to the buffer using TBuffer::WriteStdString(). I hoped this would be sufficient to work around the limitations of TBuffer, but I keep running into the same segfaults because (I think) the TBuffer keeps expanding. I expected it to behave like a buffer, in the sense that it empties itself when it is at full capacity.

Can anyone please help me work around these limitations? All I want is to write a large object to a TFile. I have done whatever is necessary to be able to serialize the large object in chunks of arbitrarily small size. How do I avoid TBuffer overflowing its maximum capacity of about 1 GB?

Thanks in advance for your help!
Davide

ROOT Version: 6.12
Platform: Ubuntu 16.04
Compiler: gcc 5.4.0

You have to explicitly call Reset to reuse a TBufferFile.

Can anyone please help me work around these limitations?

In your context, I would do the following:

vector<map<TString, map<TString, double>>> veryLargeContainer;
/// fill the container;
map<TString, map<TString, double>> *element = nullptr;

auto f = TFile::Open(filename, "RECREATE");
auto t = new TTree("largedata", "vector split vertically across entries");
t->Branch("element.", &element);
for(auto &content : veryLargeContainer) {
    element = &content; // or iterate over index and use = &(veryLargeContainer[index])
    t->Fill();
}
f->Write();

and reading

vector<map<TString, map<TString, double>>> veryLargeContainer;
map<TString, map<TString, double>> *element = nullptr;
t->SetBranchAddress("element.", &element);
for(Long64_t e = 0; e < t->GetEntriesFast(); ++e) {
    t->GetEntry(e);
    veryLargeContainer.emplace_back(*element); // copy the map read for this entry
}

or

vector<map<TString, map<TString, double>>> veryLargeContainer;
veryLargeContainer.resize(t->GetEntries());
map<TString, map<TString, double>> *element = nullptr;
t->SetBranchAddress("element.", &element);
for(Long64_t e = 0; e < t->GetEntriesFast(); ++e) {
    element = &(veryLargeContainer[e]);
    t->GetEntry(e);
}

Philippe, thank you for your reply,

Will that work if any single map object is larger than 1 GB? Also, why doesn’t the vector split correctly when I add it to the TTree?

why doesn’t the vector split correctly when I add it to the TTree

What do you mean by split correctly?

The splitting that I sketched above (horizontal splitting) is very unusual and has no automatic implementation.

The regular (vertical) splitting is not able to split a collection nested inside another collection. So indeed the original vector, since it contains a map, is not split.

In my proposal the branch now contains map<TString, map<TString, double>>, and this will be split into 2 branches: the key and the value. Since the value is itself a collection within a collection, it will not be split any further.

Will that work if any single map object is larger than 1 GB?

Indeed, if one of the map<TString, double> is larger than 1 GB, the code above will still fail (but I still have a few ideas on how to decompose it a bit further :)).

Out of curiosity, what is the type of data (semantically speaking) you are storing? Is the vector of map of map the optimal data structure to use and store this data? What is the data size (per map, per vector and in total) that you expect?

why doesn’t the vector split correctly when I add it to the TTree

What do you mean by split correctly?

I am alluding to the errors I saw when I tried to t->Print() the tree.

Out of curiosity, what is the type of data (semantically speaking) you are storing? Is the vector of map of map the optimal data structure to use and store this data? What is the data size (per map, per vector and in total) that you expect?

The outer vector represents time steps in a (nuclear reactor) fuel depletion calculation. For each time step, the outer map associates the name of the materials to their compositions. Each composition is represented by a map associating isotope names to concentrations. In reality, the inner maps are classes that contain the map<TString, double>, as well as other things.

Typical sizes would be as follows:

  • vector: a few hundred elements
  • outer map: up to a few tens of thousands of elements. The keys are typically <50 chars long.
  • inner map: up to a few hundred elements. The keys are typically <10 chars long.

According to my calculation, the typical size of the outer map is a few GB (counting the actual payload only).

I could save a bit of memory by interning the isotope names and using indices for the mappings, but I think that would save a factor of 2 at most.
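For readers unfamiliar with the term, interning here means storing each distinct name once and referring to it by a small integer. The following is only a minimal sketch of that idea in plain C++, using std::string in place of TString; the names NameTable and Composition are hypothetical and do not come from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: maps each distinct name to a small integer id and back.
class NameTable {
public:
    // Return the id of `name`, registering the name on first use.
    std::uint32_t intern(const std::string &name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const auto id = static_cast<std::uint32_t>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Reverse lookup: id -> original name.
    const std::string &name(std::uint32_t id) const {
        assert(id < names_.size());
        return names_[id];
    }

    std::size_t size() const { return names_.size(); }

private:
    std::vector<std::string> names_;                      // id -> name
    std::unordered_map<std::string, std::uint32_t> ids_;  // name -> id
};

// A composition stored as (isotope id, concentration) pairs
// instead of a map keyed by the isotope name itself.
using Composition = std::vector<std::pair<std::uint32_t, double>>;
```

Each isotope name is then stored once in the table, and every composition carries a 4-byte id per entry instead of a string. The table has to be written next to the data so that the ids can be translated back on reading.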

With those numbers (for example 30,000 * 5,000 * 10 = 1.5e9 bytes) you are indeed still in the range of 1 GB (about 1.4 GiB in the example). Saving a factor of 2 would actually make a difference (the example would then fit).

An idea is to split horizontally a bit more, but the code is becoming a bit complex:

vector<map<TString, map<TString, double>>> veryLargeContainer;
/// fill the container;
map<TString, map<TString, double>> *mid_element = nullptr;
map<TString, double> *value = nullptr;
long index = 0;
TString *key = nullptr;

auto f = TFile::Open(filename, "RECREATE");
auto t = new TTree("largedata", "vector split vertically across entries");
t->Branch("index", &index);
t->Branch("key", &key);
t->Branch("value.", &value);
for(index = 0; index < veryLargeContainer.size(); ++index) {
    auto &midmap = veryLargeContainer[index];
    for(auto &content : midmap) {
        key = const_cast<TString*>(&content.first); // map keys are const; the branch only reads the key
        value = &content.second;
        t->Fill();
    }
}
f->Write();

and reading

vector<map<TString, map<TString, double>>> veryLargeContainer;
long index;
TString *key = nullptr;
map<TString, double> *value = nullptr;

t->SetBranchAddress("index", &index);
t->SetBranchAddress("key", &key);
t->SetBranchAddress("value.", &value);

for(Long64_t e = 0; e < t->GetEntriesFast(); ++e) {
   t->GetEntry(e);
   if (index >= veryLargeContainer.size())
       veryLargeContainer.resize(index+1);
   auto &midmap = veryLargeContainer[index];
   midmap.insert( {*key, *value} ); // or midmap[*key] = *value;
}

In this case, the biggest element becomes “just” the inner map that will for sure fit.

Cheers,
Philippe.

The outer vector represents time steps

In this case, you could also fill the maps directly into the TTree without putting them in a vector first, i.e. fill the TTree right after each time step has been calculated.

Just a simple suggestion: wouldn’t hashing each TString to a long long help to flatten out your map? Then you could simply store a TTree with long, long and double branches, and read the TTree back with a simple reverse hashing of the strings.

@pcanal thanks for your feedback. I was hoping there would be a way to coax ROOT into serializing my object without doing the splitting myself. In my real application, the classes are much more complex than the vector<map<TString, map<TString, double>>> I used for the minimal reproducer. They contain a lot of other members. Even though most of the memory is taken up by the maps, I still need to serialize the rest of the members. Also, the 1 GB limit precludes using the classical Streamer() serialization mechanism not only for the large object itself, but also for any other object that either contains the large object or references it via a smart pointer, for example.

@RENATO_QUAGLIANI I’m not sure what you mean by reverse hashing. I do not know in advance the TStrings that can appear in my maps. If you mean storing the TStrings in a vector alongside the container and storing the indices in the container, then yes, this is what I meant by “interning” up here.
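To make the flattening idea concrete, here is a rough sketch under some assumptions: std::string stands in for TString, an explicit lookup table replaces the "reverse hashing" (a hash alone cannot be inverted), and Row, Flat, flatten and unflatten are hypothetical names. Each Row would correspond to one TTree entry with simple integer and double branches; the ROOT I/O itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Nested = std::vector<std::map<std::string, std::map<std::string, double>>>;

// One row per (time step, material, isotope) triple.
struct Row {
    std::int64_t step;
    std::int64_t material;  // index into Flat::names
    std::int64_t isotope;   // index into Flat::names
    double value;
};

struct Flat {
    std::vector<std::string> names;  // lookup table, must be stored with the rows
    std::vector<Row> rows;
};

// Turn the nested container into flat rows plus a name table.
Flat flatten(const Nested &data) {
    Flat out;
    std::map<std::string, std::int64_t> ids;
    auto idOf = [&](const std::string &s) {
        auto it = ids.find(s);
        if (it != ids.end()) return it->second;
        const auto id = static_cast<std::int64_t>(out.names.size());
        out.names.push_back(s);
        ids.emplace(s, id);
        return id;
    };
    for (std::size_t step = 0; step < data.size(); ++step)
        for (const auto &mat : data[step])
            for (const auto &iso : mat.second)
                out.rows.push_back({static_cast<std::int64_t>(step),
                                    idOf(mat.first), idOf(iso.first), iso.second});
    return out;
}

// Rebuild the nested container. Empty time steps at the end and
// materials with an empty composition produce no rows and are lost.
Nested unflatten(const Flat &flat) {
    Nested data;
    for (const auto &r : flat.rows) {
        const auto step = static_cast<std::size_t>(r.step);
        assert(static_cast<std::size_t>(r.material) < flat.names.size());
        assert(static_cast<std::size_t>(r.isotope) < flat.names.size());
        if (step >= data.size()) data.resize(step + 1);
        data[step][flat.names[r.material]][flat.names[r.isotope]] = r.value;
    }
    return data;
}
```

With this layout no single entry grows with the size of the maps, at the price of one row per concentration value and a separate name table. It does not address the other class members mentioned above, which would still need their own storage.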

@arekfu RNTuple (a successor to TTree) is able to split collection of collections so it might do a much better job here. I am also working on lifting the 1GB limit but this is slow going.
