I’m experimenting with TBufferMerger, and one thing I noticed is that when used with a single TBufferMergerFile it behaves quite differently from a TFile. With a TFile the content of my tree is dumped to file regularly, and I see the file size grow while memory consumption remains more or less steady. With the TBufferMergerFile it’s the opposite: the file size remains ~0 while memory usage grows steadily.
I guess that TBufferMergerFile tries to keep as much data as possible in memory before flushing it to file, is that correct? If so, I have two questions:
When is data flushed to disk? E.g. after reaching a maximum memory usage, at the end of processing…
Is it possible to set a maximum memory usage which, when exceeded, triggers the flush to disk? I tried with bufferMerger->SetAutoSave(32000) but I don’t see any difference with respect to the default.
ROOT Version: 6.26/04
Platform: ArchLinux
Compiler: Current: GCC 14.1.1, Used to compile ROOT: GCC 12
Then, TBufferMerger, when calling Merge, accumulates the info from the different TMemFiles and writes it to disk (when it sees fit, or when the destructor is called).
According to the documentation for the master version, TBufferMerger has no public Write or Merge methods. Do I really need enough RAM to hold all my data in order to write to a TFile from multiple threads?
Thanks for the suggestion, but I doubt that requesting a feature would lead to something usable in a decent time frame. TBufferMerger has been around for a long time (since ROOT 6.10, if I’m correct), and if the feature I need is still missing then it’s probably because ~nobody needs it, and in my experience the ROOT dev team focuses its limited manpower on features needed by several people or large communities.
Anyway, I looked at the code of TBufferMerger and it looks quite simple, the bulk of the work being done by TFileMerger. I guess I can write a similar merger class that regularly merges the trees of the attached TMemFiles, dumps the merged content to the output file on disk, resets the trees by e.g. calling TTree::ResetAfterMerge on all of them, and then repeats. Could this work? Any advice from the experts?
Hi @Nicola_Mori, going back to your original questions:
As mentioned, TBufferMergerFile inherits from TMemFile. It will only flush its contents on Write(), which also triggers the merging in TBufferMerger, resets the tree and releases accumulated memory.
There is no automatic option to specify a maximum memory usage. The best way to achieve this is to call Write() manually “as necessary”. This can be quite tricky to get right (calling Write() and merging too often is costly, but calling it too rarely lets memory explode); I would recommend starting off with the heuristic RDataFrame uses for parallel Snapshot()s, something along the lines of:
// This is similar to the RDF code in SnapshotHelperMT, except that it
// replaces the condition entries % autoFlush == 0 with the more general
// check entries >= autoFlush.
auto entries = tree->GetEntries();
auto autoFlush = tree->GetAutoFlush();
if (autoFlush > 0 && entries >= autoFlush) {
   file->Write();
}
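To make the effect of this check easier to see, here is a minimal ROOT-free sketch of the same decision. ShouldWrite and CountWrites are hypothetical helpers, not part of ROOT; the plain counter stands in for tree->GetEntries(), and resetting it to zero models the tree being reset after Write(), as described above. It also assumes that a non-positive GetAutoFlush() value means no entry-based interval is available yet.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ROOT: decides whether a worker should call
// Write() on its TBufferMergerFile, given the tree's current entry count and
// the value reported by TTree::GetAutoFlush().
bool ShouldWrite(std::int64_t entries, std::int64_t autoFlush)
{
   // Assumption: autoFlush <= 0 means no entry-based interval is known yet,
   // in which case the heuristic never triggers.
   return autoFlush > 0 && entries >= autoFlush;
}

// ROOT-free stand-in for a fill loop: counts how many Write() calls the
// heuristic would issue while filling totalEntries entries.
int CountWrites(std::int64_t totalEntries, std::int64_t autoFlush)
{
   std::int64_t entries = 0; // models tree->GetEntries()
   int writes = 0;
   for (std::int64_t i = 0; i < totalEntries; ++i) {
      ++entries; // models tree->Fill()
      if (ShouldWrite(entries, autoFlush)) {
         ++writes;    // models file->Write()
         entries = 0; // models the tree being reset after the merge
      }
   }
   return writes;
}
```

With 10000 entries and an interval of 3000, this issues three writes and leaves 1000 entries buffered, so a final Write() at the end of processing is still needed to flush the remainder.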
@hahnjo thanks for the tip, it works quite well. But according to my tests the memory usage does not drop when calling TMemFile::Write after accumulating N events, it just stops increasing. It looks like the memory buffers of TMemFile are reset but not released, so that they can be reused for subsequent events without allocating more, but also without actually freeing memory. Does this sound reasonable to you?
Hi @Nicola_Mori, looking into the implementation of TMemFile::ResetAfterMerge (which is called internally), the blocks of allocated memory are indeed kept alive and later reused. Based on that, memory usage levelling off at a maximum sounds reasonable and expected.
Just to confirm, is this a problem in your use case where you would need to free the allocated memory? In the worst case, you could release the old TBufferMergerFile, which should really free the allocated memory, and then create a new one.
Hi @hahnjo, I’d need at least a mechanism to release the memory after Write if the system experiences RAM pressure. Replacing the old TBufferMergerFile with a fresh one would currently be quite cumbersome in my API: files are owned by a central entity and provided to consumers on demand, so I lack a mechanism to “replace” a deleted file with the new one in the consumer code. I think I can work around this, but if there were a mechanism to free the TBufferMergerFile buffers without having to destroy it (e.g. a parameter for the Write method), that would be much preferable. Is there any way to accomplish this?
Right, that was the one use case I could think of where this might be helpful. I’ll let @pcanal comment on the internals of TBufferMergerFile and TMemFile in particular, I’m not sure if we can optionally release the memory.
For your API, how are the files provided to the consumers? If they are still held in a shared_ptr, you could try reset()ing the owned pointer in the client code as well…
Unfortunately it’s legacy code and the file is passed over as a raw pointer. I’ll see how to modify the API, but hopefully Philippe will come up with a better solution.
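One possible way to keep a raw-pointer API while still being able to swap the file is to hand consumers a pointer to a stable handle rather than to the file itself. The following is only a ROOT-free sketch under stated assumptions: FakeMemFile and FileSlot are hypothetical names, FakeMemFile merely imitates the "buffers are reused, not freed" behaviour reported above, and in real code the factory would presumably wrap TBufferMerger::GetFile(), with the tree attached to the old file recreated as well.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a TBufferMergerFile: its buffer keeps its
// capacity after Write(), imitating the behaviour reported in this thread.
struct FakeMemFile {
   std::vector<char> buffer;
   void Fill(std::size_t nBytes) { buffer.resize(buffer.size() + nBytes); }
   void Write() { buffer.clear(); } // clear() leaves the capacity allocated
};

// Hypothetical stable handle owned by the central entity. Consumers keep a
// raw FileSlot* that stays valid while the file behind it is replaced.
class FileSlot {
public:
   using Factory = std::function<std::shared_ptr<FakeMemFile>()>;

   explicit FileSlot(Factory factory)
      : fFactory(std::move(factory)), fFile(fFactory())
   {
   }

   // Consumers reach the current file through the slot.
   FakeMemFile *operator->() { return fFile.get(); }

   // Flush the current file, drop it (freeing its buffers once the last
   // owner is gone), and install a fresh one from the factory.
   void Recycle()
   {
      fFile->Write();
      fFile = fFactory();
   }

private:
   Factory fFactory; // declared first so it is initialised before fFile
   std::shared_ptr<FakeMemFile> fFile;
};
```

A consumer would call (*slot)->Fill(...) through its FileSlot pointer and must not cache the inner file pointer across a Recycle(). The sketch is not thread-safe; it assumes each slot is used by a single thread at a time.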