Thread-safe way to write to TFile (1 file per thread)

I have a huge input TTree that I open with an RDataFrame. The TTree has an unusual structure: each branch is a vector, and some branches hold indices into other branches. I therefore process each entry with ForeachSlot, where my custom function handles this structure, and I profit from the implicit multi-threading this way.

I do, though, want to write out a ‘flat’ TTree with the usual structure (each branch is a simple type, not a vector). I tried giving each thread a separate TFile/TTree to write to, but I keep on encountering race conditions where, I assume, the threads are overlapping in their calls to:

file[i_thread].cd()
tree[i_thread].Fill()

and I’m not sure how to make this writing process thread-safe. My alternative has been simply to avoid TFile writing altogether and instead each thread now writes to a CSV and I convert to ROOT in a separate program. But this is a pain because I lose all ROOT’s nice compression features and introduce an extra ‘conversion’ step in my workflow.

What can I do?


ROOT Version: 6.26/10
Platform: Arch
Compiler: Not Provided


See Multi-threading - ROOT

Hi @bellenot , thank you! I read this some time ago.

TTree and TFile are listed there as being “conditionally thread safe”. If I understand the text, that means that separate threads should be able to write to their own dedicated TTrees in their own TFiles.

I would therefore do the following on a machine with 10 threads:

  • prepare a vector of 10 TFiles
  • prepare a vector of 10 TTrees, with 10 sets of branch variables to assign
  • allow each thread to set the values of its own branch variables and fill its own TTree. Each thread would call TFile::cd(), do its calculations, set its branch variables, and then call TTree::Write() for its own TTree/TFile.
  • at the end of execution, close the 10 TFiles.

Would this be respecting what is meant by “conditional thread safety”? This is the configuration I had, which led to crashes.
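For concreteness, the per-thread layout described above has roughly the following shape. This is a minimal sketch in standard C++ only, not ROOT code: `SlotWriter` is a hypothetical stand-in for one TFile/TTree pair, `ProcessPerSlot` is an invented name, and the fixed striding of entries over slots is an assumption made to keep the example short (RDataFrame assigns entries to slots itself). It does not model any ROOT-specific behaviour, so it cannot confirm or rule out the cause of the crashes.

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one TFile/TTree pair owned by a single slot.
struct SlotWriter {
    std::vector<std::string> rows;  // plays the role of the filled entries
    void Fill(int entry) { rows.push_back(std::to_string(entry)); }
};

// Process entries [0, nEntries) on nSlots threads.
// Slot s only ever touches writers[s], so no locking is needed.
std::vector<SlotWriter> ProcessPerSlot(unsigned nSlots, int nEntries)
{
    // Sized once, before any thread starts, so the vector is never resized
    // while threads hold references into it.
    std::vector<SlotWriter> writers(nSlots);
    std::vector<std::thread> pool;
    for (unsigned slot = 0; slot < nSlots; ++slot) {
        pool.emplace_back([&writers, slot, nSlots, nEntries] {
            // Assumed static striding: entry e is handled by slot e % nSlots.
            for (int e = static_cast<int>(slot); e < nEntries;
                 e += static_cast<int>(nSlots))
                writers[slot].Fill(e);
        });
    }
    for (auto& t : pool) t.join();  // all writers are complete after the joins
    return writers;
}
```

The point of the pattern is that every piece of mutable output state is indexed by the slot number, and nothing is shared between slots after setup.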

In principle, I think that should work, but maybe @vpadulan has more useful feedback

OK, good. This didn’t work (or at least I may have implemented it wrongly). I’ll wait to hear from @vpadulan and if they too think this should work then I’ll try to make a reproducer.

Hi @danj1011 ,

I guess you are calling ROOT::EnableImplicitMT() or ROOT::EnableThreadSafety() before the multi-threaded part starts, right? If so, it would be useful to have a minimal reproducer in order to debug.

The other thing I can suggest is to use a sequence of Define and Redefine calls to manipulate the data and then a Snapshot, instead of doing everything in a custom ForeachSlot: this way it’s RDF that takes care of the multi-threaded writing.

Cheers,
Enrico

Yes (to the implicitMT question) I’m doing that. I’ll try to set up a reproducer.

Define/Snapshot would be my default way to deal with this, but my particular ntuple stores a vector per event in each branch, the vectors have different lengths, and I need to extract indices stored in one vector and use them to access the corresponding elements of another vector. I got the impression that this would be beyond the capability of a Define string. Does that sound correct?

You can do pretty much anything in a Define. At a certain level of complexity putting everything into an inline string becomes awkward but you can always write a small C++ function (ideally in its own C++ source file that you pre-compile and load in PyROOT, but you can also do everything from Python) and call that.
Here is an example: ROOT: tutorials/dataframe/df106_HiggsToFourLeptons.py File Reference

Cheers,
Enrico
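As an illustration of the kind of small C++ function meant here, the index lookup described in the question could look roughly like the sketch below. The name `GatherByIndex` and its signature are hypothetical, and it uses `std::vector` so that it compiles without ROOT. In an actual RDataFrame the vector columns are usually read as `ROOT::RVec`, so the parameter types would need adapting; ROOT’s VecOps also provides a `Take` helper that may already cover this case.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: for each i in `indices`, pick values[i].
// Throws std::out_of_range if an index does not point into `values`.
template <typename T>
std::vector<T> GatherByIndex(const std::vector<T>& values,
                             const std::vector<int>& indices)
{
    std::vector<T> out;
    out.reserve(indices.size());
    for (int i : indices) {
        if (i < 0 || static_cast<std::size_t>(i) >= values.size())
            throw std::out_of_range("GatherByIndex: index outside values");
        out.push_back(values[static_cast<std::size_t>(i)]);
    }
    return out;
}
```

A function of this shape takes two columns of the same event and returns a new vector, which is the form a Define expects; the output has the length of the index vector, not of the value vector.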
