Thread-safe way to write to TFile (1 file per thread)

danj1011 · January 14, 2023, 7:52am

I have a huge input TTree. I open it with an RDataFrame and, because the TTree has an unusual structure (each branch is a vector, and some branches hold indices into other branches), I process each entry with ForeachSlot, where my custom function handles that structure. This way I still profit from the implicit multi-threading.

I do, though, want to write out a ‘flat’ TTree with the usual structure (each branch is a simple type, not a vector). I tried giving each thread a separate TFile/TTree to write to, but I keep on encountering race conditions where, I assume, the threads are overlapping in their calls to:

file[i_thread].cd()
tree[i_thread].Fill()

and I’m not sure how to make this writing process thread-safe. My alternative has been simply to avoid TFile writing altogether and instead each thread now writes to a CSV and I convert to ROOT in a separate program. But this is a pain because I lose all ROOT’s nice compression features and introduce an extra ‘conversion’ step in my workflow.

What can I do?

ROOT Version: 6.26/10
Platform: Arch
Compiler: Not Provided

bellenot · January 15, 2023, 11:58am

See Multi-threading - ROOT

danj1011 · January 16, 2023, 10:39am

Hi @bellenot , thank you! I read this some time ago.

TTree and TFile are listed there as being “conditionally thread safe”. If I understand the text, that means that separate threads should be able to write to their own dedicated TTrees in their own TFiles.

I would therefore do the following on a machine with 10 threads:

1. prepare a vector of 10 TFiles
2. prepare a vector of 10 TTrees, with 10 sets of branch variables to assign
3. allow each thread to set the values of its own branch variables and fill its own TTree. Each thread would call TFile::cd(), do its calculations, set its branch variables, and then call TTree::Write() for its own TTree/TFile.
4. at the end of execution, close the 10 TFiles.

Would this be respecting what is meant by “conditional thread safety”? This is the configuration I had, which led to crashes.
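The layout described above, one output object per thread and nothing shared between threads, can be sketched without ROOT as follows. This is only an illustration of the pattern under stated assumptions: `Row` and `process_per_slot` are hypothetical names, a plain `std::vector` stands in for each thread's TTree, and the sketch says nothing about what ROOT itself does internally.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one flat output row; a real job would set
// branch variables and fill a TTree instead.
struct Row {
    int entry;
    double value;
};

// Process entries [0, n_entries) with n_slots threads. Each thread appends
// only to outputs[slot], so no output object is shared between threads.
std::vector<std::vector<Row>> process_per_slot(int n_entries, unsigned n_slots) {
    std::vector<std::vector<Row>> outputs(n_slots);  // sized once, never resized
    std::vector<std::thread> workers;
    for (unsigned slot = 0; slot < n_slots; ++slot) {
        workers.emplace_back([&outputs, slot, n_entries, n_slots] {
            // Static striding: this slot handles entries slot, slot + n_slots, ...
            for (int entry = static_cast<int>(slot); entry < n_entries;
                 entry += static_cast<int>(n_slots)) {
                outputs[slot].push_back({entry, 0.5 * entry});
            }
        });
    }
    for (auto &worker : workers) {
        worker.join();
    }
    return outputs;
}
```

In an RDataFrame job, the slot number that ForeachSlot passes as the first argument of the callback would play the role of `slot` here.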

bellenot · January 16, 2023, 10:42am

In principle, I think that should work, but maybe @vpadulan has more useful feedback

danj1011 · January 16, 2023, 10:43am

OK, good. This didn’t work (or at least I may have implemented it wrongly). I’ll wait to hear from @vpadulan and if they too think this should work then I’ll try to make a reproducer.

eguiraud · January 16, 2023, 4:20pm

Hi @danj1011 ,

I guess you are calling ROOT::EnableImplicitMT() or ROOT::EnableThreadSafety() before the multi-threaded part starts, right? If so, it would be useful to have a minimal reproducer in order to debug.

The other thing I can suggest is to use a sequence of Defines and Redefines to manipulate the data, followed by a Snapshot, instead of doing everything in a custom ForeachSlot: this way it’s RDF that takes care of the multi-threaded writing.

Cheers,
Enrico

danj1011 · January 16, 2023, 4:41pm

Yes (to the implicitMT question) I’m doing that. I’ll try to set up a reproducer.

Defines/Snapshot would be my default way to deal with this, but in my particular ntuple every branch holds one vector per event, the vectors have different lengths, and I need to extract indices stored in one vector and use them to access the i’th element of another vector. I got the impression that this would be beyond the capability of a Define string. Does that sound correct?

eguiraud · January 16, 2023, 5:37pm

You can do pretty much anything in a Define. At a certain level of complexity putting everything into an inline string becomes awkward but you can always write a small C++ function (ideally in its own C++ source file that you pre-compile and load in PyROOT, but you can also do everything from Python) and call that.
Here is an example: ROOT: tutorials/dataframe/df106_HiggsToFourLeptons.py File Reference
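As a rough illustration of the kind of small function meant here, the index lookup described earlier in the thread (indices stored in one vector, used to pick elements of another) might look like the sketch below. The name `pick_by_index` and the element types are assumptions made for the example, and `std::vector` is used only to keep it self-contained.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: for each index stored in `indices`, pick the matching
// element of `values`. Out-of-range indices are skipped rather than trusted.
std::vector<float> pick_by_index(const std::vector<int> &indices,
                                 const std::vector<float> &values) {
    std::vector<float> picked;
    picked.reserve(indices.size());
    for (int idx : indices) {
        if (idx >= 0 && static_cast<std::size_t>(idx) < values.size()) {
            picked.push_back(values[static_cast<std::size_t>(idx)]);
        }
    }
    return picked;
}
```

A function like this could then be handed to a Define together with the names of the two input columns, and the resulting column written out with Snapshot. The argument types would have to match how the columns are actually read; in jitted Define strings, for instance, vector branches are seen as ROOT::RVec, so the signature would need adapting.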

Cheers,
Enrico
