I’m trying to fill a ROOT histogram using multiprocessing in Python, but the resulting ROOT file contains an empty histogram. The code runs without errors, but no data is actually filled into the histogram.
I wrote this small piece of code to illustrate the problem:
import ROOT as r
from numpy.random import normal, seed
from multiprocessing import Pool, cpu_count

seed(42)

def fill_histogram(histogram, numbers):
    for number in numbers:
        histogram.Fill(number)

if __name__ == "__main__":
    r.EnableThreadSafety()
    Histo = r.TH1F("h1", "h1", 100, 0, 100)
    numbers = [normal(50, 20) for _ in range(1000)]
    number_batches = [numbers[i:i + 100] for i in range(0, len(numbers), 100)]
    pool = Pool(processes=int(cpu_count()))
    for i, batch in enumerate(number_batches):
        pool.apply_async(fill_histogram, args=(Histo, batch,))
    pool.close()
    pool.join()
    output_file = r.TFile.Open("histo_test.root", "RECREATE")
    Histo.Write()
    output_file.Close()
What’s the correct approach? I suspect the issue might be related to how ROOT objects are passed between processes or thread safety, but I’m not sure how to resolve it.
Any guidance on the proper way to parallelize histogram filling with ROOT would be greatly appreciated.
ROOT Version: 6.32.08
Python Version: 3.12.3
Hello @destrada, welcome to the ROOT Forum!
First off, do you specifically need multiprocessing, or would multithreading also be fine? (Judging from the code you posted, it seems it would.)
The easiest way you can do what you want is probably by using RDataFrame together with EnableImplicitMT.
This way you can write something like this:
import ROOT

ROOT.EnableImplicitMT()
df = ROOT.RDataFrame(1000)
with ROOT.TFile.Open("histo_test.root", "RECREATE") as output_file:
    h = df.Define("numbers", "gRandom->Gaus(50, 20)").Histo1D(("h1", "h1", 100, 0, 100), "numbers").GetValue()
    output_file.WriteObject(h, h.GetName())
Note that if your real input comes from some other place (e.g. a TTree) you would need to make some minor adjustments to the RDataFrame creation (see the tutorials).
Also note that RDF currently doesn’t support calling Python functions directly, so I replaced normal with the ROOT C++ function TRandom::Gaus, but it should give the same result.
Let me know if this works for you.
Hi @silverweed, thanks a lot for your answer.
I’m using Python’s multiprocessing simply because it’s the way I know how to do it. It’s not strictly necessary, but it allows me to parallelize the task in a straightforward way.
I can see that using RDataFrame with EnableImplicitMT would be much more efficient. However, I’m currently working on top of a small and old framework where the analysis is applied recursively over a TTree, entry by entry. Switching to RDataFrame would require significant changes to the codebase, which isn’t feasible at the moment.
That said, I found a workaround: by splitting the histogram into smaller clones, passing them to separate multiprocessing workers, and then merging them into a single .root file, the issue is resolved.
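The pattern behind that workaround can be sketched roughly as follows. This is only an illustration, not the actual framework code: plain Python lists of bin counts stand in for the TH1F clones so that the sketch runs without ROOT, and the helper names fill_counts and merge_counts are hypothetical. Each multiprocessing worker operates on its own copy of whatever is passed to it, so the parent process only sees what a worker returns; that is why every worker fills its own partial histogram and the parent merges them afterwards.

```python
import random
from multiprocessing import Pool

# Same binning as the TH1F in the question: 100 bins on [0, 100).
N_BINS, LOW, HIGH = 100, 0.0, 100.0


def fill_counts(numbers):
    """Fill a fresh, worker-local histogram (stand-in for one TH1F clone)."""
    counts = [0] * N_BINS
    width = (HIGH - LOW) / N_BINS
    for x in numbers:
        # Unlike TH1, this sketch has no under/overflow bins:
        # out-of-range values are simply dropped.
        if LOW <= x < HIGH:
            counts[int((x - LOW) / width)] += 1
    return counts


def merge_counts(partials):
    """Sum the partial histograms bin by bin (the role TH1::Add plays in ROOT)."""
    return [sum(bins) for bins in zip(*partials)]


if __name__ == "__main__":
    random.seed(42)
    numbers = [random.gauss(50, 20) for _ in range(1000)]
    batches = [numbers[i:i + 100] for i in range(0, len(numbers), 100)]

    with Pool() as pool:
        # map() sends each batch to a worker and collects the returned partials.
        # Anything a worker mutates without returning it is lost to the parent.
        partials = pool.map(fill_counts, batches)

    merged = merge_counts(partials)
    print(sum(merged), "entries in range")
```

With real ROOT histograms, the same structure would presumably mean creating or cloning one histogram per worker and combining the results in the parent, for example with TH1::Add, before writing the merged histogram to the file.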
This solution works for now, but I still wonder—would it be possible to avoid creating clones of the histogram altogether?