I am playing around with ROOT I/O again and have built some scripts that read remote ROOT files. The task consists of small random reads, which of course have very high latency on remote files; I can live with that. To offset the latency, I want to keep many requests in flight at once using many parallel threads, and this works very well: I use
multiprocessing.pool.ThreadPool (a real thread pool, not the usual multiprocessing process pool), set
metree.GetEntry._threaded = True, and everything works pretty well.
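To make the setup concrete, here is a minimal sketch of the latency-hiding pattern I mean, with the ROOT calls replaced by a stdlib stand-in (the `fake_remote_read` function and its sleep time are placeholders, not ROOT API):

```python
from multiprocessing.pool import ThreadPool
import time

def fake_remote_read(entry):
    # Stand-in for tree.GetEntry(entry): a high-latency remote read that
    # releases the GIL while waiting (as GetEntry does once
    # GetEntry._threaded = True is set).
    time.sleep(0.1)
    return entry * 2

start = time.monotonic()
with ThreadPool(8) as pool:
    results = pool.map(fake_remote_read, range(8))
elapsed = time.monotonic() - start

# All 8 reads of 0.1 s each overlap, so wall time is ~0.1 s, not ~0.8 s.
```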
Except there is one problem: to read a file from hundreds of threads, I open the file hundreds of times (I assume sharing the same
TFile across threads is a bad idea), but each open takes multiple seconds. So I tried setting
ROOT.TFile.Open._threaded = True, and it does seem to work, though ROOT really does not like Open being called from multiple threads at the same time. With a Python-side lock around the call I can at least release the GIL and overlap other operations, but opening hundreds of files sequentially is still a bottleneck.
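For illustration, this is the lock pattern I am using, again with stdlib stand-ins for the ROOT calls (`open_file`, `read_entries`, the URL, and the sleep times are all placeholders). It shows why the serialized opens remain the bottleneck even though the reads themselves overlap:

```python
import threading
import time
from multiprocessing.pool import ThreadPool

open_lock = threading.Lock()

def open_file(url):
    # Stand-in for ROOT.TFile.Open(url): slow, and since ROOT dislikes
    # concurrent opens, a Python-side lock serializes the calls.
    with open_lock:
        time.sleep(0.05)  # the multi-second open, compressed for the sketch
        return {"url": url}

def read_entries(url):
    f = open_file(url)    # serialized: this is the bottleneck
    time.sleep(0.05)      # overlapped: the actual reads run in parallel
    return f["url"]

start = time.monotonic()
with ThreadPool(4) as pool:
    out = pool.map(read_entries, ["root://host/file.root"] * 4)
elapsed = time.monotonic() - start

# The four opens serialize (~0.2 s total), but each thread's reads overlap
# with the later opens, so wall time is ~0.25 s rather than ~0.4 s.
```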
Is there a better way to parallelize the waiting time when opening remote files? I could switch to a process pool, but that seems quite a bit more painful than a thread pool.
Any hints are appreciated.
ROOT Version: JupyROOT 6.18/04
Platform: CC7, CMSSW_11_1_PY3 environment
Compiler: g++ (GCC) 8.3.1 20190225