I’m running my analysis using ROOT.ROOT.EnableImplicitMT(numthreads=32).
It works fine on my local machine during tests, but on our production server ROOT sometimes does not pick up the configured number of threads. This happens at random: ROOT just falls back to 1 thread (100% CPU utilization, while ~3150% is expected).
However, ROOT.ROOT.GetImplicitMTPoolSize() always shows the correct size of the pool used for implicit multithreading (in my case 32).
What could be a reason?
We use an external data storage system. Could it be that ROOT tests the file-system communication and falls back to a single thread when performance is low?
It depends on what you are doing. If you are running Python code, then the GIL will effectively prevent multithreading in your code. If you are just calling C++, then we need to know what you are running, or need a simple form of your script that we can run to check what's going on. Note that ROOT also has an internal global lock for accesses to the type system that may or may not affect the parallelism of your code. Cheers,
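To illustrate the GIL point in isolation (this is a minimal pure-Python sketch, not related to ROOT itself): CPU-bound Python code does not get faster by adding threads, because only one thread can execute Python bytecode at a time.

```python
import threading
import time

def busy(n):
    # CPU-bound loop; holds the GIL for its entire duration
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

# Run the work four times serially
t0 = time.perf_counter()
for _ in range(4):
    busy(N)
serial = time.perf_counter() - t0

# Run the same work in four threads
t0 = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

print(f"serial: {serial:.2f}s, 4 threads: {threaded:.2f}s")
# On CPython the threaded run is typically no faster than the serial one.
```

This is why ROOT's implicit multithreading only helps when the hot loop runs in C++ (as it does inside RDataFrame), not in per-event Python callbacks.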
I performed a set of time measurement runs on my local machine with a local SSD drive:
No EnableImplicitMT(): as reference
Running 0 ROOT threads
real 0m5.897s
user 0m4.940s
sys 0m0.405s
EnableImplicitMT(): and this really surprises me
Running 4 ROOT threads
real 0m16.820s
user 0m14.787s
sys 0m21.859s
However, at this step I observe roughly 250% CPU utilization.
Our TTrees don't use default settings; they were tuned for the GRID: CacheSize = 0, AutoSave = 500, AutoFlush = 500.
Converting the given ROOT file to a ROOT file with default settings doesn't help: RDF still runs 4 times longer with EnableImplicitMT().
How many clusters are there in your file? RDataFrame uses one task per cluster in many cases, so if you have many clusters, it can create a lot of scheduling overhead when running in multithreaded mode.
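You can check the cluster count directly with `TTree::GetClusterIterator`. A sketch in PyROOT (the file name `file.root` and tree name `tree` are placeholders for your own):

```python
import ROOT

# Hypothetical file/tree names; substitute your own
f = ROOT.TFile.Open("file.root")
tree = f.Get("tree")

n_entries = tree.GetEntries()

# Iterate over cluster boundaries, starting from entry 0
it = tree.GetClusterIterator(0)
n_clusters = 0
start = it.Next()
while start < n_entries:
    n_clusters += 1
    start = it.Next()

print(f"{n_entries} entries in {n_clusters} clusters "
      f"(~{n_entries / max(n_clusters, 1):.0f} entries per cluster)")
```

With AutoFlush = 500, a file with a few million entries would contain thousands of clusters, i.e. thousands of very small tasks.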
The easiest thing to do is probably to just make a snapshot of the whole file with RDataFrame. You can use the snapshot options to reduce the number of clusters in the output via the fAutoFlush setting, but you probably don't need to modify it. If your file is not too big, I also recommend using a faster compression algorithm like LZ4, or even leaving the file uncompressed if you can, since that will speed up looping over it by a lot. Cheers,
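A sketch of such a rewrite with `RSnapshotOptions` (again, `tree` and `file.root` are placeholder names; the fAutoFlush value shown is just an illustration):

```python
import ROOT

opts = ROOT.RDF.RSnapshotOptions()
opts.fCompressionAlgorithm = ROOT.kLZ4  # faster to decompress than the default
opts.fCompressionLevel = 4
# Optional: larger flush interval -> fewer, bigger clusters in the output.
# Leave at 0 to keep ROOT's default clustering.
opts.fAutoFlush = 100000

# Hypothetical input names; the empty string selects all branches
df = ROOT.RDataFrame("tree", "file.root")
df.Snapshot("tree", "rewritten.root", "", opts)
```

The rewritten file should then loop much faster, both single-threaded and with implicit MT enabled.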
Then I’d like to take a look. If you can share your script and input file (you can make them available in EOS somewhere), then I can have a look and help figure out what’s the problem.
The problem is the usage of TLorentzVector: most of the time is spent in its custom streamer, so it's not a problem of Python being slower than C++. My advice is to either move to ROOT::Math::LorentzVector, and/or save the TLorentzVector as x,y,z,t or pT,eta,phi,M components in branches of basic types, then use a Define with each component as argument to reconstruct it during the analysis. That will probably speed up your code quite a bit. Cheers,
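Reconstructing the four-vector from basic-type branches with a Define could look like this (a sketch assuming the file stores hypothetical branches `pt`, `eta`, `phi`, `m`; file and tree names are placeholders):

```python
import ROOT

# Hypothetical input: a tree with float branches pt, eta, phi, m
df = ROOT.RDataFrame("tree", "file.root")

# Rebuild the four-vector on the fly; PtEtaPhiMVector is a
# ROOT::Math::LorentzVector typedef with no custom streamer involved
df = df.Define("p4", "ROOT::Math::PtEtaPhiMVector(pt, eta, phi, m)")

# Use it like any other column, e.g. histogram the invariant mass
h = df.Define("mass", "p4.M()").Histo1D("mass")
h.Draw()
```

Since the Define body is JIT-compiled C++, the per-event work stays on the C++ side and parallelizes under EnableImplicitMT.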