Multicore/multithreading falls back to 1 thread

Dear experts,

I’m running my analysis using ROOT.ROOT.EnableImplicitMT(numthreads=32).
It works fine on my local machine during tests, but it looks like on our production server ROOT from time to time does not pick up the predefined number of threads. This happens at random: ROOT just falls back to 1 thread (100% CPU utilization, while above 3150% is expected).
However, ROOT.ROOT.GetImplicitMTPoolSize() always shows the correct size of the pool used for implicit multithreading (in my case 32).

What could be a reason?
We use an external data storage system. Could it be that ROOT tests the file-system communication and falls back to a single thread when the performance is low?


ROOT Version: 6-16-00
Platform: 3.10.0-957.1.3.el7.x86_64
Compiler: gcc (GCC) 8.2.0
LCG_95 x86_64-centos7-gcc8-opt


It depends on what you are doing. If you are running Python code, then the GIL will effectively prevent multithreading in your code. If you are just calling C++, then we need to know what you are running, or to have a simple form of your script that we can run to check what's going on. Note that ROOT also has an internal global lock for accesses to the type system that may or may not affect the parallelism of your code. Cheers,
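For concreteness, here is a minimal sketch of that distinction; the file/tree/branch names are those of the test script shared further down in this thread, and the 30 GeV cut is arbitrary:

import ROOT

ROOT.ROOT.EnableImplicitMT()  # request the implicit-MT thread pool

# Selection expressed as a JIT-compiled C++ string: the RDataFrame event loop
# runs in C++ worker threads and is not serialized by Python's GIL.
df = ROOT.RDataFrame("NOMINAL", "dponomar.root")
n_cpp = df.Filter("lep_0_p4.Pt() > 30.").Count()

# The same selection written as a Python loop holds the GIL for every entry,
# so it stays effectively single-threaded no matter what EnableImplicitMT() says.
n_py = 0
f = ROOT.TFile.Open("dponomar.root")
for event in f.Get("NOMINAL"):
    if event.lep_0_p4.Pt() > 30.:
        n_py += 1

print("C++ selection: %d, Python selection: %d" % (n_cpp.GetValue(), n_py))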

Dear @amadio,

Thanks a lot for pointing me to the GIL. That's something I have to keep in mind when developing analysis software.

Here is a simple test script (the input is this ROOT file):

#!/usr/bin/env python
import ROOT

def main():
    fileName = 'dponomar.root'
    treeName = 'NOMINAL'

    # Toggle implicit multithreading here:
    # ROOT.ROOT.EnableImplicitMT()
    print "Running %s ROOT threads" % ROOT.ROOT.GetImplicitMTPoolSize()
    df = ROOT.RDataFrame(treeName, fileName)
    model = ROOT.RDF.TH1DModel("lep_0_p4.Pt()", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
    hist = df.Define("myP4", "lep_0_p4.Pt()").Histo1D(model, "myP4")
    hist.Draw()  # triggers the event loop

if __name__ == '__main__': main()
#EOF

I performed a set of time measurement runs on my local machine with a local SSD drive:

  • No EnableImplicitMT() (as a reference):
Running 0 ROOT threads
real	0m5.897s
user	0m4.940s
sys	0m0.405s
  • With EnableImplicitMT() (and this really surprises me):
Running 4 ROOT threads
real	0m16.820s
user	0m14.787s
sys	0m21.859s

However, at this step I observe roughly 250% CPU utilization.

Our TTrees don't use the default settings; there was some tuning for the GRID:
CacheSize = 0
AutoSave = 500
AutoFlush = 500
Converting the given ROOT file to a ROOT file with default settings doesn't help: RDF still runs 4 times longer with EnableImplicitMT().
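For reference, these settings can be checked directly on the input tree; a small sketch, assuming the file and tree names from the test script above:

import ROOT

f = ROOT.TFile.Open("dponomar.root")
t = f.Get("NOMINAL")
# With AutoFlush = 500, a basket flush (i.e. a new cluster) happens every 500 entries.
print("AutoFlush: %d" % t.GetAutoFlush())
print("AutoSave:  %d" % t.GetAutoSave())
print("CacheSize: %d" % t.GetCacheSize())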

What else could be the reason?


ROOT Version: tags/v6-16-00@v6-16-00
Platform: macosx64
Compiler: gcc 4.2.1
Python: 2.7.10
2.5 GHz Intel Core i7, 2 cores
16 GB 2133 MHz LPDDR3


How many clusters are there in your file? RDataFrame uses one task per cluster in many cases, so if you have many clusters, it could create more overhead when running in multithreaded mode.
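To see how many clusters a file has, something like the following sketch should work, using TTree::GetClusterIterator and the file/tree names from the script above:

import ROOT

f = ROOT.TFile.Open("dponomar.root")
t = f.Get("NOMINAL")

# Walk the cluster boundaries; each iteration corresponds to one cluster,
# i.e. one RDataFrame task in multithreaded runs.
it = t.GetClusterIterator(0)
n_clusters = 0
start = it.Next()
while start < t.GetEntries():
    n_clusters += 1
    start = it.Next()
print("%d clusters for %d entries" % (n_clusters, t.GetEntries()))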

Dear @amadio,

You're right, there are 6534 clusters in this file.

Could you please also point me to a script or solution for reprocessing the files to decrease the number of clusters?

Thanks a lot for your help!

Best regards,
Daniil

The easiest thing to do is probably to just make a snapshot of the whole file with RDataFrame. You can use the snapshot options to reduce the number of clusters in the output via the fAutoFlush setting, but you probably don't need to modify it. If your file is not too big, I also recommend using a faster compression algorithm like LZ4, or even leaving the file uncompressed if you can, since that will speed up looping over it by a lot. Cheers,
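A sketch of such a snapshot, assuming the same input names as before; the output file name and the exact fAutoFlush/compression values are only illustrative:

import ROOT

opts = ROOT.RDF.RSnapshotOptions()
opts.fCompressionAlgorithm = ROOT.ROOT.kLZ4  # faster to decompress than the default
opts.fCompressionLevel = 4
opts.fAutoFlush = -30000000  # negative value: flush roughly every 30 MB -> few, large clusters

df = ROOT.RDataFrame("NOMINAL", "dponomar.root")
# An empty column regexp keeps all branches; the output name is made up.
df.Snapshot("NOMINAL", "dponomar_lz4.root", "", opts)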

Thanks!

I prepared a new TTree with RDF Snapshot using the default options. The new TTree has 23 clusters, but the problem is still there:

  • No EnableImplicitMT():
Running 0 ROOT threads
real	0m3.514s
user	0m3.121s
sys	0m0.269s
  • With EnableImplicitMT():
Running 4 ROOT threads
real	0m14.324s
user	0m11.177s
sys	0m19.082s

There are 3266983 events and _file0->GetSize() says it is 521 MB.

Then I'd like to take a look. If you can share your script and input file (you can make them available on EOS somewhere), I can have a look and help figure out what the problem is.

Dear @amadio,

The original file and Python script are available on my CernBox. There is also the converted file that was made using RDF Snapshot.

Thanks a lot for your effort! That is quite an old problem and has a huge impact on our analysis.

The problem is the usage of TLorentzVector: most of the time is spent in its custom streamer, so it's not a problem of Python being slower than C++. My advice is to either move to ROOT::Math::LorentzVector, and/or save the TLorentzVector as x,y,z,t or pT,eta,phi,M components using branches of basic types, then use a Define with each component as an argument to reconstruct it during the analysis. That will probably speed up your code quite a bit. Cheers,
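A rough sketch of that two-step approach (the flat branch names and output file name are invented for illustration): first write the four-vector components as basic-type branches, then rebuild a ROOT::Math four-vector with a Define during the analysis:

import ROOT

# Step 1: flatten the TLorentzVector branch into basic-type branches and snapshot them.
df = ROOT.RDataFrame("NOMINAL", "dponomar.root")
df = (df.Define("lep_0_pt",  "lep_0_p4.Pt()")
        .Define("lep_0_eta", "lep_0_p4.Eta()")
        .Define("lep_0_phi", "lep_0_p4.Phi()")
        .Define("lep_0_m",   "lep_0_p4.M()"))
cols = ROOT.std.vector('string')()
for c in ("lep_0_pt", "lep_0_eta", "lep_0_phi", "lep_0_m"):
    cols.push_back(c)
df.Snapshot("NOMINAL", "dponomar_flat.root", cols)

# Step 2: in the analysis, reconstruct the four-vector on the fly from the components.
df2 = ROOT.RDataFrame("NOMINAL", "dponomar_flat.root")
df2 = df2.Define("lep_0_v4",
                 "ROOT::Math::PtEtaPhiMVector(lep_0_pt, lep_0_eta, lep_0_phi, lep_0_m)")
model = ROOT.RDF.TH1DModel("lep_0_pt", ";p_{T} (lep_{0}) GeV;", 100, 0., 100.)
hist = df2.Histo1D(model, "lep_0_pt")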

Dear @amadio,

Thanks a lot! I managed to reprocess the data (writing new ROOT::Math::LorentzVectors), and that helped!
