Dear experts,
I have noticed a trend in the time consumption of some ROOT jobs and I wish to understand it.
Say I have a ROOT macro which processes a dataset of size N, and when I run it locally from the command line on my cluster, it takes half an hour.
Now say I have 100 datasets, all of size similar to N (within statistical fluctuations), and I submit 100 parallel instances of my ROOT job to HTCondor, which delegates the jobs to various worker nodes on my cluster. I now see that many jobs take a lot longer than half an hour: something in the neighbourhood of 2 hours, for example. A few complete more quickly (say in an hour), but none as quickly as the local execution.
The only explanation I can come up with is that when Condor submits jobs to these worker nodes, it often grabs all the cores on a particular machine. There might be 10 jobs running on the 10 cores of a machine, at 1 job per core.
Is it possible that when I execute a single job locally, presumably without occupying all the cores, ROOT shares the job among a few cores without my knowing and therefore runs faster? Or is there some other explanation for this?
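To test this hypothesis, I suppose I could compare wall-clock time and CPU time for the same job, once locally and once on a worker node. Below is a minimal sketch in plain C++ (no ROOT calls). `Timing`, `time_job` and `cpu_to_wall_ratio` are names I made up for illustration, and I am assuming a Linux node where `std::clock()` reports CPU time summed over all threads of the process. A ratio well above 1 would suggest the job is using several cores; a ratio well below 1 would suggest it is mostly waiting, for example on disk or network I/O.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper types and functions, not part of ROOT or HTCondor.
struct Timing {
    double wall_s;  // elapsed real (wall-clock) time in seconds
    double cpu_s;   // CPU time in seconds; on Linux, summed over all threads
};

// Run `job` once and record how much wall-clock and CPU time it used.
template <typename Job>
Timing time_job(Job job) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    job();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_s = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_s = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// CPU time divided by wall time; returns 0 if no wall time was measured.
// Roughly 1: one busy core. Above 1: several cores. Below 1: mostly waiting.
double cpu_to_wall_ratio(const Timing& t) {
    return t.wall_s > 0.0 ? t.cpu_s / t.wall_s : 0.0;
}
```

Inside a macro I would wrap my existing entry point, e.g. `Timing t = time_job([] { /* call my analysis here */ });`, and print `cpu_to_wall_ratio(t)` for the local run and for a Condor run to see whether the two differ.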
Thanks in advance for your help,
Arvind.