I tested your suggestion of setting the timeout to a negative value, and now everything runs fine! The “Setting up worker nodes” initialization phase does take ~5 min to complete, though.
For me the problem is solved, but if you want to run some more tests, I remain at your disposal.
OK, so for some reason the startup of the workers may be slow or delayed when they are launched via a batch system.
I’ll see if we can improve the diagnostics when this happens (at least report the number of workers
that timed out and, in that case, suggest increasing the timeout).
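To make the idea concrete, here is a rough sketch of what such a diagnostic could look like. This is only an illustration, not the actual implementation: `WorkerStatus`, `startup_diagnostic`, and the `timeout_s` parameter are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class WorkerStatus:
    """Hypothetical record of one worker's startup outcome."""
    name: str
    connected: bool  # False if the worker did not respond before the timeout


def startup_diagnostic(workers, timeout_s):
    """Return a warning naming the workers that timed out, or None if all connected."""
    timed_out = [w.name for w in workers if not w.connected]
    if not timed_out:
        return None
    return (
        f"{len(timed_out)} of {len(workers)} workers timed out after "
        f"{timeout_s} s during startup ({', '.join(timed_out)}). "
        "If the workers are launched via a batch system, "
        "consider increasing the timeout."
    )


if __name__ == "__main__":
    statuses = [
        WorkerStatus("node1", True),
        WorkerStatus("node2", False),
        WorkerStatus("node3", False),
    ]
    print(startup_diagnostic(statuses, timeout_s=60))
```

The point is just that the message reports how many workers failed to connect and hints at the likely remedy, rather than failing without explanation.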
Thanks a lot for your feedback. And patience.