Hi experts,
for my analysis I am using ROOT-based C++ code (more precisely, a framework derived from SFrame). The compilation works without any problems and the code runs fine with PROOF-Lite on my local machine.
I have now moved to a batch system for running my jobs. I submit multiple jobs to the batch system, each one with its own PROOF working directory (it is set to .proof in the temporary working directory on the worker node). While the code runs fine for most of the jobs, a small fraction (about 10%) of the jobs crash with error messages like:
Wrk-0.0: DDCore: version change (current: 5.34/01:45034, build: :-1): cleaning ...
Wrk-0.0: error: macro 'DDCore/PROOF-INF/SETUP.C' could not be loaded: cannot continue
Wrk-0.0: failure loading DDCore ...
DDCore in this case is a custom package with a corresponding .par file. The crashes happen seemingly at random, but always with the error message that the PROOF-INF/SETUP.C of a package could not be loaded. Since most of the jobs run fine, and the package for which SETUP.C cannot be loaded differs between the crashed jobs, I would exclude a problem with the setup of the packages themselves.
At the moment I suspect some kind of concurrency or filesystem problem behind this error message, and I am trying to pin down possible explanations for this behaviour. Is it possible to run into problems if there are too many PROOF-Lite instances running on the same worker node?
I know that this information is rather vague, but I was wondering whether anyone else has observed similar problems or errors. I am currently using ROOT 5.34.01.
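One way to rule out two instances on the same node sharing a working directory would be to make the directory name unique per job and per process. The following is only a minimal sketch of that idea, not what my framework currently does: the helper name makeJobSandbox, the ".proof_" prefix and the job ID string are hypothetical, it needs C++17 and a POSIX system, and the resulting path would still have to be handed to whatever mechanism sets the PROOF working directory.

```cpp
#include <filesystem>
#include <string>
#include <unistd.h>   // getpid (POSIX only)

namespace fs = std::filesystem;

// Hypothetical helper: create and return a working directory whose name
// combines a batch job ID with the process ID, so that two jobs (or two
// processes of the same job) on one node never share a directory.
fs::path makeJobSandbox(const fs::path& base, const std::string& jobId) {
    fs::path dir = base / (".proof_" + jobId + "_" + std::to_string(::getpid()));
    fs::create_directories(dir);  // no error if the directory already exists
    return dir;
}
```

If the crashes persist even with directories that are guaranteed to be distinct, a collision between working directories can be excluded as the cause.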
Thanks
Christian