PROOF merge crashes with more workers but works with fewer workers

Hi,

We at BNL are seeing a strange problem: a PROOF job crashes during the merging stage with more workers (say 50) but runs fine with fewer (say 25). The number of workers at which the job crashes varies from run to run. I monitored the memory consumption on the master node and found that the job reached 2 GB just before the crash. The same job runs fine on PROOF-Lite with any number of workers, even though the memory used by PROOF-Lite goes beyond 2.6 GB. I just ran PROOF in Valgrind mode and got the following message:

According to the error message, it looks like it could not create a new object. Any ideas? Many users at BNL are experiencing the same problem, and your assistance in resolving it would be greatly appreciated.

The ROOT version in use on the PROOF server is 5.27.04, and the OS is SL5.

–Shuwei

Dear Shuwei,

Your valgrind output comes from an optimized build of ROOT, so the only really meaningful information is the fact that ‘new’ fails. This points to a memory issue.
What objects are you merging? Do you expect large memory usage from the objects you merge?
Depending on the object type, TFileMerger may load everything into memory (histograms, for example), so you have to multiply the expected output size by N_wrks …
If I understand correctly, it is the master that crashes, and you seem to have a 2 GB memory limit on the master machine (a ulimit setting?).
How many workers did you run in PROOF-Lite?

Gerri

Hi Gerri,

Thanks for your reply.

There are many 2D histograms to merge via TProofOutputFile. Currently we have 96 workers, and the job starts to crash at about 40 workers. But this number varies from run to run: sometimes it crashes at 50 workers, sometimes at 30.

In PROOF-Lite mode on the master node, I tried “workers=96”, the same number as on PROOF, and the job still ran well even though the memory consumption of root.exe reached 2.6 GB, beyond 2 GB.

The fact that the merge runs fine on PROOF-Lite but crashes on PROOF puzzles me. What is the difference in file merging between PROOF-Lite and PROOF? How can I investigate this problem further?

–Shuwei

Hi Shuwei,

My interpretation is that the master inherits a memory limit which is not active in the PROOF-Lite session. Perhaps I am wrong, but we do not yet have all the elements to rule that out.

So, in my opinion, the first thing to do is to take the size of the output that you get when it works and do a simple calculation to estimate the transient memory needed as a function of the number of workers. This is to see whether you can roughly explain the number of workers (about 40) at which it crashes.

Then you should check the xrootd or xproofd startup scripts for any ulimit setting. Do you have access to those scripts? Do you have access to the xrootd or xproofd configuration file? If yes, could you post these files?

Gerri

Hi Gerri

The proofd configuration is:

And xrootd configuration is:

The final merged (and good) output ROOT file is only 640 KB, but the compression factor is about 200.

–Shuwei
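
The estimate suggested earlier in the thread can be sketched with the figures from this post. This is only a rough upper bound under two assumptions that are not stated in the thread: each worker returns a file whose uncompressed size is about the same as that of the merged output (histograms with identical binning take the same memory however they are filled), and the merger holds all worker outputs in memory at once. The function name and its parameters are hypothetical.

```python
def transient_merge_memory(compressed_bytes, compression_factor, n_workers):
    """Worst-case bytes in memory if every worker's output is held
    uncompressed at the same time (an assumption, not measured behaviour)."""
    per_worker = compressed_bytes * compression_factor
    return per_worker * n_workers


KB = 1024
GB = 1024 ** 3

output_size = 640 * KB   # size of the merged file quoted in the post
compression = 200        # approximate compression factor quoted in the post

# Print the worst-case estimate for a few worker counts.
for n in (10, 25, 40, 50, 96):
    need = transient_merge_memory(output_size, compression, n)
    print(f"{n:3d} workers: {need / GB:5.2f} GB")
```

If the worst-case figure crosses a suspected limit at far fewer workers than the observed crash point, the merger is probably not holding every input at once, and the number is useful only as an upper bound.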

Hi,

It finally turned out that a 2 GB limit on virtual memory and a 1 GB limit on resident memory were set in /etc/rc.d/init.d/xproofd. After removing those limits, the crash during merging is gone.

–Shuwei
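
Resource limits set in a startup script are inherited by every process started from it, so a check like the one below can help spot them. This is a minimal sketch using Python's standard resource module (Unix only). It reports the limits of the process that runs it, so it has to be executed in the environment under suspicion; an interactive shell may show different values from a daemon started by an init script. The helper names are hypothetical.

```python
import resource


def describe_limit(value):
    """Render a limit given in bytes as text, or 'unlimited' if none is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 1024 ** 3:.2f} GiB"


def memory_limits():
    """Return the soft and hard memory limits of the current process."""
    wanted = (
        ("virtual memory (RLIMIT_AS)", resource.RLIMIT_AS),
        ("resident memory (RLIMIT_RSS)", resource.RLIMIT_RSS),
    )
    limits = {}
    for label, res in wanted:
        soft, hard = resource.getrlimit(res)
        limits[label] = (describe_limit(soft), describe_limit(hard))
    return limits


# Show what the current environment imposes.
for label, (soft, hard) in memory_limits().items():
    print(f"{label}: soft={soft}, hard={hard}")
```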