Unable to use more than one compute node in HPC cluster

snbanerjee · September 14, 2017, 9:58am

Hello,

Our setup: an HPC cluster with InfiniBand FDR | CentOS 6.7 64-bit | SLURM 14.03.7 | one master and 10 compute nodes | Gluster file system.

We have ROOT 5.34 and MadGraph 2.6.0 installed, and they run fine on the above cluster when launched from a shell script on one compute node (16 Intel cores).

But when we request more than one compute node in the script, the job runs for a while and then fails with an error. It also takes an enormous amount of time.

We also edited the mg5_configuration.txt file, changing the queue name and cluster name, and ran again, but no luck.

Any help would be appreciated.

Thanks! Sudeep

Axel · September 14, 2017, 10:43am

That’s not enough information for us to help. Let’s start with this:

How do you declare what? What is “the script”?

What is the error message?

snbanerjee · September 14, 2017, 11:09am

Dear Axel,

Thanks a lot for the response and sorry for not providing adequate information.

Here is the script:
#!/bin/bash
#SBATCH --job-name=myjob1
# Set --nodes to 1 for a single node, or higher for more than one node.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --partition=main

echo "this is my test hello program"
cd ~/Research17/1
MACHINEFILE=machinefile
scontrol show hostname $SLURM_JOB_NODELIST > $MACHINEFILE
/opt/baradwaj_grp_sw/satendrak/ROOT/softwares/MG5_aMC_v2_6_0/bin/mg5_aMC PP2Zpjj2zh2llbb.dat
echo "My test is done"

The error:
Unfortunately, we have deleted the folder that contained the error. We are rerunning with a smaller number of events, and that should finish within the next 2-3 hours.

Apologies again for bothering you with incomplete data.

Regards, Sudeep

snbanerjee · September 14, 2017, 11:17am

Hi Axel,

How do I upload the mg5_configuration.txt file here? Is there any upload button/option?

Regards, Sudeep

snbanerjee · September 14, 2017, 11:22am

Hi Axel,

Moreover, I just submitted a job to my SLURM scheduler with the shell script, for 2 compute nodes.

But it looks like it is automatically creating many other jobs, in the PENDING state, with different job IDs.

[satendrak@hpc 1]$ squeue -l
Thu Sep 14 16:49:16 2017
JOBID PARTITION NAME     USER     STATE   TIME TIMELIMIT  NODES NODELIST(REASON)
20865 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20866 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20867 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20868 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20869 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20870 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20871 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20872 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20873 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20874 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20875 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20876 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20877 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20878 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20879 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20880 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20881 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20882 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Priority)
20864 main      a3442d44 satendra PENDING 0:00 2-00:00:00 1     (Resources)
20863 main      myjob1   satendra RUNNING 5:41 2-00:00:00 2     node[2-3]

The last job, 20863, is the real one. Any idea?
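
To see at a glance which jobs belong together, the queue listing can be grouped by job name and state. The following is only a minimal sketch: the function name `summarise_jobs` is hypothetical, and it assumes the column order shown in the `squeue -l` output above (JOBID, PARTITION, NAME, USER, STATE). It says nothing about where the extra jobs come from; it only counts them.

```shell
#!/bin/bash
# Hypothetical helper: count squeue rows per (job name, state) pair.
# Assumes whitespace-separated columns with NAME in field 3 and STATE in field 5,
# as in the `squeue -l` listing above. Lines whose first field is not a numeric
# job ID (the date line and the header) are skipped.
summarise_jobs() {
    awk '$1 ~ /^[0-9]+$/ { count[$3 " " $5]++ }
         END { for (k in count) print k, count[k] }' | sort
}
```

It could be used as `squeue -l -u "$USER" | summarise_jobs`. Jobs that share a name other than the one given with `--job-name` were presumably submitted by something running inside the batch job rather than by the original submission, though the thread does not confirm this.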

snbanerjee · September 19, 2017, 7:27am

Hi,

May I request an update on the issue we are facing? Thank you in advance.

Axel · September 20, 2017, 7:14pm

Hi,

We know ROOT. We don’t know SLURM or mg5, so I am not sure we are the right people to talk to here. Unless you have an error message from ROOT, I don’t know how I can help…

Axel.
