Unable to use more than one compute node in HPC cluster

Hello,

Our HPC cluster: InfiniBand FDR | CentOS-6.7 64-bit | SLURM-14.03.7 | one master and 10 compute nodes | Gluster file system.

We have ROOT-5.34 and Madgraph-2.6.0 installed, and they run fine on this cluster when launched from a shell script on a single compute node (16 Intel cores).

But when we request more than one compute node in the script, the job runs for a while and then fails with an error. It also takes an enormous amount of time.

We also edited the mg5_configuration.txt file, changing the queue name, cluster name, and run mode, but no luck.
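For anyone checking a similar setup, here is a minimal sketch for listing which cluster-related settings are actually active in mg5_configuration.txt. It assumes the file uses `key = value` lines with a leading `#` marking disabled entries, and that the relevant keys are `run_mode`, `cluster_type`, `cluster_queue`, and `nb_core`. The function name `show_cluster_settings` is made up for illustration.

```shell
#!/bin/bash
# Print the active (uncommented) cluster-related settings from an MG5 config file.
# Assumption: settings look like "key = value"; lines starting with '#' are disabled.
show_cluster_settings() {
    local config_file="$1"
    grep -E '^[[:space:]]*(run_mode|cluster_type|cluster_queue|nb_core)[[:space:]]*=' "$config_file"
}

# Example call (the path is a placeholder, not taken from the thread):
# show_cluster_settings /path/to/mg5_configuration.txt
```

If nothing is printed, all of these keys are still commented out and the program's defaults apply.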

Any help would be appreciated.

Thanks! Sudeep

That’s not enough information for us to help. Let’s start with this:

How do you declare what? What is “the script”?

What is the error message?

Dear Axel,

Thanks a lot for the response and sorry for not providing adequate information.

Here is the script:
#!/bin/bash
#SBATCH --job-name=myjob1
# Number of nodes: 1 for a single-node run, 2 or more for a multi-node run
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --error=test.%J.err
#SBATCH --output=test.%J.out
#SBATCH --partition=main

echo "this is my test hello program"
cd ~/Research17/1
MACHINEFILE=machinefile
scontrol show hostname $SLURM_JOB_NODELIST > $MACHINEFILE
/opt/baradwaj_grp_sw/satendrak/ROOT/softwares/MG5_aMC_v2_6_0/bin/mg5_aMC PP2Zpjj2zh2llbb.dat
echo "My test is done"
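A note on the script as posted: it writes the machinefile but never passes it to mg5_aMC, and a command that is started directly in a batch script (not through srun) runs only on the first node of the allocation. As a sanity check, the sketch below counts the distinct hosts in the machinefile so it can be compared with the number of nodes Slurm allocated. The function name `count_hosts` is hypothetical; `SLURM_JOB_NUM_NODES` is the environment variable Slurm sets inside a job.

```shell
#!/bin/bash
# Count distinct, non-empty hostnames in a machinefile (one hostname per line).
count_hosts() {
    local machinefile="$1"
    grep -v '^[[:space:]]*$' "$machinefile" | sort -u | wc -l | tr -d '[:space:]'
}

# Hypothetical check inside the batch script, after the machinefile is written:
# if [ "$(count_hosts machinefile)" -ne "$SLURM_JOB_NUM_NODES" ]; then
#     echo "warning: machinefile does not list every allocated node" >&2
# fi
```

This only confirms what was allocated; it does not by itself make the program use the second node.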

The error:
Unfortunately, we have deleted the folder that contained the error output. We are rerunning the job with fewer events; it should finish within the next 2-3 hours.

Apologies again for bothering you with incomplete data.

Regards, Sudeep

Hi Axel,

How do I upload the mg5_configuration.txt file here? Is there any upload button/option?

Regards, Sudeep

Hi Axel,

Moreover, I just submitted a job to my Slurm scheduler with the shell script, requesting 2 compute nodes.

But it looks like it is automatically creating many other jobs, all in pending state, with different job IDs.

[satendrak@hpc 1]$ squeue -l
Thu Sep 14 16:49:16 2017
JOBID  PARTITION  NAME      USER      STATE    TIME  TIMELIMIT   NODES  NODELIST(REASON)
20865  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20866  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20867  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20868  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20869  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20870  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20871  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20872  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20873  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20874  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20875  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20876  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20877  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20878  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20879  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20880  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20881  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20882  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Priority)
20864  main       a3442d44  satendra  PENDING  0:00  2-00:00:00  1      (Resources)
20863  main       myjob1    satendra  RUNNING  5:41  2-00:00:00  2      node[2-3]

The last job, 20863, is the real one. Any idea?
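One possible reading, not confirmed anywhere in this thread, is that the extra single-node jobs were submitted by MadGraph itself while running in a cluster run mode, since they share one generated-looking name and each requests a single node. To make such a listing easier to read, a small sketch that groups jobs by name and state is shown below. The function name `summarize_jobs` is hypothetical; the squeue options in the usage comment (`-u`, `-h`, `-o` with `%j` for job name and `%T` for state) are standard ones.

```shell
#!/bin/bash
# Summarize "NAME STATE" pairs read from stdin as "COUNT NAME STATE", largest count first.
summarize_jobs() {
    awk 'NF >= 2 { count[$1 " " $2]++ }
         END { for (key in count) print count[key], key }' | sort -k1,1nr -k2
}

# Hypothetical usage on the cluster (-h drops the header, -o sets the output format):
# squeue -u "$USER" -h -o "%j %T" | summarize_jobs
```

For the listing above, this would collapse the 19 pending a3442d44 entries and the single running myjob1 entry into two lines.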

Hi,

Could you kindly give us an update on the issue we are facing? Thank you in advance.

Hi,

We know ROOT. We don’t know Slurm or mg5, so I am not sure we are the right people to talk to here. Unless you have an error message from ROOT, I don’t know how I can help…

Axel.
