Submitting parallel jobs on dirac

If you are not familiar with writing shell scripts to run jobs on a queuing system, you should first read the (simpler) page on submitting serial jobs to become familiar with the general ideas.

If a parallel job terminates abnormally (e.g. you kill it, it is killed automatically by the queuing system, or it dies on its own for some reason), it may leave some processes still running on the nodes. Use the command mynodeprocs to see which nodes have processes belonging to you running on them. mynodeprocs shows your jobs on each node, but it can have trouble with long lists of nodes for one job, so check with qstat -anu USERNAME. If any processes do not correspond to jobs still in the queue, kill them with the command:

      rsh NODENAME "kill PID_NO"

where NODENAME is the name of the node e.g. node7, and PID_NO is the number of the process, given by mynodeprocs.
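
When several stray processes are left behind, it can help to generate the kill commands first and check them before running anything. The following is a minimal sketch, assuming you have written the node names and process numbers reported by mynodeprocs into "NODENAME PID" pairs by hand. The function name print_kill_cmds and the sample values are hypothetical; the function only prints commands and does not run them.

```shell
# Print (do not run) one rsh kill command per "NODENAME PID" pair on stdin.
print_kill_cmds() {
    while read -r node pid; do
        # skip blank lines and lines whose second field is not a plain number
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        printf 'rsh %s "kill %s"\n' "$node" "$pid"
    done
}

# Example with made-up node names and process numbers
print_kill_cmds <<'EOF'
node7 12345
node7 12346
node12 2201
EOF
```

Once the printed commands look right, they can be pasted into the shell one at a time.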

When you run a parallel job, the nodes must use the environment in which the job was compiled. This is because MPICH has to be compiled with the same compiler as the job, and with the same method (e.g. Myrinet) of node inter-communication. The command to start a parallel job using Myrinet is mpirun.ch_gm (instead of the usual mpirun), but there are three versions, compiled with the Portland, Intel and GNU compilers. To make sure the appropriate version is used, none of the directories containing these commands is on the default PATH. Instead, the appropriate directory is placed on your PATH by running a script. Calls to these scripts are placed in your default .tcshrc and .bashrc files.

For the bash shell these calls are:

. /usr/local/sbin/usechgm121-7b    Myrinet with gnu compilers
. /usr/local/sbin/usechgmp121-7b   Myrinet with Portland compilers
. /usr/local/sbin/usechgmi121-7b   Myrinet with Intel compilers

. /usr/local/sbin/uselam652        LAM-MPI with gnu compilers
. /usr/local/sbin/uselamp652       LAM-MPI with Portland compilers
. /usr/local/sbin/uselami652       LAM-MPI with Intel compilers

. /usr/local/sbin/usech            MPICH with gnu compilers
. /usr/local/sbin/usechp           MPICH with Portland compilers
. /usr/local/sbin/usechi           MPICH with Intel compilers

For the tcsh shell these calls are:

source /usr/local/sbin/usechgm121-7b.tcsh    Myrinet with gnu compilers
source /usr/local/sbin/usechgmp121-7b.tcsh   Myrinet with Portland compilers
source /usr/local/sbin/usechgmi121-7b.tcsh   Myrinet with Intel compilers

source /usr/local/sbin/uselam652.tcsh        LAM-MPI with gnu compilers
source /usr/local/sbin/uselamp652.tcsh       LAM-MPI with Portland compilers
source /usr/local/sbin/uselami652.tcsh       LAM-MPI with Intel compilers

source /usr/local/sbin/usech.tcsh            MPICH with gnu compilers
source /usr/local/sbin/usechp.tcsh           MPICH with Portland compilers
source /usr/local/sbin/usechi.tcsh           MPICH with Intel compilers

When you submit a job to the queuing system, your PATH is copied to the PATH of the shell running the job. To set the correct environment, one method is to make sure that all of these calls except the correct one are commented out (by a # symbol at the start of the lines) in your .bashrc or .tcshrc file and then log out and log back in again before submitting the job.

A better method is to have all of the calls commented out (except when you are compiling and testing parallel programs) and have the call in your job submission script. This is the recommended method, and the calls will be in the example scripts below.
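
For bash users, the recommended method might look like the sketch below: the environment call is made inside the job script, followed by a check that mpirun.ch_gm can now be found. The helper name check_mpirun is hypothetical and is not one of the scripts in /usr/local/sbin; the Portland/Myrinet call is used only as an example.

```shell
# Hypothetical helper: report which copy of a command (default mpirun.ch_gm)
# is found on PATH, or return 1 if none is.
check_mpirun() {
    local cmd="${1:-mpirun.ch_gm}" found
    if found=$(command -v "$cmd"); then
        echo "using $found"
    else
        echo "$cmd is not on PATH" >&2
        return 1
    fi
}

# Set the environment in the job script itself (Portland compilers with Myrinet),
# guarded so that the sketch also runs on a machine without the script.
if [ -r /usr/local/sbin/usechgmp121-7b ]; then
    . /usr/local/sbin/usechgmp121-7b
fi

check_mpirun mpirun.ch_gm || echo "environment not set: check the call above"
```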

It is assumed that Myrinet will be the method used for node inter-communication as it is the fastest, and the example scripts will use this method.

A job submission script for a parallel job is a little more complicated than for a serial job, as the run command (mpirun.ch_gm) needs information such as the number of processors.

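The command mpirun.ch_gm can take various flags. Those used in the example script below are --gm-kill, --gm-w, --gm-v, --gm-use-shmem, -np and --gm-f.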

Always have sleep 60 as the final line of your script. If you do not do this, the next job just starting may catch up with your job, causing it to crash in a mess. If you find problems with the shared memory segments not clearing, the commands you need are ipcs (to list the segments) and ipcrm (to remove them).
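
As an illustration of how ipcs output might be filtered, the sketch below picks out the ids of the shared memory segments owned by one user. It assumes the Linux ipcs -m column order (key, shmid, owner, perms, bytes, nattch, status); the function name my_shm_ids and the sample output are made up for the example.

```shell
# Sketch: read `ipcs -m` output on stdin and print the shmid of each
# segment owned by the user named in $1.
# Assumes the column order: key shmid owner perms bytes nattch status.
my_shm_ids() {
    awk -v user="$1" '$3 == user && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Demonstration on made-up sample output
# (real use on a node would be: ipcs -m | my_shm_ids "$USER")
my_shm_ids alice <<'EOF'
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 65536      alice      600        393216     0
0x00000000 98305      bob        600        393216     2
EOF
# prints 65536
```

Each id listed can then be removed with ipcrm -m ID (some older versions use ipcrm shm ID instead). The segments live on the individual nodes, so both commands have to be run on the node concerned.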

The description below assumes that you will want to use an even number of processors. The first example script below is for a program which does not need scratch space on the nodes on which it is running (DLPOLY in this case). A copy of this script is in /usr/local/sbin, called qsubparallel1.example, which, if you wish, you can copy, rename and edit (there appears to be a maximum of 15 characters for the length of the name). You should edit the line:

      #PBS -l walltime=6:00:00,nodes=4:ppn=2

for the amount of walltime you need and the number of nodes and processors per node (hence the total number of processors). nodes= gives the number of nodes and ppn= gives the number of processors per node, so 4 nodes with 2 processors per node gives a total of 8 processors. Yes, you do need the # symbol at the start of the line, with no space between it and PBS. The nodes resource must be in the order shown: the number of nodes, then the number of processors per node.
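
The arithmetic can be checked with a few lines of shell. This is only a sketch: the function name total_procs is hypothetical, and treating a missing ppn= as 1 is an assumption of the example rather than something stated on this page.

```shell
# Compute nodes * ppn from a PBS resource string such as
# "walltime=6:00:00,nodes=4:ppn=2".
total_procs() {
    local spec="$1" nodes ppn
    nodes=$(printf '%s\n' "$spec" | sed -n 's/.*nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$spec" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    # Assumption of this sketch: no nodes= counts as 0, no ppn= counts as 1
    echo $(( ${nodes:-0} * ${ppn:-1} ))
}

total_procs "walltime=6:00:00,nodes=4:ppn=2"   # prints 8
```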

The line:

      source /usr/local/sbin/usechgmp121-7b.tcsh

sets the correct environment (the Portland compiler with Myrinet, in this case). You should edit the line:

      setenv MYDIR "${HOME}/DLPTEST/TEST3"

changing DLPTEST/TEST3 to the directory (relative to your home directory) where your input files are, and your output files will go.

You should edit the line:

       setenv MYEX "${HOME}/DLPTEST/bin/pDLPOLY.X"

changing DLPTEST/bin/pDLPOLY.X to the path to your executable (relative to your home directory). If your executable takes arguments, or you wish to redirect stdout, include this in MYEX (enclosed in the inverted commas), e.g.

      setenv MYEX "${HOME}/TEST1/a.out > outfile"

#!/bin/tcsh -f 
#
#PBS -l walltime=6:00:00,nodes=4:ppn=2
# Tell PBS to use 4 nodes and 2 processors per node
#PBS -j oe
#
#-------------------------------------------------
#        edit this part
#
#   get the right environment if you have all the lines 
#   commented out in your .bashrc or .tcshrc file
#   this is Portland compiler/Myrinet
#
source /usr/local/sbin/usechgmp121-7b.tcsh
#
#   your working directory
setenv MYDIR "${HOME}/DLPTEST/TEST3"
#
#   the executable
setenv MYEX "${HOME}/DLPTEST/bin/pDLPOLY.X" 
#
#-------------------------------------------------
#       this part can stay the same
#
#   make sure you are in the correct directory
cd $MYDIR
#
#   get the job id no
setenv JOBNO "`echo $PBS_JOBID | sed s/.chm.bris.ac.uk//`"
#
#   get list of nodes allocated for job. 
#   If you ask for two processors per node,
#   the node will be entered twice in the list
set nodelist = `cat $PBS_NODEFILE`
#
#   get the number of processors
setenv NUMPROC ${#nodelist}
#
#   name the gm configuration file
setenv CONFILE "${MYDIR}/gm.${JOBNO}.conf"
#
#   put number of processors in gm configuration file
echo  $NUMPROC > $CONFILE
#
#   put node and port numbers in gm configuration file
#   (port 4 for a node's first entry, port 2 for a repeated entry)
#
set prev = ""
foreach iii ( ${nodelist} )
  if ( "${prev}" != "${iii}" ) then
    echo ${iii} 4 >> $CONFILE
  else
    echo ${iii} 2 >> $CONFILE
  endif
  set prev = ${iii}
end
#
#    run job
mpirun.ch_gm  --gm-kill 1 --gm-w 1 --gm-v --gm-use-shmem  -np $NUMPROC --gm-f $CONFILE $MYEX
#
#    after job has run wait 60 seconds to give
#    all processes time to die
sleep 60
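
If you write your job scripts in bash, the step that builds the gm configuration file could be sketched as follows. This is an illustrative translation of the tcsh loop above, not one of the example scripts in /usr/local/sbin. The function name write_gm_conf is hypothetical, and the port numbers 4 and 2 are simply copied from the tcsh script.

```shell
# Write a gm configuration file from a PBS-style node file.
#   $1 = node file (one node name per line, repeated once per processor)
#   $2 = configuration file to create
write_gm_conf() {
    local nodefile="$1" conf="$2" prev="" node
    # first line: the number of processors (non-empty lines in the node file)
    grep -c . "$nodefile" > "$conf"
    while read -r node; do
        [ -n "$node" ] || continue
        if [ "$node" != "$prev" ]; then
            echo "$node 4" >> "$conf"    # first entry for this node
        else
            echo "$node 2" >> "$conf"    # repeated entry for this node
        fi
        prev="$node"
    done < "$nodefile"
}

# Demonstration with a made-up node list (2 nodes, 2 processors each)
tmp=$(mktemp -d)
printf 'node1\nnode1\nnode2\nnode2\n' > "$tmp/nodes"
write_gm_conf "$tmp/nodes" "$tmp/gm.conf"
cat "$tmp/gm.conf"
rm -rf "$tmp"
```

In a real job script the first argument would be "$PBS_NODEFILE", and the environment would be set beforehand with the bash form of the call, e.g. . /usr/local/sbin/usechgmp121-7b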

If you need to make your job run on the large memory nodes, add the attribute bigmem, as:

      #PBS -l walltime=6:00:00,nodes=4:ppn=2:bigmem

You must use the order: number of nodes, then processors per node, then the bigmem attribute.