Submitting parallel jobs on dirac
If you are not familiar with writing shell scripts to run jobs on a queuing system, you should first read the (simpler) page on submitting serial jobs to become familiar with the general ideas.
If a parallel job terminates abnormally (e.g. you kill it, it is killed automatically by the queuing system, or it dies on its own for some reason), it may leave some processes still running on the nodes. Use the command mynodeprocs to see which nodes have processes belonging to you running on them. mynodeprocs shows your jobs on each node, but can have trouble with long lists of nodes for one job, so check with qstat -anu USERNAME. If any processes do not correspond to jobs still in the queue, kill them with the command:
rsh NODENAME "kill PID_NO"
where NODENAME is the name of the node e.g. node7, and PID_NO is the number of the process, given by mynodeprocs.
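When several stray processes are left behind, it can help to generate the kill commands first and inspect them before running anything. The following is a minimal sketch for bash, not a tool installed on dirac. The function name print_kill_cmds and its "NODENAME PID_NO" input format are assumptions, since the output format of mynodeprocs is not described here; the pairs have to be copied by hand from its listing.

```shell
#!/bin/bash
# Hypothetical helper (not part of the dirac installation): reads lines of
# the form "NODENAME PID_NO" on stdin and prints the matching rsh command
# for each one, without executing anything.
print_kill_cmds() {
    local node pid
    while read -r node pid; do
        # skip blank lines and lines whose PID is not a plain number
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        printf 'rsh %s "kill %s"\n' "$node" "$pid"
    done
}

# Example: two stray processes noted down from mynodeprocs
printf 'node7 12345\nnode9 2311\n' | print_kill_cmds
```

Once the printed commands look right, they can be pasted into the shell one at a time.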
When you run a parallel job the nodes must know the correct environment in which the job was compiled. This is because MPICH has to be compiled with the same compiler as the job, and using the same method (e.g. Myrinet) of node inter-communication. The command to start a parallel job using Myrinet is mpirun.ch_gm (instead of the usual mpirun), but there are three versions, compiled using the Portland, Intel and Gnu compilers. To make sure the appropriate version is used, none of the directories containing these commands are on the default PATH. Instead the appropriate directory is placed on your PATH by running a script. Calls for these scripts are placed in your default .tcshrc and .bashrc files.
For the bash shell these calls are:
. /usr/local/sbin/usechgm121-7b     Myrinet with gnu compilers
. /usr/local/sbin/usechgmp121-7b    Myrinet with Portland compilers
. /usr/local/sbin/usechgmi121-7b    Myrinet with Intel compilers
. /usr/local/sbin/uselam652         LAM-MPI with gnu compilers
. /usr/local/sbin/uselamp652        LAM-MPI with Portland compilers
. /usr/local/sbin/uselami652        LAM-MPI with Intel compilers
. /usr/local/sbin/usech             MPICH with gnu compilers
. /usr/local/sbin/usechp            MPICH with Portland compilers
. /usr/local/sbin/usechi            MPICH with Intel compilers
For the tcsh shell these calls are:
source /usr/local/sbin/usechgm121-7b.tcsh     Myrinet with gnu compilers
source /usr/local/sbin/usechgmp121-7b.tcsh    Myrinet with Portland compilers
source /usr/local/sbin/usechgmi121-7b.tcsh    Myrinet with Intel compilers
source /usr/local/sbin/uselam652.tcsh         LAM-MPI with gnu compilers
source /usr/local/sbin/uselamp652.tcsh        LAM-MPI with Portland compilers
source /usr/local/sbin/uselami652.tcsh        LAM-MPI with Intel compilers
source /usr/local/sbin/usech.tcsh             MPICH with gnu compilers
source /usr/local/sbin/usechp.tcsh            MPICH with Portland compilers
source /usr/local/sbin/usechi.tcsh            MPICH with Intel compilers
When you submit a job to the queuing system, your PATH is copied to the PATH of the shell running the job. To set the correct environment, one method is to make sure that all of these calls except the correct one are commented out (by a # symbol at the start of the lines) in your .bashrc or .tcshrc file and then log out and log back in again before submitting the job.
A better method is to have all of the calls commented out (except when you are compiling and testing parallel programs) and have the call in your job submission script. This is the recommended method, and the calls will be in the example scripts below.
It is assumed that Myrinet will be the method used for node inter-communication as it is the fastest, and the example scripts will use this method.
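If you write your job script in bash rather than tcsh, the same idea applies: source the one setup script that matches the compiler used to build the program. The sketch below is only an illustration. The function name chgm_setup_script and the compiler keywords gnu, portland and intel are assumptions; the paths are the bash Myrinet setup scripts listed above.

```shell
#!/bin/bash
# Hypothetical helper: map a compiler name to the bash setup script for
# MPICH over Myrinet, using the paths listed for the bash shell.
chgm_setup_script() {
    case "$1" in
        gnu)      echo /usr/local/sbin/usechgm121-7b ;;
        portland) echo /usr/local/sbin/usechgmp121-7b ;;
        intel)    echo /usr/local/sbin/usechgmi121-7b ;;
        *)        echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

# In a job script you would source the result, for example:
#   . "$(chgm_setup_script portland)"
# Here the path is only printed.
chgm_setup_script portland
```

Keeping the choice in one place makes it harder to compile with one compiler and run with the environment of another.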
A job submission script for a parallel job is a little more complicated than for a serial job, as the run command (mpirun.ch_gm) needs information such as the number of processors.
The command mpirun.ch_gm can take various flags:
- An important one, which should always be used, is --gm-w 1, which causes a wait of one second between starting each process. This is necessary because the speed of the machines can sometimes cause problems with rsh-ing to the nodes, which causes the job to die in a mess.
- You should also use --gm-kill 1 which waits one second after a process dies and then kills any other processes. Useful if something goes wrong with your job.
- You must specify the number of processors, either by --gm-np n or -np n, where n is the number of processors. The two forms are equivalent.
- You must specify a configuration file by --gm-f filename, where filename is the name of the configuration file. The first line of this file is the number of processors, followed by one line per processor giving the node name and the Myrinet port. Processors on the same node must use different ports. Myrinet normally has eight ports, of which numbers 2, 4, 5, 6 and 7 are available to users.
- --gm-use-shmem enables shared memory support. This is
generally recommended, but for certain applications may reduce
performance: the latency is much better, but the peak bandwidth
depends on the performance of the memory copy code provided by the
OS.
--gm-shmem-file filename specifies a shared memory file and --gm-shf explicitly removes the shared memory file.
- --gm-v generates verbose output.
- --gm-recv type, where type is one of polling, blocking or
hybrid, changes the behavior of the blocking MPI calls. The default is
polling. The polling mode asks MPI to poll all devices
continually to check for the completion of an event. This mode
provides the lowest latency but also has the highest CPU utilisation.
It provides the best performance when each process has a dedicated
processor.
The blocking mode means each MPI blocking function call will sleep in the kernel waiting for an interrupt from the Myrinet interface. The CPU utilisation is minimal but increases the latency. This mode is very efficient when several processes compete for the same processor. This is the case for some multi-threaded applications or some MPI applications that spawn several processes per processor by default (e.g. GAMESS).
The hybrid mode is a combination of the two previous modes. The process will poll for one millisecond and then sleep as in the blocking mode. This mode provides a good balance between wasted CPU time and the cost of the interrupt overhead. However, you cannot use blocking or hybrid mode if you use shared memory.
- --gm-recv-verb gives verbose output for the recv mode selection.
- --gm-dryrun doesn't actually execute the commands but just prints them.
- --gm-r starts the processors in reverse order.
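To make the layout of the --gm-f configuration file concrete, here is a minimal bash sketch that builds one from a list of node names. It is an illustration under stated assumptions, not the site's own tool: the function name make_gm_conf is made up, at most two processes per node are assumed, repeated entries for a node are assumed to be adjacent, and ports 4 and 2 are simply two of the user-available ports.

```shell
#!/bin/bash
# Sketch: print the contents of a gm configuration file from a file of node
# names, one per line, where a node appears once per processor requested.
# Assumes at most two processes per node and adjacent repeated entries.
make_gm_conf() {
    local nodefile=$1 prev="" node
    # first line: the number of processors (non-empty lines in the list)
    grep -c . "$nodefile"
    while read -r node; do
        [ -n "$node" ] || continue
        if [ "$node" != "$prev" ]; then
            echo "$node 4"   # first process on this node uses port 4
        else
            echo "$node 2"   # second process on the same node uses port 2
        fi
        prev=$node
    done < "$nodefile"
}

# Example with a hand-made node list (4 processors on 2 nodes)
tmp=$(mktemp)
printf 'node7\nnode7\nnode8\nnode8\n' > "$tmp"
make_gm_conf "$tmp"
rm -f "$tmp"
```

For this node list the output starts with the line 4, followed by one "node port" line for each of the four processes.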
Always have sleep 60 as the final line of your script. If you do not do this, the next job just starting may catch up with your job, causing it to crash in a mess. If you find problems with the shared memory segments not clearing, the commands you need are ipcs and ipcrm.
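As a hedged illustration of how ipcs and ipcrm fit together, the bash sketch below turns a shared memory listing into the matching removal commands without running them. The function name print_ipcrm_cmds is hypothetical, the sample listing is made up, and the column layout (key, shmid, owner, ...) is an assumption based on the usual Linux ipcs -m output. Check man ipcs and man ipcrm on the nodes, as older versions of ipcrm use a different syntax.

```shell
#!/bin/bash
# Hypothetical helper: reads "ipcs -m" output on stdin and prints an
# "ipcrm -m SHMID" command for every segment owned by the given user.
# Assumes the Linux column layout: key shmid owner perms bytes nattch status.
print_ipcrm_cmds() {
    local user=$1
    awk -v u="$user" '$3 == u && $2 ~ /^[0-9]+$/ { print "ipcrm -m " $2 }'
}

# Example with made-up ipcs output; on a node you would instead run
#   ipcs -m | print_ipcrm_cmds "$USER"
print_ipcrm_cmds alice <<'EOF'
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 65536      alice      600        393216     0
0x00000000 98305      bob        600        393216     2
EOF
```

Only remove segments that belong to a job which has already finished.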
The description below assumes that you will want to use an even number of processors. The first example script below is for a program which does not need scratch space on the nodes on which it is running (DLPOLY in this case). A copy of this script is in /usr/local/sbin, called qsubparallel1.example, which, if you wish, you can copy, rename and edit (there appears to be a maximum of 15 characters for the length of the name). You should edit the line:
#PBS -l walltime=6:00:00,nodes=4:ppn=2
for the amount of walltime you need and the number of nodes and processors per node (hence the total number of processors). nodes= gives the number of nodes and ppn= gives the number of processors per node, so 4 nodes and 2 processors per node gives a total of 8 processors. Yes, you do need the # symbol at the start of the line, with no space between it and PBS, and there must be no space after the comma. The nodes resource must be in the order shown: the number of nodes and then the number of processors per node.
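The arithmetic can be checked with a few lines of bash. This is only a sketch: the function name total_procs is made up, and it assumes the specification is written in the order described, with nodes= first and ppn= second.

```shell
#!/bin/bash
# Hypothetical helper: given a nodes specification such as "nodes=4:ppn=2",
# print the total number of processors (nodes multiplied by ppn).
total_procs() {
    local spec=$1 nodes ppn
    nodes=${spec#nodes=}; nodes=${nodes%%:*}   # text between "nodes=" and the first ":"
    ppn=${spec#*ppn=};    ppn=${ppn%%:*}       # text after "ppn=" up to the next ":"
    echo $(( nodes * ppn ))
}

total_procs "nodes=4:ppn=2"   # prints 8
```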
The line:
source /usr/local/sbin/usechgmp121-7b.tcsh
sets the correct environment (the Portland compiler with Myrinet, in this case). You should edit the line:
setenv MYDIR "${HOME}/DLPTEST/TEST3"
changing DLPTEST/TEST3 to the directory (relative to your home directory) where your input files are, and your output files will go.
You should edit the line:
setenv MYEX "${HOME}/DLPTEST/bin/pDLPOLY.X"
changing DLPTEST/bin/pDLPOLY.X to the path to your executable (relative to your home directory). If your executable takes arguments, or if you wish to redirect stdout, include this in MYEX (enclosed in the inverted commas), e.g.
setenv MYEX "${HOME}/TEST1/a.out > outfile"
The complete example script is:
#!/bin/tcsh -f
#
#PBS -l walltime=6:00:00,nodes=4:ppn=2
# Tell PBS to use 4 nodes and 2 processors per node
#PBS -j oe
#
#-------------------------------------------------
# edit this part
#
# get the right environment if you have all the lines
# commented out in your .bashrc or .tcshrc file
# this is Portland compiler/Myrinet
#
source /usr/local/sbin/usechgmp121-7b.tcsh
#
# your working directory
setenv MYDIR "${HOME}/DLPTEST/TEST3"
#
# the executable
setenv MYEX "${HOME}/DLPTEST/bin/pDLPOLY.X"
#
#-------------------------------------------------
# this part can stay the same
#
# make sure you are in the correct directory
cd $MYDIR
#
# get the job id no
setenv JOBNO "`echo $PBS_JOBID | sed s/.chm.bris.ac.uk//`"
#
# get list of nodes allocated for job.
# If you ask for two processors per node,
# the node will be entered twice in the list
set nodelist = `cat $PBS_NODEFILE`
#
# get the number of processors
setenv NUMPROC ${#nodelist}
#
# name the gm configuration file
setenv CONFILE "${MYDIR}/gm.${JOBNO}.conf"
#
# put number of processors in gm configuration file
echo $NUMPROC > $CONFILE
#
# put node and port numbers in gm configuration file
#
set prev = ""
foreach iii ( ${nodelist} )
    if ( "${prev}" != "${iii}" ) then
        # first processor on this node: use port 4
        echo ${iii} 4 >> $CONFILE
    else
        # second processor on the same node: use port 2
        echo ${iii} 2 >> $CONFILE
    endif
    set prev = ${iii}
end
#
# run job
mpirun.ch_gm --gm-kill 1 --gm-w 1 --gm-v --gm-use-shmem -np $NUMPROC --gm-f $CONFILE $MYEX
#
# after job has run wait 60 seconds to give
# all processes time to die
sleep 60
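For readers who prefer bash, the naming of the gm configuration file in the script above can be sketched as follows. This is an illustration only. The function name gm_conf_name and the sample job id 1234.dirac.chm.bris.ac.uk are assumptions, and the bash suffix removal used here only strips .chm.bris.ac.uk from the end of the job id, whereas the sed command in the tcsh script removes the first match anywhere in the string.

```shell
#!/bin/bash
# Hypothetical helper: build the gm configuration file name from the working
# directory and the PBS job id, dropping the ".chm.bris.ac.uk" suffix.
gm_conf_name() {
    local mydir=$1 jobid=$2
    local jobno=${jobid%.chm.bris.ac.uk}
    echo "${mydir}/gm.${jobno}.conf"
}

# Example with a made-up job id; in a job script you would pass "$PBS_JOBID"
gm_conf_name "$HOME/DLPTEST/TEST3" "1234.dirac.chm.bris.ac.uk"
```

Including the job number in the file name keeps the configuration files of different jobs in the same directory from overwriting each other.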
If you need to make your job run on the large memory nodes, add the attribute bigmem, as:
#PBS -l walltime=6:00:00,nodes=4:ppn=2:bigmem
You must use this order: the number of nodes, then the processors per node, then the bigmem attribute.