Executing parallel jobs with qsub and pxargs

Introduction/Background

This is an introduction to using some of the NEX and NAS tools to easily parallelize your job execution on Pleiades.

Applicability

This is mainly applicable to running a set of jobs that use the same executable but vary the parameters and/or input data. These are often referred to as embarrassingly parallel problems, because the jobs in the set are completely independent - there is no dependence and no need to communicate between them during execution. An example in Earth sciences would be processing satellite data that is already broken down into tiles (each tile residing in a separate file), as is often the case, where the executable can be run on one tile/file at a time. The following examples show how to set up and use the combination of PBS and pxargs to execute parallel jobs on Pleiades.

What is pxargs?

The pxargs utility builds and executes command lines from a newline-delimited list in a file or from standard input. It is similar in spirit to the Linux/Unix xargs utility. In order to use pxargs, you need to set up the NEX loadable modules and then you can do:

me@linux:> module load pxargs/0.0.3-mpt.2.06a67_nex

This loads the module and its dependencies - you may check for other versions if needed using module avail.
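
For example, a minimal sketch for an interactive shell: add the NEX module path (the same /nex/modules/files path used in the PBS script further below) and list the pxargs versions that are available:

me@linux:> export MODULEPATH=$MODULEPATH:/nex/modules/files
me@linux:> module avail pxargs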

How to use pxargs

An example usage would be (don't run this on the bridge/pfe nodes - it is just an illustration of how pxargs works and it will not work without an MPI environment, which is set up on the Pleiades compute nodes):

me@linux:> mpiexec -np 16 pxargs --arg-file=/tmp/arglist.txt --proc=/my/tools/bin/process_tile

The command starts 16 processes and runs the /my/tools/bin/process_tile executable once for every line in /tmp/arglist.txt. So, assuming that process_tile is an executable or a script that can be run independently on a single tile like this:

me@linux:> process_tile --tile=/data/MOD13Q1.h08v05.hdf --segment --level=3 -v

The lines in the arglist.txt file might look like this:

me@linux:> cat arglist.txt
 --tile=/data/MOD13Q1.h08v05.hdf --segment --level=3 -v
 --tile=/data/MOD13Q1.h09v05.hdf --segment --level=3 -v
 --tile=/data/MOD13Q1.h09v06.hdf --segment --level=3 -v
 --tile=/data/MOD13Q1.h12v05.hdf --segment --level=3 -v
 --tile=/data/MOD13Q1.h12v04.hdf --segment --level=3 -v
...
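
Such a list can be generated in whatever way is convenient. For example, a minimal sketch (assuming the tiles sit in a hypothetical /data directory, as in the lines above) that writes one line per HDF file:

me@linux:> for f in /data/MOD13Q1.*.hdf; do echo " --tile=$f --segment --level=3 -v"; done > /tmp/arglist.txt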

During the execution, each line (which is just a parametrization of the execution) is prefixed with the executable or script and executed in a separate process.

IMPORTANT: Note that you can submit a lot more jobs than the number of processes. In the fictional example above we could use 16 processes to execute hundreds of jobs (i.e., hundreds of lines in the arglist.txt file), and they would be distributed evenly across the 16 processes and executed 15 at a time (15 because pxargs itself lives in one of the processes).
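
A quick sanity check before submitting is to compare the number of work items in the list with the number of MPI processes you plan to request:

me@linux:> wc -l /tmp/arglist.txt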

For more information on pxargs, please see the man pages:

me@linux:> man pxargs

Combining PBS with pxargs for large-scale executions

Before diving into the PBS+pxargs topic, it is advisable to familiarize yourself with the NAS supercomputing architecture and best practices, such as:

  1. Pleiades computing hardware
  2. Pleiades compute nodes and their memory and core configurations
  3. Pleiades queues and scheduling overview
  4. Running jobs with PBS on Pleiades

Here is an example PBS script with pxargs:

me@linux:> cat example.pbs
#PBS -S /bin/bash

# Give the run some name (up to 15 characters so that it can be easily found in the logs)
#PBS -N proc_tiles_v1

# In this example - select 20 IvyBridge nodes and on each node run 15 processes 
# That is a total of 20*15 = 300 cores
#PBS -l select=20:ncpus=15:mpiprocs=15:model=ivy

# The job has to finish within the 8-hour walltime (once it gets scheduled in),
# otherwise it will get killed
#PBS -l walltime=8:00:00

# The gid under which this job is executed (just run 'groups' to see your gids)
# And schedule the job in the "normal" queue with 8-hour runtime limit
#PBS -W group_list=s1007
#PBS -q normal

# Set up logging: join stderr into stdout (-j oe), send an e-mail when the job
# ends (-m e), export the current environment to the job (-V), and write the
# PBS output/error logs to a directory with enough space
#PBS -j oe
#PBS -m e
#PBS -V
#PBS -e /nobackupp4/somedir/runtime/pbs
#PBS -o /nobackupp4/somedir/runtime/pbs

# 20*15 = 300
CORES=300

# Include the system modules
source /usr/share/modules/init/bash

# Include NEX modules path
export MODULEPATH=$MODULEPATH:/nex/modules/files

# Select the modules you need for this run (don't forget pxargs)
# This is just an example 
module load pxargs/0.0.3-mpt.2.06a67_nex
module load comp-intel/2012.0.032 
module load mpi-sgi/mpt.2.06rp16
module load hdfeos/2.61_nex
module load hdf4/4.2.8_nex
module load geo/0.0.1_nex
module load python/2.7.3_nex
module load mysql/5.6.13_nex

# Potentially source your own environment to set env variables your executable may need
source /my/environments/env_test1.sh

# Here is the worklist as discussed above
WORKLIST=/tmp/arglist.txt

# And here is the executable
EXE=/my/tools/bin/process_tile

# Set up the log files (not the PBS logs, but your executable's logs)
LOGSTDOUT=/nobackupp4/somedir/runtime/log/test_run.stdout.log
LOGSTDERR=/nobackupp4/somedir/runtime/log/test_run.stderr.log

# Set up the temp dir in case there are core dumps or output not caught by the logs,
# otherwise they go to the home directory, which may not have that much space
TMPDIR=/nobackupp4/somedir/runtime/workdir

# Switch to the temp dir
cd $TMPDIR

# And finally execute all the jobs on 300 IvyBridge cores on Pleiades.
# Whether it completes or crashes, you will get an e-mail about the status.
# The command below is one line; see the pxargs man pages for the description of the parameters
mpiexec -np $CORES pxargs -a $WORKLIST -p $EXE -v -v -v --random-starts 2-4 --work-analyze 1> $LOGSTDOUT 2> $LOGSTDERR

And now that the PBS script is ready, all that is left to do is:

me@linux:> qsub example.pbs
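
Once the job is submitted, you can monitor it with the standard PBS commands, for example:

me@linux:> qstat -u $USER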

For more information on the various PBS commands, see the NAS HECC Knowledge Base on Running with PBS and also the man pages:

me@linux:> man qsub
and
me@linux:> man 7B pbs_resources