This is an introduction on how to use some of the NEX and NAS tools to easily parallelize your job execution on Pleiades.
This applies mainly to running a set of jobs that use the same executable but vary the parameters and/or input data. These are often referred to as embarrassingly parallel problems, because the jobs in the set are completely independent: there are no dependencies and no need to communicate between them during execution. An example in Earth sciences would be processing satellite data that is already broken down into tiles (with each tile residing in a separate file), as is often the case, where the executable can be run on one tile/file at a time. The following examples show how to set up and use the combination of PBS and pxargs to execute parallel jobs on Pleiades.
What is pxargs?
The pxargs utility builds and executes command lines from a newline-delimited list in a file or from standard input, similar to the Linux/UNIX xargs utility. To use pxargs, set up the NEX loadable modules and then run:
me@linux:> module load pxargs/0.0.3-mpt.2.06a67_nex
This loads the module and its dependencies. If needed, you can check for other available versions using module avail:
me@linux:> module avail pxargs
How to use pxargs
An example usage would be (don't run this on the bridge/pfe nodes; it is just an illustration of how pxargs works and will not run without the MPI environment, which is set up on the Pleiades compute nodes):
me@linux:> mpiexec -np 16 pxargs --arg-file=/tmp/arglist.txt --proc=/my/tools/bin/process_tile
The command starts 16 processes and executes the /my/tools/bin/process_tile executable on every line in /tmp/arglist.txt. So, assuming that process_tile is an executable or a script that can be independently run on a single tile like this:
me@linux:> process_tile --tile=/data/MOD13Q1.h08v05.hdf --segment --level=3 -v
the lines in the arglist.txt file might look like this:
me@linux:> cat arglist.txt
--tile=/data/MOD13Q1.h08v05.hdf --segment --level=3 -v
--tile=/data/MOD13Q1.h09v05.hdf --segment --level=3 -v
--tile=/data/MOD13Q1.h09v06.hdf --segment --level=3 -v
--tile=/data/MOD13Q1.h12v05.hdf --segment --level=3 -v
--tile=/data/MOD13Q1.h12v04.hdf --segment --level=3 -v
...
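An arglist file like this does not have to be written by hand. For a directory of tile files, it can be generated with a short shell loop. The sketch below uses a temporary demo directory and empty stand-in files; the option string mirrors the fictional example above:

```shell
# Build an argument list: one line of process_tile options per tile file.
# The directory, file names, and options are demo stand-ins.
DEMO=/tmp/pxargs_demo
mkdir -p "$DEMO"
touch "$DEMO/MOD13Q1.h08v05.hdf" "$DEMO/MOD13Q1.h09v05.hdf"

ARGLIST="$DEMO/arglist.txt"
: > "$ARGLIST"                  # truncate any previous list
for f in "$DEMO"/*.hdf; do
    echo "--tile=$f --segment --level=3 -v" >> "$ARGLIST"
done

cat "$ARGLIST"
```

The loop appends one worklist line per file, so wc -l on the resulting file tells you how many jobs pxargs will execute.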
During execution, each line (which is just a parameterization of the run) is prefixed with the executable or script and executed in a separate process.
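Conceptually, what pxargs does across MPI processes is equivalent to the following serial shell loop, just parallelized. This sketch uses echo as a stand-in for the real executable and a demo worklist:

```shell
# Serial equivalent of pxargs' behavior: prefix every worklist line
# with the executable and run the result.
EXE=echo   # stand-in for a real processing executable
printf '%s\n' '--tile=a.hdf -v' '--tile=b.hdf -v' > /tmp/pxargs_serial_demo.txt

while read -r args; do
    $EXE $args   # pxargs would hand each of these to a separate MPI process
done < /tmp/pxargs_serial_demo.txt > /tmp/pxargs_serial_out.txt

cat /tmp/pxargs_serial_out.txt
```

Note that $args is deliberately unquoted so that the line is split back into individual arguments, just as pxargs passes them to the executable.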
IMPORTANT: Note that you can submit many more jobs than the number of processes. In the fictional example above, we could use 16 processes to execute hundreds of jobs (= hundreds of lines in the arglist.txt file); they would be distributed across the processes and executed 15 at a time (15 because pxargs itself occupies one of the 16 processes).
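The arithmetic for the fictional 16-process example can be sketched as follows (the worklist length of 300 is a hypothetical number, not from the example):

```shell
# Of the 16 MPI ranks, pxargs itself occupies one, so 15 do the work.
RANKS=16
WORKERS=$((RANKS - 1))
LINES=300                        # hypothetical worklist length
PER_WORKER=$((LINES / WORKERS))  # average work items per worker
echo "$WORKERS workers, ~$PER_WORKER work items each"
# → 15 workers, ~20 work items each
```

Because pxargs hands out work dynamically, slow items do not stall the whole run; this is only the average load per worker.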
For more information on pxargs, please see the man page:
me@linux:> man pxargs
Combining PBS with pxargs for large-scale executions
Before diving into the PBS+pxargs topic, it is advisable to familiarize yourself with the NAS supercomputing architecture and best practices, such as:
- Pleiades computing hardware
- Pleiades compute nodes and their memory and core configurations
- Pleiades queues and scheduling overview
- Running jobs with PBS on Pleiades
Here is an example PBS script with pxargs:
me@linux:> cat example.pbs
#PBS -S /bin/bash

# Give the run a name (up to 15 characters so that it can be easily found in the logs)
#PBS -N proc_tiles_v1

# In this example - select 20 IvyBridge nodes and on each node run 15 processes
# That is a total of 20*15 = 300 cores
#PBS -l select=20:ncpus=15:mpiprocs=15:model=ivy

# The walltime of the job (once it gets scheduled) must be less than 8 hours,
# otherwise it will get killed
#PBS -l walltime=8:00:00

# Under which gid this job is executed (run "groups" to see your gids),
# and schedule the job in the "normal" queue with its 8-hour runtime limit
#PBS -W group_list=s1007
#PBS -q normal

# Set up logging
#PBS -j oe
#PBS -m e
#PBS -V
#PBS -e /nobackupp4/somedir/runtime/pbs
#PBS -o /nobackupp4/somedir/runtime/pbs

# 20*15 = 300
CORES=300

# Include the system modules
source /usr/share/modules/init/bash

# Include the NEX modules path
export MODULEPATH=$MODULEPATH:/nex/modules/files

# Select the modules you need for this run (don't forget pxargs)
# This is just an example
module load pxargs/0.0.3-mpt.2.06a67_nex
module load comp-intel/2012.0.032
module load mpi-sgi/mpt.2.06rp16
module load hdfeos/2.61_nex
module load hdf4/4.2.8_nex
module load geo/0.0.1_nex
module load python/2.7.3_nex
module load mysql/5.6.13_nex

# Potentially source your own environment to set env variables your executable may need
source /my/environments/env_test1.sh

# Here is the worklist as discussed above
WORKLIST=/tmp/arglist.txt

# And here is the executable
EXE=/my/tools/bin/process_tile

# Set up log files (not PBS logs, but your executable's logs)
LOGSTDOUT=/nobackupp4/somedir/runtime/log/test_run.stdout.log
LOGSTDERR=/nobackupp4/somedir/runtime/log/test_run.stderr.log

# Set up the temp dir in case there are core dumps or output not caught by the logs;
# otherwise they go to the home directory, which may not have that much space
TMPDIR=/nobackupp4/somedir/runtime/workdir

# Switch to the temp dir
cd $TMPDIR

# And finally execute all the jobs on 300 IvyBridge cores on Pleiades.
# Whether it completes or crashes, you will get an e-mail about the status.
# The below is one line; see the pxargs man page for the description of the parameters
mpiexec -np $CORES pxargs -a $WORKLIST -p $EXE -v -v -v --random-starts 2-4 --work-analyze 1> $LOGSTDOUT 2> $LOGSTDERR
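Before submitting a large run, it pays to sanity-check the inputs the script depends on. The sketch below uses a demo worklist path in place of the hypothetical /tmp/arglist.txt from the script:

```shell
# Pre-flight check before qsub: the worklist must exist and be
# non-empty, and its line count is the number of jobs pxargs will run.
WORKLIST=/tmp/pxargs_preflight_demo.txt
printf '%s\n' '--tile=a.hdf' '--tile=b.hdf' '--tile=c.hdf' > "$WORKLIST"

if [ ! -s "$WORKLIST" ]; then
    echo "worklist missing or empty: $WORKLIST" >&2
    exit 1
fi

NJOBS=$(wc -l < "$WORKLIST")
echo "jobs to run: $NJOBS"
```

Catching an empty or misplaced worklist here is much cheaper than discovering it after the job has waited in the queue.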
And now that the PBS script is ready, all that is left to do is:
me@linux:> qsub example.pbs
For more information on the various PBS commands, see the NAS HECC Knowledge Base on Running with PBS and also the man pages:
me@linux:> man qsub
and
me@linux:> man 7B pbs_resources