Running independent tasks with srun
This page contains examples for running similar (non-parallel) tasks in parallel in order to use all CPU resources of compute nodes. There are two sections:
- Running independent tasks in parallel demonstrates the principle using shell/bash constructs.
- Parallel processing of independent tasks with srun shows a more elegant and flexible solution.
Running independent tasks in parallel
The two-tasks-job, shown below, demonstrates the principle of running independent tasks in parallel:
- The sleep program is used to emulate work.
- Processes are started in the background by using the & control character.
- The wait command waits for completion of all background processes.
- The &-plus-wait construct works only on a single node!
two-tasks-job.sh

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:02:00
#SBATCH --export=NONE

source /sw/batch/init.sh

sleep 10 &   # start first process in the background
sleep 10 &   # start second process in the background
ps           # check that two sleep processes are running
wait         # wait for completion of all background processes

echo "elapsed time: $SECONDS s"   # will be 10 s

exit
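Called without arguments, wait returns exit status 0 no matter how the background processes ended. If a job needs to know whether individual tasks failed, one option is to record each process ID and wait for them one by one. The following is a minimal sketch, not part of the job script above: sleep and false are stand-ins for real work, and the variable names pids and failed are chosen for illustration only.

```shell
#!/bin/bash
# Sketch: collect the exit status of every background process.

pids=()

sleep 1 &        # stand-in for a task that succeeds
pids+=($!)       # $! holds the process ID of the last background process

false &          # stand-in for a task that fails
pids+=($!)

failed=0
for pid in "${pids[@]}"; do
    # "wait PID" returns the exit status of that particular process
    if ! wait "$pid"; then
        echo "process $pid exited with an error"
        failed=$((failed + 1))
    fi
done

echo "number of failed tasks: $failed"
```
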
In a real batch job one would start 1 process per core and make sure that each process uses its own files, for example:
executable1 < inputFile1 > outputFile1 2> errorMessages1 &
executable2 < inputFile2 > outputFile2 2> errorMessages2 &
...
executable16 < inputFile16 > outputFile16 2> errorMessages16 &
wait
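When the file names follow a numbering scheme like the one above, the individual lines can be generated by a loop. The sketch below assumes such a scheme and is meant for trying out the pattern on any machine: tr stands in for the real executable, only 4 tasks are started, and the input files are created in a temporary directory (workdir and ntasks are illustrative names).

```shell
#!/bin/bash
# Sketch: start one background process per numbered input file.

ntasks=4                      # a real job would use the number of cores
workdir=$(mktemp -d)          # scratch directory for this demonstration

# create small input files so that the example is self-contained
for i in $(seq 1 "$ntasks"); do
    echo "input $i" > "$workdir/inputFile$i"
done

# each process reads and writes its own files
for i in $(seq 1 "$ntasks"); do
    tr '[:lower:]' '[:upper:]' \
        < "$workdir/inputFile$i" \
        > "$workdir/outputFile$i" \
        2> "$workdir/errorMessages$i" &
done

wait                          # wait for completion of all background processes

cat "$workdir"/outputFile*
```

Because every process has its own input, output and error file, the processes cannot overwrite each other's results.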
Parallel processing of independent tasks with srun
In this example the same (kind of) task is started multiple times with srun.
- srun starts as many tasks as are specified by sbatch options. In the example these options are --nodes=1 and --ntasks-per-node=16. In this case srun would start the executable demo-task.sh 16 times on 1 node.
- Option --kill-on-bad-exit=0 prevents srun from terminating all tasks if one of the executables exits with error status.
- Option --cpu-bind=cores binds each task to a (different) core. (Process-binding is an HPC optimization.)
n-tasks-job.sh

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:02:00
#SBATCH --export=NONE

source /sw/batch/init.sh

srun --kill-on-bad-exit=0 --cpu-bind=cores ./demo-task.sh

exit
srun starts the same executable in parallel. The executable can use the environment variable SLURM_PROCID to determine which work to do (which files to process). SLURM_PROCID takes values from 0 to the total number of executables started minus 1 (in this example from 0 to 15).
demo-task.sh

#!/bin/bash

echo "this is task $SLURM_PROCID"

exit
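How a task script maps SLURM_PROCID to its files depends on the application. As one possibility, the sketch below derives numbered file names from the task rank. The helper function files_for_rank and the file names inputFileN/outputFileN are assumptions made for illustration. Since SLURM_PROCID is not set outside of srun, the script falls back to rank 0 so that it can be tried on any machine.

```shell
#!/bin/bash
# Sketch: select per-task files from the task rank.

# Hypothetical helper: print the input and output file name for a rank.
files_for_rank() {
    local rank=$1
    local index=$((rank + 1))    # ranks start at 0, file names at 1
    echo "inputFile$index outputFile$index"
}

# use rank 0 if the script is not started by srun
rank=${SLURM_PROCID:-0}

read -r input output <<< "$(files_for_rank "$rank")"
echo "this is task $rank: $input -> $output"
```

A real task script would then run its executable with "$input" and "$output" instead of only printing the names.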