Running independent tasks with srun
This page contains examples for running similar (non-parallel) tasks in parallel in order to use all CPU resources of compute nodes. There are two sections:
- Running independent tasks in parallel demonstrates the principle using shell/bash constructs.
- Parallel processing of independent tasks with srun shows a more elegant and flexible solution.
Running independent tasks in parallel
The two-tasks-job, shown below, demonstrates the principle of running independent tasks in parallel:
- The sleep program is used to emulate work.
- Processes are started in the background by using the & control character.
- The wait command waits for completion of all background processes.
- The &-plus-wait construct works only on a single node!
two-tasks-job.sh:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --time=00:02:00
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    sleep 10 &   # start first process in the background
    sleep 10 &   # start second process in the background
    ps           # check that two sleep processes are running
    wait         # wait for completion of all background processes

    echo "elapsed time: $SECONDS s"   # will be 10 s

    exit
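Called without arguments, wait returns status 0 regardless of how the background processes ended. If failures need to be detected, one possible approach is to record each process ID from $! and wait for the processes one by one. The following is only a sketch under assumptions: false stands in for a program that fails, the sleep time is shortened, and the variable names pids and failed are chosen for illustration.

```shell
#!/bin/bash
pids=""

sleep 1 &                # emulated work that succeeds
pids="$pids $!"          # $! holds the PID of the most recent background process
false &                  # stand-in for a program that exits with an error
pids="$pids $!"

failed=0
for pid in $pids; do
    # "wait PID" returns the exit status of that particular process
    if ! wait "$pid"; then
        failed=$((failed + 1))
    fi
done

echo "number of failed processes: $failed"
```

In a batch job the counter could be used to decide whether post-processing should run at all.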
In a real batch job one would start 1 process per core and make sure that each process uses its own files, for example:
    executable1 < inputFile1 > outputFile1 2> errorMessages1 &
    executable2 < inputFile2 > outputFile2 2> errorMessages2 &
    ...
    executable16 < inputFile16 > outputFile16 2> errorMessages16 &
    wait
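When the file names follow a numbering scheme like the one above, the start commands can also be generated in a loop. The sketch below makes several assumptions for the sake of being runnable anywhere: run_task is a hypothetical stand-in for the real executable, the input files are created in a temporary directory, and only 4 processes are started instead of 1 per core.

```shell
#!/bin/bash
workdir=$(mktemp -d)     # scratch directory for the demonstration files
ntasks=4                 # assumption: in a real job, the number of cores

# Hypothetical stand-in for a real executable: copies stdin to stdout in upper case
run_task() {
    tr 'a-z' 'A-Z'
}

# Create one small input file per task
i=1
while [ "$i" -le "$ntasks" ]; do
    echo "input $i" > "$workdir/inputFile$i"
    i=$((i + 1))
done

# Start all tasks in the background, each with its own files
i=1
while [ "$i" -le "$ntasks" ]; do
    run_task < "$workdir/inputFile$i" > "$workdir/outputFile$i" 2> "$workdir/errorMessages$i" &
    i=$((i + 1))
done

wait                     # wait for completion of all background processes
echo "output files are in $workdir"
```

As with the explicit version, this works only on a single node.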
Parallel processing of independent tasks with srun
In this example the same (kind of) task is started multiple times with srun.
- srun starts as many tasks as are specified by sbatch options. In the example these options are --nodes=1 and --ntasks-per-node=16. In this case srun would start the executable demo-task.sh 16 times on 1 node.
- Option --kill-on-bad-exit=0 prevents srun from terminating all tasks if one of the executables exits with error status.
- Option --cpu-bind=cores binds each task to a (different) core. (Process-binding is an HPC optimization.)
n-tasks-job.sh:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=00:02:00
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    srun --kill-on-bad-exit=0 --cpu-bind=cores ./demo-task.sh

    exit
srun starts the same executable in parallel. The executable can use the environment variable
SLURM_PROCID to determine which work to do (which files to process). SLURM_PROCID
takes values from 0 to the total number of executables started minus 1
(in this example from 0 to 15).
demo-task.sh:

    #!/bin/bash

    echo "this is task $SLURM_PROCID"

    exit
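To illustrate how a task might select its files, the sketch below maps SLURM_PROCID (which starts at 0) to file names numbered from 1, following the inputFile/outputFile naming used earlier on this page. The helper files_for_task is hypothetical, and the fallback value 0 is an assumption that merely allows the script to be tried outside of srun.

```shell
#!/bin/bash
# Hypothetical helper: print the input and output file names for a 0-based task ID
files_for_task() {
    index=$(( $1 + 1 ))  # SLURM_PROCID starts at 0, file numbering at 1
    echo "inputFile$index outputFile$index"
}

# srun sets SLURM_PROCID; fall back to 0 (assumption) when run without srun
task_id=${SLURM_PROCID:-0}
echo "task $task_id would use: $(files_for_task "$task_id")"
```

A real task script would then call its executable with these files instead of printing their names.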