Running independent tasks with jobber
The examples on this page show how to use the RRZ tool jobber in batch jobs. (The jobber page should be read first.) The main idea of jobber is to enable filling compute nodes with single-core tasks:
- --nodes should always be 1.
- --ntasks-per-node should be the number of physical cores of a compute node. (Fewer cores can be used to increase the available memory per task.)
- If more parallelism is needed, the same jobber batch job can be submitted several times, i.e. the number of nodes used for processing a task list is given by the number of batch jobs initially submitted (in contrast to running a single job on several nodes). This possibility is a prominent feature of jobber.
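Submitting the same batch job several times can be scripted with a simple loop. The following is only a sketch: the variable names n_jobs and submit are chosen for illustration, the job count of 3 is arbitrary, and the loop performs a dry run (it prints the sbatch commands instead of executing them) unless submit is changed. It assumes the task list has already been initialised once.

```shell
# Number of identical jobber batch jobs to submit (arbitrary value for illustration).
n_jobs=3

# Dry run by default: the command is printed, not executed.
# On the cluster, set submit="sbatch" to really submit the jobs.
submit="echo sbatch"

submitted=0
for i in $(seq 1 "$n_jobs"); do
    # $submit is intentionally unquoted so that "echo sbatch" splits into two words.
    $submit jobber-all-tasks.sh
    submitted=$((submitted + 1))
done
echo "submitted $submitted job(s)"
```

With n_jobs=3, up to three nodes would work on the same task list at the same time.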
There are three examples:
- Executing all tasks in a single job: jobber-all-tasks.sh
- Executing a fixed number of tasks: jobber-n-tasks.sh
- Using jobber in a job chain: jobber-job-chain.sh
In order to try the examples, a task list needs to be generated first. In all examples the task list is called task.list. For testing, we create task lists this way:
shell$ { for i in $(seq 1 100); do echo "sleep 10; echo '--task-$i--'"; done } > task.list
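Before the task list is used, it can be worth checking that it looks as intended. The sketch below regenerates the same test list in a multi-line form and inspects it with standard tools. It assumes one task per line, as in the test list; the variable names are chosen for illustration.

```shell
# Regenerate the test task list (same content as the one-liner above).
for i in $(seq 1 100); do
    echo "sleep 10; echo '--task-$i--'"
done > task.list

# Count the lines and look at the first and last task.
n_lines=$(wc -l < task.list | tr -d ' ')
first_task=$(head -n 1 task.list)
last_task=$(tail -n 1 task.list)

echo "tasks: $n_lines"
echo "first: $first_task"
echo "last:  $last_task"
```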
Afterwards each example can be run in batch mode by entering these commands:
shell$ module load jobber
shell$ jobber task.list cleanup init
shell$ sbatch jobber-example.sh
Executing all tasks in a single job
This example shows the simplest way of using jobber in a batch job: all tasks specified in the task list are executed in a single job.
jobber-all-tasks.sh

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=00:05:00
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    module load jobber

    jobber -p $SLURM_NTASKS_PER_NODE task.list all

    exit
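Whether all tasks fit into a single job can be estimated beforehand. The sketch below does this for the test task list (100 tasks of about 10 seconds each) and the settings of jobber-all-tasks.sh. It assumes that all parallel slots are kept busy and ignores start-up overhead, so it is a rough lower bound rather than a guarantee; the variable names are chosen for illustration.

```shell
n_tasks=100        # tasks in the test task list
slots=16           # --ntasks-per-node in jobber-all-tasks.sh
task_seconds=10    # each test task sleeps for 10 s
limit_seconds=300  # --time=00:05:00

# Tasks are worked off in rounds of up to $slots tasks; round up to whole rounds.
rounds=$(( (n_tasks + slots - 1) / slots ))
estimate_seconds=$(( rounds * task_seconds ))

if [ "$estimate_seconds" -lt "$limit_seconds" ]; then
    fits=yes
else
    fits=no
fi
echo "rounds: $rounds, estimated run time: ${estimate_seconds}s, fits: $fits"
```

For these numbers the estimate is 7 rounds and 70 seconds, well below the 5 minute limit.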
Executing a fixed number of tasks
In many cases it will be impossible to execute all tasks from the task list in a single job (because this would exceed the batch job's time limit). In such cases the total number of tasks to be executed can be specified. If execution times per task are similar and tasks are executed in parallel, it is more natural to specify the number of tasks to be executed per parallel slot, because the run time of the batch job is then approximately the number of tasks per slot times the execution time per task. In the example the variable n_tasks_per_parallel_slot contains the number of tasks per parallel slot.
jobber-n-tasks.sh

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=00:01:00
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    module load jobber

    n_tasks_per_parallel_slot=2

    parallel_slots=$SLURM_NTASKS_PER_NODE
    n_tasks=$(($parallel_slots * $n_tasks_per_parallel_slot))

    jobber -p $parallel_slots task.list $n_tasks

    exit
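The arithmetic behind this example can be worked through for the test task list. The sketch below computes the number of tasks per job, the approximate run time per job, and how many submissions of jobber-n-tasks.sh would be needed to finish all 100 tasks. The names n_total, task_seconds, runtime_seconds and n_jobs are chosen for illustration, and a task time of 10 seconds is assumed.

```shell
n_total=100                  # tasks in the test task list
parallel_slots=16            # --ntasks-per-node in jobber-n-tasks.sh
n_tasks_per_parallel_slot=2  # as in jobber-n-tasks.sh
task_seconds=10              # each test task sleeps for 10 s

# Tasks executed by one job, as computed in the script.
n_tasks=$(( parallel_slots * n_tasks_per_parallel_slot ))

# Approximate run time of one job: tasks per slot times time per task.
runtime_seconds=$(( n_tasks_per_parallel_slot * task_seconds ))

# Number of submissions needed for the whole list (rounded up).
n_jobs=$(( (n_total + n_tasks - 1) / n_tasks ))

echo "tasks per job: $n_tasks"
echo "run time per job: about ${runtime_seconds}s"
echo "jobs needed: $n_jobs"
```

Each job handles 32 tasks in roughly 20 seconds, which stays within the one minute time limit, and four submissions cover the whole list.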
Using jobber in a job chain
A batch job chain is a batch job that submits itself again before it ends, unless a stop condition is met. Effectively, submitting the first job on the command line starts a whole sequence of jobs, because subsequent jobs are started automatically. In conjunction with jobber, this makes it possible to launch the execution of a long task list with a single submit command. In the example the -e/--endtime option is employed to decide when to stop executing (see also the jobber time-limit example). The end time is obtained from the batch system's squeue command.
Some care must be taken with job chains. In particular, an endless chain must be avoided:
- A job chain script should stop immediately if an error occurs (so that it does not submit the next job, which would be expected to run into the same problem again). This is achieved by setting the -eu flags of the shell. (set -x, which is set in addition, helps to trace problems.)
- The stopping mechanism must be robust. The more action of jobber is provided for that purpose.
- If re-submissions get out of control, remember that a chain implemented like jobber-job-chain.sh can be stopped by renaming the self-submitting script!
jobber-job-chain.sh

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=00:01:00
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    module load jobber

    set -eux

    task_list=task.list
    this_file=jobber-job-chain.sh

    end_time=$(squeue -h -j $SLURM_JOB_ID -O EndTime)

    jobber -p $SLURM_NTASKS_PER_NODE -e "$end_time" "$task_list" all

    jobber "$task_list" more && sbatch "$this_file"

    exit
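The control flow of the chain can be illustrated without a batch system. The following is a pure simulation, not jobber itself: the function tasks_remaining is a hypothetical stand-in for the more action (assumed to succeed as long as tasks are left), and tasks_per_job is an assumed number of tasks that one job finishes before its end time.

```shell
remaining=100     # tasks left in the simulated task list
tasks_per_job=48  # assumed number of tasks one job finishes before its end time

# Hypothetical stand-in for: jobber "$task_list" more
tasks_remaining() { [ "$remaining" -gt 0 ]; }

jobs_run=0
while true; do
    jobs_run=$((jobs_run + 1))

    # Stand-in for: jobber -p ... -e "$end_time" "$task_list" all
    if [ "$remaining" -lt "$tasks_per_job" ]; then
        processed=$remaining
    else
        processed=$tasks_per_job
    fi
    remaining=$((remaining - processed))
    echo "job $jobs_run: processed $processed, remaining $remaining"

    # Stand-in for: jobber "$task_list" more && sbatch "$this_file"
    tasks_remaining || break
done
echo "chain ended after $jobs_run job(s)"
```

In this simulation the chain stops by itself after the third job, because the check for remaining tasks fails and no further job is submitted.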