Running independent tasks with jobber

The examples on this page show how to use the RRZ tool jobber in batch jobs. (The jobber page should be read first.) The main idea of jobber is to enable filling compute nodes with single-core tasks:

--nodes should always be 1.
--ntasks-per-node should be the number of physical cores of a compute node. (Fewer cores can be used to increase the available memory per task.)
If more parallelism is needed, the same jobber batch job can be submitted several times, i.e. the number of nodes used for processing a task list is given by the number of batch jobs initially submitted (in contrast to running a single job on several nodes). This possibility is a prominent feature of jobber.
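The last point can be sketched as a small submission loop. This is a minimal sketch and not part of jobber: the names n_jobs, job_script and submit_cmd are hypothetical, and by default the loop only prints the sbatch commands instead of running them (set submit_cmd=sbatch to really submit). It assumes the task list has already been initialised as described further down.

```shell
# Hypothetical helper loop: submit the same jobber batch script several times.
# Each submitted job processes tasks from the same task list on its own node.
n_jobs=3                                  # number of nodes to use (assumption)
job_script=jobber-example.sh              # placeholder script name
submit_cmd="${submit_cmd:-echo sbatch}"   # dry run unless submit_cmd=sbatch

submitted=0
for i in $(seq 1 "$n_jobs"); do
    $submit_cmd "$job_script"
    submitted=$((submitted + 1))
done
echo "submitted $submitted job(s)"
```

With the default dry run the loop prints three sbatch command lines, which can be checked before switching to a real submission.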

There are three examples:

Executing all tasks in a single job
Executing a fixed number of tasks
Using jobber in a job chain

In order to try the examples a task list needs to be generated first. In all examples the task list is called task.list. For testing we create task lists this way:

shell$ { for i in $(seq 1 100); do echo "sleep 10; echo '--task-$i--'"; done } > task.list
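The test command above writes one shell command line per task. Assuming that this one-line-per-task layout is all a task list needs, the same pattern can generate one task per parameter value. This is only a sketch: the program ./my_program, its --param option and the file name params-task.list are hypothetical placeholders.

```shell
# Sketch: write one task (one shell command line) per parameter value.
# ./my_program, --param and params-task.list are hypothetical names.
task_file=params-task.list
for p in 0.1 0.2 0.5 1.0; do
    echo "./my_program --param $p > result-$p.out"
done > "$task_file"

# Count the generated tasks as a quick sanity check.
n_lines=$(wc -l < "$task_file")
echo "$task_file contains $((n_lines)) tasks"
```

Counting the lines before submitting is a cheap way to confirm that the list contains the expected number of tasks.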

Afterwards each example can be run in batch mode by entering these commands (jobber-example.sh stands for the name of the example script):

shell$ module load jobber
shell$ jobber task.list cleanup init
shell$ sbatch jobber-example.sh

Executing all tasks in a single job

This example shows the simplest way of using jobber in a batch job: all tasks specified in the task list are executed in a single job.

jobber-all-tasks.sh:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:05:00
#SBATCH --export=NONE

source /sw/batch/init.sh

module load jobber

jobber -p $SLURM_NTASKS_PER_NODE task.list all

exit

Executing a fixed number of tasks

In many cases it will be impossible to execute all tasks from the task list in a single job, because this would exceed the batch job's time limit. In such cases the total number of tasks to be executed can be specified. If execution times per task are similar and tasks are executed in parallel, it is more natural to specify the number of tasks to be executed per parallel slot: the run time of the batch job is then approximately the number of tasks per slot times the execution time per task. In the example the variable n_tasks_per_parallel_slot contains the number of tasks per parallel slot.
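The run time estimate can be turned around to choose the number of tasks per slot from the time limit. The following is a minimal sketch of that arithmetic; all numbers are example assumptions, and safety_margin_s is a hypothetical reserve that is not part of jobber.

```shell
# Sketch: estimate how many tasks fit into one parallel slot.
# All numbers are example assumptions, not measured values.
time_limit_s=3600      # batch job time limit in seconds (--time=01:00:00)
task_time_s=300        # estimated run time of one task in seconds
safety_margin_s=600    # hypothetical reserve for start-up and slow tasks
parallel_slots=16      # value of --ntasks-per-node

# Integer division rounds down, so the estimate stays on the safe side.
n_tasks_per_parallel_slot=$(( (time_limit_s - safety_margin_s) / task_time_s ))
n_tasks=$(( parallel_slots * n_tasks_per_parallel_slot ))
echo "tasks per slot: $n_tasks_per_parallel_slot, total tasks: $n_tasks"
```

With these numbers the sketch prints 10 tasks per slot and 160 tasks in total. The estimate only holds if execution times per task really are similar.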

jobber-n-tasks.sh:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:01:00
#SBATCH --export=NONE

source /sw/batch/init.sh

module load jobber

n_tasks_per_parallel_slot=2

parallel_slots=$SLURM_NTASKS_PER_NODE
n_tasks=$(($parallel_slots * $n_tasks_per_parallel_slot))

jobber -p $parallel_slots task.list $n_tasks

exit

Using jobber in a job chain

A batch job chain is a batch job that submits itself again before it ends unless a stop condition is met. Effectively, submitting the first job on the command line starts a whole sequence of jobs, because subsequent jobs are started automatically. In conjunction with jobber this makes it possible to launch the execution of a long task list with a single submit command. In the example the -e/--endtime option is used to decide when to stop executing tasks (see also the jobber time-limit example). The end time is obtained from the batch system's squeue command.
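The control flow of such a chain can be illustrated with a small simulation. This sketch makes no jobber or sbatch calls: remaining, tasks_per_job, run_job and more_tasks are hypothetical stand-ins for the task list, the end-time limit, the jobber run and the stop condition.

```shell
# Simulation only: no jobber or sbatch calls are made here.
# remaining, tasks_per_job, run_job and more_tasks are hypothetical stand-ins.
remaining=5            # tasks still waiting in the (imaginary) task list
tasks_per_job=2        # tasks one job manages before its end time

# One job processes up to tasks_per_job tasks.
run_job()    { remaining=$(( remaining > tasks_per_job ? remaining - tasks_per_job : 0 )); }
# Stop condition: succeeds only while tasks are left.
more_tasks() { [ "$remaining" -gt 0 ]; }

jobs_run=0
while :; do
    run_job
    jobs_run=$((jobs_run + 1))
    # A real chain would resubmit itself here; the simulation just loops.
    more_tasks || break
done
echo "chain finished after $jobs_run job(s)"
```

With five tasks and two tasks per job the simulated chain ends after three jobs, because the third job finds no tasks left and does not continue the chain.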

Some care must be taken with job chains. In particular, an endless chain must be avoided:

A job chain script should immediately stop if an error occurs (in order to not submit the next job which is expected to run into the same problem again). This is achieved by setting the -eu flags of the shell. (set -x, which is set in addition, helps to trace back problems.)
The stopping mechanism must be robust. jobber provides the more action for this purpose: in the example the next job is submitted only if jobber "$task_list" more succeeds.
If re-submissions get out of control, remember that a chain implemented like jobber-job-chain.sh can be stopped by renaming the self-submitting script!
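How the -e flag interacts with a command && sbatch style stop condition can be checked in isolation. In this sketch false is a stand-in: in the first line for any failing command, in the second for a stop check that reports that nothing is left to do (assuming the check signals this through a non-zero exit status).

```shell
# A failing command stops the script under 'set -e' ...
out_error=$(bash -c 'set -eu; false; echo "not reached"' || true)
# ... but a failing left-hand side of '&&' does not; the script simply
# skips the right-hand side (the resubmission) and continues.
out_stop=$(bash -c 'set -eu; false && echo "resubmitted"; echo "reached end"')

echo "after error: '$out_error'"
echo "after stop condition: '$out_stop'"
```

The first command prints nothing because the script aborts at the error, while the second prints only "reached end": the chain stops without the stop condition itself being treated as an error.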

jobber-job-chain.sh:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:01:00
#SBATCH --export=NONE

source /sw/batch/init.sh

module load jobber

set -eux

task_list=task.list
this_file=jobber-job-chain.sh

end_time=$(squeue -h -j $SLURM_JOB_ID -O EndTime)

jobber -p $SLURM_NTASKS_PER_NODE -e "$end_time" "$task_list" all

jobber "$task_list" more && sbatch "$this_file"

exit
