Job chains

Batch job chains are a technique for splitting a very long-running job into several jobs of reasonable run time. Technically, a chain is implemented as a single job script that resubmits itself before it ends if there is still work to be done. Use cases for job chains are:

Processing very many independent tasks, see for example: Using jobber in a job chain.
Software that has implemented Checkpoint/Restart.

When setting up a job chain, it is important to avoid infinite chains. Two kinds of infinitely running chains are possible:

The job chain cannot be stopped because it is impossible to cancel a running job with the batch system's cancel command. This can happen if runtimes become extremely short because, due to an error, practically no work is performed: each job has already ended and submitted its successor before it can be cancelled.
An error occurs after a longer runtime but is not detected. In that situation the chain might repeat an erroneous job endlessly and waste a lot of compute time. Here the cancel command would still work, but this kind of error shows why a job chain needs to be monitored.
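One possible guard against the first kind of infinite chain is to resubmit only if the job ran for a plausible amount of time. The following is a minimal sketch, not part of the example script: the threshold min_runtime and the function ran_long_enough are hypothetical names, and a sensible threshold depends on the application.

```shell
#!/bin/bash
# Sketch: refuse to resubmit when a job finished suspiciously fast.
# min_runtime and ran_long_enough are hypothetical names.

# assumed lower bound (in seconds) for a job that did real work
declare -i min_runtime=300

# return 0 if the elapsed time (argument 1, in seconds) reaches min_runtime
ran_long_enough() {
    local -i elapsed=$1
    (( elapsed >= min_runtime ))
}

# demonstration with a fixed elapsed time of 12 seconds
if ran_long_enough 12
then
    echo "resubmit"
else
    echo "stop chain: job ran for only 12 s"
fi
```

In a real job script the elapsed time could be taken from the bash variable SECONDS, which counts the seconds since the shell was started, for example: if ran_long_enough "$SECONDS"; then sbatch "$this_file" "$next_job"; fi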

Advice 1 — explicit filename

Make sure that your job chain can be stopped by renaming the job script! This works only if the filename of the script that resubmits itself appears explicitly in that script.
Pitfalls: Renaming the original job script will not help if the follow-up job is generated by the script and piped into the submit command. Also, $0 should not be used, because in a batch job $0 does not refer to the original file but to a copy residing in the spooling area of the batch system.
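The effect of Advice 1 can be sketched as follows. The sketch is an illustration under assumptions, not the site's reference implementation: the function resubmit and the variable submit_cmd are hypothetical, and submit_cmd exists only so that the logic can be tried without a batch system (it defaults to sbatch). The explicit existence check is not strictly required, because the submit command fails anyway if the file is missing, but it produces a clearer message.

```shell
#!/bin/bash
# Sketch: resubmit only while the explicitly named script still exists.

# explicit name of the job script (deliberately not $0)
this_file=chain.sh

# hypothetical hook: the command used for submission, sbatch by default
submit_cmd=${submit_cmd:-sbatch}

# submit the next job of the chain; argument 1 is the next job counter
resubmit() {
    local -i next_job=$1
    if [[ -f $this_file ]]
    then
        "$submit_cmd" "$this_file" "$next_job"
    else
        # the script was renamed or removed: the chain ends here
        echo "$this_file not found: chain stopped" >&2
        return 1
    fi
}
```

After mv chain.sh chain.sh.stopped the next call of resubmit returns a non-zero status and no follow-up job is submitted.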

Advice 2 — error handling

The follow-up job should only be submitted if no errors occurred.
The bash option -e (set -e) makes the script stop at the first command that returns a non-zero exit status.
If the exit status of all commands can be relied on, set -e is sufficient to prevent the next job from being submitted. If the exit status is unreliable, other correctness checks must be implemented.
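What such an additional correctness check could look like is sketched below. Everything specific in it is an assumption made for illustration: the function name result_ok, the result file, and the marker line "RUN COMPLETED" that the application is assumed to write as its last output line on success.

```shell
#!/bin/bash
# Sketch: a correctness check for programs whose exit status is unreliable.
# result_ok and the marker string are hypothetical.

# return 0 only if the result file (argument 1) exists, is non-empty
# and ends with the marker line written on success
result_ok() {
    local result_file=$1
    [[ -s $result_file ]] && [[ $(tail -n 1 "$result_file") == "RUN COMPLETED" ]]
}
```

In a job script that uses set -e, calling result_ok on its own line before the submit command is enough: if the check fails, the script stops there and the follow-up job is never submitted.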

Job chain example

The script below is a complete job chain example. It can be copied to a directory of your own and submitted as is. The example will run 3 jobs. The criterion for stopping the chain is the value of a job counter. In practice, other stopping criteria can be implemented; for an example see: Using jobber in a job chain.

/sw/batch/examples/job-chain/chain.sh

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:02:00
#SBATCH --output=%x.o%j
#SBATCH --export=NONE

source /sw/batch/init.sh

set -eu

# name of this file:
this_file=chain.sh

# sbatch must be called in the directory that keeps $this_file
# initial submission: sbatch $this_file
# interrupt/stop chain: mv $this_file $this_file.stopped
# submission after
# an interruption between jobs: sbatch $this_file $next_job
# a crash of a running job: sbatch $this_file $this_job

# according to --output=%x.o%j log-files will be: $this_file.o$SLURM_JOB_ID

# process job counter
# the counter is set to 1 if no value is given in the command line
declare -i max_jobs=3
declare -i this_job=${1:-1}
declare -i next_job=$((this_job + 1))

# print job counter information
echo "max_jobs : $max_jobs"
echo "this_job : $this_job"
echo "next_job : $next_job"

# emulate work (by doing nothing for 1 minute)
sleep 60

# check whether there is more work to do
if (( next_job <= max_jobs ))
then
    # if the working directory has been changed it has to be
    # changed back to the directory that keeps $this_file

    # in principle, $this_file can be obtained from the batch system,
    # i.e. the explicit definition above could be dropped:
    # this_file="$(squeue -j "$SLURM_JOB_ID" -h -o %o)"
    # submit next job
    sbatch "$this_file" "$next_job"
fi

exit
