Job chains
Batch job chains are a technique for splitting very long-running jobs into jobs with reasonable run times. Technically, a chain is implemented as a single job script that resubmits itself before it ends if there is still work to be done. Use cases for job chains are:
- Processing very many independent tasks, see for example: Using jobber in a job chain.
- Software that has implemented Checkpoint/Restart.
When setting up a job chain it is important to avoid infinite chains. Two kinds of infinitely running chains are possible:
- The job chain cannot be stopped because it is impossible to cancel a running job with the batch system's cancel command. This can happen if runtimes become extremely short because, due to an error, practically no work is performed.
- An error occurs after a longer runtime but is not detected. In that situation the chain might repeat an erroneous job endlessly and waste a lot of compute time. Here the cancel command would still work, but this kind of error shows why a job chain needs to be monitored.
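One possible guard against the first kind of infinite chain is to refuse resubmission when a job has finished suspiciously fast. The following is a minimal sketch, not part of the original example; the function name `ran_long_enough` and the threshold of 30 seconds are assumptions chosen for illustration.

```bash
# Record the start time (seconds since the epoch) at the top of the job script.
start_time=$(date +%s)

# Hypothetical threshold: shortest runtime in seconds that counts as real work.
min_runtime=30

# Succeeds only if at least $2 seconds have passed since start time $1.
ran_long_enough() {
    [ $(( $(date +%s) - $1 )) -ge "$2" ]
}
```

Before submitting the follow-up job, the script could call `ran_long_enough "$start_time" "$min_runtime"` and stop the chain if the check fails.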
Advice 1 — explicit filename
- Make sure that your job chain can be stopped by renaming the job script! This works only if the filename of the script that resubmits itself appears explicitly in that script.
- Pitfalls: Renaming the original job script will not help if the follow-up job is generated by the script and piped into the submit command. Also, $0 should not be used, because in a batch job $0 does not refer to the original file but rather to a copy residing in the spooling area of the batch system.
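The advice above can be sketched as a small wrapper around the submit command. This is an illustration under stated assumptions: the function `submit_next` and the variable `SUBMIT_CMD` (an override for dry runs, defaulting to `sbatch`) are hypothetical names, and the script is assumed to run in the directory that contains `chain.sh`.

```bash
# Name of the job script, written out explicitly (not "$0").
this_file=chain.sh

# Hypothetical wrapper: submit the next job only if the script still exists
# under its original name. SUBMIT_CMD is an assumed override for dry runs.
submit_next() {
    if [ ! -f "$this_file" ]; then
        echo "$this_file not found, chain stops here" >&2
        return 1
    fi
    "${SUBMIT_CMD:-sbatch}" "$this_file" "$1"
}
```

The submit command would also fail on its own once the file has been renamed; the explicit test only makes the reason visible in the job output.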
Advice 2 — error handling
- The follow-up job shall only be submitted if no errors occurred.
- The bash flag -e can be set to have the script stop at the first command that returns a non-zero exit status.
- If one can rely on the exit status of all commands, set -e is sufficient to prevent the next job from being submitted. If the exit status is unreliable, other correctness checks must be implemented.
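When the exit status cannot be trusted, one option is to inspect the results directly. The sketch below is only an assumption about what such a check might look like: the function name `result_ok` and the completion marker `DONE` on the last line of the result file are invented for illustration.

```bash
# Hypothetical check: the step counts as successful only if the result file
# exists, is non-empty, and ends with an assumed completion marker.
result_ok() {
    [ -s "$1" ] && [ "$(tail -n 1 "$1")" = "DONE" ]
}
```

In a script running under set -e, a plain call such as `result_ok result.out` placed before the submit command would stop the script, and with it the chain, whenever the check fails.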
Job chain example
The example script below contains a job chain example. It can be copied to a directory of your own and submitted as is. The example will run 3 jobs. The criterion for stopping the chain is the value of a job counter. In practice, other stopping criteria can be implemented; for an example see: Using jobber in a job chain.
/sw/batch/examples/job-chain/chain.sh

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --time=00:02:00
    #SBATCH --output=%x.o%j
    #SBATCH --export=NONE

    source /sw/batch/init.sh

    set -eu

    # name of this file:
    this_file=chain.sh

    #
    # initial submission:
    # interrupt/stop chain:
    # submission after
    #     an interruption between jobs:
    #     a crash of a running job:
    #         according to
    #

    # process job counter
    # the counter is set to 1 if no value is given in the command line
    declare -i max_jobs=3
    declare -i this_job=${1:-1}
    declare -i next_job=$((this_job + 1))

    # print job counter information
    echo "max_jobs : $max_jobs"
    echo "this_job : $this_job"
    echo "next_job : $next_job"

    # emulate work (by doing nothing for 1 minute)
    sleep 60

    # check whether there is more work to do
    if (( next_job <= max_jobs ))
    then
        # if the working directory has been changed it has to be
        # changed back to the directory that keeps
        # in principle,
        # i.e. the explicit definition above could be dropped:
        #
        # submit next job
        sbatch "
    fi

    exit
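The counter logic of the example can be tried without a batch system. The following is a rough sketch rather than the original script: `submit` is a hypothetical stand-in that only prints what would be submitted, and `run_chain_step` is an invented name for one link of the chain.

```bash
# Hypothetical stand-in for the real submit command, so the logic can be
# tried without a batch system.
submit() { echo "would submit: $*"; }

# One link of the chain: report the counter and submit a follow-up job
# only while the next counter value does not exceed the maximum.
# The counter defaults to 1 if no second argument is given.
run_chain_step() {
    max_jobs=$1
    this_job=${2:-1}
    next_job=$((this_job + 1))
    echo "job $this_job of $max_jobs"
    if [ "$next_job" -le "$max_jobs" ]; then
        submit chain.sh "$next_job"
    fi
}
```

Calling `run_chain_step 3 1` reports the first job and a follow-up submission with counter 2, while `run_chain_step 3 3` reports the last job and submits nothing.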