RRZ tools
rrz-batch-jobreport
This page is under construction!
The command rrz-batch-jobreport displays information for a batch job with a given job ID:
rrz-batch-jobreport jobID
Resources shown are: CPU, memory, disk, network and possibly GPU and GPU-memory.
rrz-batch-jobreport can be called for completed as well as running batch jobs. At the end of every batch job, a job report is written automatically to the job’s stderr file.
Users should check at least the job report summaries regularly.
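Because the job report ends up in the job’s stderr file, checking summaries regularly can be scripted. The following is a minimal sketch, not part of rrz-batch-jobreport: it assumes that the summary starts with the marker "Summary:" (as in the examples on this page) and runs to the end of the file. The function name extract_summary and the sample text are hypothetical.

```python
def extract_summary(stderr_text):
    """Return the text from the last 'Summary:' marker to the end, or None."""
    # Assumption: the report summary is the last thing written to stderr.
    start = stderr_text.rfind("Summary:")
    if start == -1:
        return None
    return stderr_text[start:].strip()


# Hypothetical stderr content: output of the job itself, then the summary.
example = (
    "warning: something printed by the job itself\n"
    "Summary: Elapsed time: 7% (0.2 out of 3.0 h timelimit) "
    "CPU: 100% (8.0 out of 8 physical CPU cores)\n"
)
print(extract_summary(example))
```

To apply this to a real job, read the contents of the job’s stderr file into a string first; the file name depends on how the job was submitted.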
Motivation
Reasons for checking resource usage are:
- to improve resource utilisation,
- to find performance bottlenecks.
Shared and non-shared resources
rrz-batch-jobreport collects data from the operating system on a per-node basis. For jobs that used full nodes, all resources of those nodes are non-shared, i.e. all data displayed by rrz-batch-jobreport belongs to that job. For smaller jobs, which share(d) a node with other jobs, it is important to keep in mind that some numbers shown by rrz-batch-jobreport do not apply to the job but rather to the whole node.
Shared resources are:
- network
- disks
Non-shared resources are:
- CPU and memory
- GPU and GPU-memory
Technically this is achieved by not sharing the largest data caches and by putting batch jobs into cgroups.
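The rule for reading the numbers can be written down as a small lookup. This is only an illustrative sketch of the classification above; the names SHARED, NON_SHARED and scope are hypothetical and not part of the tool.

```python
# Classification taken from the two lists above.
SHARED = {"network", "disk"}
NON_SHARED = {"CPU", "memory", "GPU", "GPU-memory"}


def scope(resource, used_full_nodes):
    """Return 'job' if a reported number describes the job alone, else 'node'."""
    if resource not in SHARED | NON_SHARED:
        raise ValueError(f"unknown resource: {resource}")
    # On full nodes nothing is shared; otherwise only non-shared
    # resources are attributable to the job itself.
    if used_full_nodes or resource in NON_SHARED:
        return "job"
    return "node"


print(scope("network", used_full_nodes=False))  # node
print(scope("CPU", used_full_nodes=False))      # job
```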
Summary
The summary gives a quick overview and hints at possible actions. It appears at the end of the report, where it is easy to find. Examples for a CPU job and a GPU job:
Summary:
  Elapsed time: 7% (0.2 out of 3.0 h timelimit)
  CPU: 100% (8.0 out of 8 physical CPU cores)
  Max. main memory: 3% (0.9 out of 31.3 GiB min. available per node)

Summary:
  Elapsed time: 35% (0.4 out of 1.0 h timelimit)
  GPU: 70% (0.7 out of 1 GPUs)
  CPU: 12% (1.0 out of 8 physical CPU cores)
  Max. GPU memory: 3% (2.4 out of 79.7 GiB per GPU)
  Max. main memory: 2% (2.9 out of 141.1 GiB min. available per node)
Possible actions are:
- If the elapsed time is small compared with the time limit, the latter can and should be lowered. Shorter time limits increase the probability of finding a free slot for a job, which improves the overall throughput.
- For CPU jobs low CPU utilisation is a primary call for improvement.
- For GPU jobs low GPU utilisation is a primary call for improvement.
- Low memory utilisation can indicate that smaller nodes or fewer nodes could be used.
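The actions above can be derived mechanically from a summary. The following sketch assumes that every entry has the form "label: N% (...)" as in the examples on this page; real reports may differ. The 50% threshold and the names ENTRY, parse_summary and suggest_actions are hypothetical choices for illustration, not recommendations of the tool.

```python
import re

# One summary entry, e.g. "CPU: 100% (8.0 out of 8 physical CPU cores)".
# Assumption: the layout matches the example summaries on this page.
ENTRY = re.compile(r"(?P<label>[A-Za-z. ]+?):\s*(?P<percent>\d+)%\s*\([^)]*\)")


def parse_summary(text):
    """Return {label: percent} for every entry found in a summary."""
    return {m["label"].strip(): int(m["percent"]) for m in ENTRY.finditer(text)}


def suggest_actions(summary, low=50):
    """Translate percentages below `low` into the actions listed above."""
    hints = []
    if summary.get("Elapsed time", 100) < low:
        hints.append("lower the time limit")
    # A job is treated as a GPU job if the summary contains a GPU entry.
    main = "GPU" if "GPU" in summary else "CPU"
    if summary.get(main, 100) < low:
        hints.append(f"improve {main} utilisation")
    if summary.get("Max. main memory", 100) < low:
        hints.append("consider smaller or fewer nodes")
    return hints


cpu_job = (
    "Summary: Elapsed time: 7% (0.2 out of 3.0 h timelimit) "
    "CPU: 100% (8.0 out of 8 physical CPU cores) "
    "Max. main memory: 3% (0.9 out of 31.3 GiB min. available per node)"
)
print(suggest_actions(parse_summary(cpu_job)))
# ['lower the time limit', 'consider smaller or fewer nodes']
```

The pattern works whether the entries are on one line or on separate lines, because it looks for each "label: N% (...)" group independently.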