RRZ tools
rrz-batch-jobreport
The command rrz-batch-jobreport displays information for a batch job with a given job ID:
rrz-batch-jobreport jobID
Resources shown are: CPU, memory, disk, network and possibly GPU and GPU-memory.
rrz-batch-jobreport can be called for completed as well as for running batch jobs. At the end of every batch job a report is written to the job’s stderr file automatically. Users should at least check the job report summaries regularly.
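A minimal usage sketch (the job ID below is a placeholder, and the stderr file name assumes SLURM’s default output naming):
# inspect a running or completed job by its job ID
rrz-batch-jobreport 1234567
# the automatic report of a finished job ends up in the job’s stderr file,
# e.g. with the default naming scheme:
less slurm-1234567.out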
Overview
- Motivation
- Limitations
- Shared and non-shared resources
- Taking advantage of job reports
- The job report explained
Motivation
Motivations for checking resource usage are
- to improve resource utilisation (in particular of CPUs, GPUs and RAM),
- to find performance bottlenecks (in particular for I/O).
Limitations
Job reports must not be confused with performance reports: they only check utilisation, not performance. The goal is to achieve good performance. Good utilisation is a necessary but not sufficient condition for performance: performance can be low even if CPU or GPU utilisation looks good. For completeness it should be mentioned that even performance is not the ultimate criterion. The latter is the shortest possible execution time (and today also low energy consumption). This requires considering algorithms in the first place: shorter execution times might be achieved with a better algorithm even though its performance (measured in operations per second) might be lower.
It is important to know that parallel performance (good scaling) is not guaranteed if all CPU cores are busy. Therefore, explicit scaling tests are mandatory.
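A minimal sketch of such a scaling test (script name, core counts and job names are placeholders; adapt them to your application):
# submit the same job with increasing core counts and compare the elapsed times
for n in 1 2 4 8 16; do
    sbatch --cpus-per-task=$n --job-name=scale_$n my_job.sh
done
# good scaling: doubling the core count roughly halves the elapsed time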
Another limitation is the reproducibility of time measurements. Time measurements are only reasonably reproducible if the hardware is used exclusively, which, in general, is not the case on a cluster where hardware is shared with other users. Nonetheless, parts of the hardware can be considered non-shared.
Shared and non-shared resources
rrz-batch-jobreport collects data from the operating system on a per-node basis (in contrast to per-job). For jobs that use(d) full nodes, all resources of the nodes used are non-shared, i.e. all data displayed by rrz-batch-jobreport is data of the corresponding job. For smaller jobs, which share(d) a node with other jobs, it is important to keep in mind that some numbers shown by rrz-batch-jobreport do not apply to the job but rather to the whole node.
Shared resources
These resources are always shared (unless the whole cluster is used exclusively):
- network
- disks of distributed file systems
Non-shared resources
Typically, a node of a cluster is a non-shared resource, i.e. non-shared resources can be:
- CPU and memory (RAM)
- GPU and GPU-memory
- local disks (not available on Hummel-2)
Hummel-2 is configured in such a way that CPU, GPU and memory can in practice be considered to be non-shared resources. Technically this is achieved by not sharing the largest data caches and by putting batch jobs into cgroups.
Taking advantage of job reports
Job reports can help to improve machine utilisation in the following ways:
CPU and GPU utilisation. CPUs and GPUs are the most expensive components of an HPC system. Therefore, one should strive for high CPU or GPU utilisation, respectively. There are three main reasons for under-utilisation (a sketch for the first case follows the list):
- some CPUs/GPUs are not used at all (typically as a consequence of a bad job specification)
- CPUs/GPUs are waiting for disk I/O operations (reasons can be that disks are heavily used by other users or that the program itself has an I/O bottleneck)
- there is not enough parallel work to fully utilise the CPU/GPU (recall that both are parallel computing devices by themselves)
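The first case, a bad job specification, is often a mismatch between the cores requested from SLURM and the threads or processes the program actually starts. A minimal sketch for an OpenMP program (the script contents and program name are placeholders, not the recommended template for Hummel-2):
#!/bin/bash
#SBATCH --cpus-per-task=8                      # request 8 physical cores ...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # ... and actually start 8 threads
srun ./my_openmp_program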
Memory high water mark. Peak memory usage determines whether a program fits into the memory of a compute node or a GPU. If this value is known, smaller compute nodes can be used, if available, or more load could be put onto a GPU, if its compute capability is under-utilised, too. The value can also be used to estimate the maximal problem size that would fit into a given type of node or GPU.
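A rough back-of-the-envelope sketch, assuming memory grows roughly linearly with the problem size (this linear scaling is an assumption; check how your code actually behaves). Taking the numbers from the CPU job example further below (0.9 GiB peak, 31.3 GiB available per node share) and a hypothetical problem size of 1e6 grid points:
awk 'BEGIN { printf "%.1e grid points\n", 1e6 * 31.3 / 0.9 }'
# -> about 3.5e+07 grid points would fit into the same allocation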
Disk usage. The reports show how much data is read from and written to the disk systems. Unfortunately, the disk usage part is harder to understand than CPU, GPU and memory. Nonetheless, it should be possible to figure out whether a job is slowed down by heavy I/O. Note that I/O to the virtual file systems of the compute nodes (i.e. /tmp and /dev/shm) is not counted as I/O by the operating system (because it is memory traffic) and hence cannot appear in the report.
Communication network traffic influences the compute performance of multi-node jobs and I/O to distributed file systems.
The job report explained
In the following, the job report is explained section by section. Differences for GPU and multi-node job reports are explained as well.
Header
The header contains the following information: SLURM partition (std or gpu in the examples below), job ID, CPU and GPU type, elapsed (wall clock) time, and machine usage in node-hours.
- CPU job
══════════════════════════════════════════════════
RRZ: Overview of std job xxxxxx resources (v39.30)
──────────────────────────────────────────────────
CPU: AMD EPYC 9654 96-Core Processor
──────────────────────────────────────────────────
Elapsed runtime: 1158 s (0.3 node-hours)
- GPU job
══════════════════════════════════════════════════
RRZ: Overview of gpu job xxxxxx resources (v39.30)
──────────────────────────────────────────────────
CPU: AMD EPYC 9334 32-Core Processor
GPU: NVIDIA H100 80GB HBM3
──────────────────────────────────────────────────
Elapsed runtime: 1277 s (0.4 node-hours)
Data amounts
The report shows how much data is moved from and to the file systems, and the amount of data that was transferred over the data communication network.
I/O data refers to whole nodes, but CPU, GPU, and memory to the job.
Total node filesystem I/O: 4.4 GiB
Total reads from NFS: 5.0 MiB
Total reads from /beegfs: 4.2 GiB
Total writes to /beegfs: 214.4 MiB
Total inter-node communication: 230.7 MiB
I/O explained by this job's tasks: 5.8 MiB
- The first line is a summary of the above paragraph Shared and non-shared resources.
- The second line gives the sum of the three subsequent lines.
- Lines 3 to 6 are self-explanatory.
- The last line is an exception: it refers to the job reported on (and not to whole nodes).
If the data amount printed in the last line exceeds the size of the job’s input and output files considerably, the job has a lot of scratch I/O. In this case it is worthwhile to check where scratch files are created. $TMPDIR or $SSD should be used when processing many small files.
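A sketch of how scratch I/O could be redirected to $TMPDIR (the program name, its options and the file names are placeholders; whether a scratch directory can be chosen depends on the application):
#!/bin/bash
#SBATCH --partition=std
# write the many small scratch files to the node-local $TMPDIR
# instead of the shared /beegfs file system
cd "$TMPDIR"
srun /path/to/my_program --scratch-dir "$TMPDIR" --input "$HOME/input.dat"
# copy only the final result back to the shared file system
cp result.dat "$SLURM_SUBMIT_DIR/"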
Resources per node
Until recently a node was the typical scheduling unit on HPC clusters. Because Hummel-2 has huge nodes (16 cores per node on the predecessor versus 192 now), most jobs are expected to fit into a single node, i.e. resources per node translates to resources per job in many cases.
The following resources are listed in a table, one line per node:
- cpu: the number of CPU cores effectively used
- gpu: ratio Usage/Cap of the power Usage and the power capability Cap printed by nvidia-smi, averaged over the whole runtime
- mem: high water mark of memory used
- gmem: high water mark of GPU memory used
- swap: high water mark of swap space used (does not apply to Hummel-2)
- disk: average I/O bandwidth to local disks (does not apply to Hummel-2)
- iops: average IOPS rate to local disks (does not apply to Hummel-2)
- /beegfs: bandwidth of I/O to and from /beegfs, averaged over the whole runtime
- comm: data communication bandwidth, averaged over the whole runtime
Examples:
- CPU job
Average gross compute, I/O, communication load, maximum memory use:
 node   │ cpu   mem  swap  disk  iops  /beegfs  comm
        │       GiB   GiB  MiB/s  1/s    MiB/s  MiB/s
═══════╪════════════════════════════════════════════
 n121   │ 8.0   0.9   0.0     0     0        0   0.1
───────┼────────────────────────────────────────────
 sum: 1 │ 8.0   0.9   0.0     0     0        0   0.1
- GPU job
Average gross compute, I/O, communication load, maximum memory use:
 node   │ cpu   gpu   mem  gmem  swap  disk  iops  /beegfs  comm
        │             GiB   GiB   GiB  MiB/s  1/s    MiB/s  MiB/s
═══════╪═══════════════════════════════════════════════════════
 g003   │ 1.0   0.7   2.9   2.4   0.0     0     0        0   0.1
───────┼───────────────────────────────────────────────────────
 sum: 1 │ 1.0   0.7   2.9   2.4   0.0     0     0        0   0.1
CPU usage per core
This table contains these columns:
- socket: physical CPU socket (0 or 1)
- core: core number on that socket
- user: fraction of CPU time spent in user mode
- sys: fraction of CPU time spent in kernel mode
- iowait: fraction of time spent waiting for outstanding I/O requests
- idle: fraction of time the core was idle
Example:
Average CPU physical core usage, mean over all nodes:
 socket core │  user   sys  iowait  idle
═════════════╪═════════════════════════
    0     32 │  0.98  0.01    0.00  0.00
    0     33 │  0.98  0.01    0.00  0.00
    0     34 │  0.97  0.03    0.00  0.00
    0     35 │  0.99  0.00    0.00  0.00
    0     36 │  0.99  0.01    0.00  0.00
    0     37 │  0.98  0.02    0.00  0.00
    0     38 │  0.97  0.03    0.00  0.00
    0     39 │  0.99  0.00    0.00  0.00
─────────────┼─────────────────────────
 sum: 1    8 │  7.85  0.11    0.00  0.00
From the table one can learn:
- whether all cores were used
- whether the CPU utilisation is high
- how balanced the load was
- which cores were allocated
- how the cores were distributed over the sockets
Recall that high CPU utilisation is a necessary but not sufficient condition for good performance. In a multi-processor program, for example, processes or threads can be in busy-waiting states, where CPUs are busy but no work is done.
GPU usage per GPU
This table contains information from nvidia-smi:
- device: number of the GPU in the node
- power load: power usage divided by maximal power
- GPU load: fraction of time during which one or more kernels were executing on the GPU
- memory load: fraction of time during which global (device) memory was being read or written
- memory use: high water mark of GPU memory used
Good example:
Average GPU load over all nodes, maximum GPU memory use:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      7 │       0.59 │     0.93 │        0.76 │       30.4
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.59 │     0.93 │        0.76 │       30.4
The GPU load should be high. As with CPUs, this does not tell the whole story, because the GPU counts as utilised even if only one core is working, yet the GPUs of Hummel-2 have 16896 FP32 CUDA cores each. The power load gives a good indication of how much of the hardware was actually utilised, although it will be hard to reach 0.8 or higher. In the idle state the power load is roughly 10%. If the power load is not significantly higher than 0.1, one must conclude that the GPU is clearly under-utilised, see the following bad examples, where using CPUs instead of a GPU would have been more economical:
Bad example 1:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      1 │       0.10 │     0.00 │        0.32 │       60.1
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.10 │     0.00 │        0.32 │       60.1
Bad example 2:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      1 │       0.10 │     0.13 │        0.00 │       60.5
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.10 │     0.13 │        0.00 │       60.5
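To cross-check such numbers while a job is still running, the GPU can also be polled directly on the allocated node, for example with nvidia-smi (a sketch; the job ID is a placeholder and the 5-second sampling interval is arbitrary):
# open a shell on a node of the running job
srun --jobid=<jobID> --overlap --pty bash
# sample GPU utilisation, power draw, power limit and memory use every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit,memory.used --format=csv -l 5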
Resource usage per command
This table shows the commands that used the most resources and how often they were called. In many cases this is a single command, but there can be quite a few commands in the list. In any case one can check for commands generating overhead.
The meaning of the columns is:
- cpu, mem and io: percentage of the overall (job) resource usage by all calls of this command
- maxrss and maxvm: high water mark of real/virtual memory usage of all calls of this command
- tasks and procs: how often this command was called
- command: command name
Net resource usage per command:
 - CPU, memory, I/O relative to whole CPU, memory, task I/O
 - maximum RSS and virtual memory
 - task (thread) and process counts
  cpu   mem    io │ maxrss  maxvm │ tasks  procs │ command
    %     %     % │    GiB    GiB │              │
═════════════════╪═══════════════╪══════════════╪════════════════
 99.9  98.5  52.2 │    0.2    0.4 │    28      8 │ a.out
Summary
The summary gives a quick overview and provides hints for action. It appears at the end of the report because this position makes it easy to find.
Examples:
- CPU job
Summary:
Elapsed time:        7% (0.2 out of 3.0 h timelimit)
CPU:               100% (8.0 out of 8 physical CPU cores)
Max. main memory:    3% (0.9 out of 31.3 GiB min. available per node)
──────────────────
RRZ: End Of Report
- GPU job (GPU: gives the power load, see GPU usage per GPU)
Summary:
Elapsed time:       14% (1.2 out of 9.0 h timelimit)
GPU:                59% (0.6 out of 1 GPUs)
CPU:                12% (1.0 out of 8 physical CPU cores)
Max. main memory:    1% (1.3 out of 141.1 GiB min. available per node)
──────────────────
RRZ: End Of Report
Primary calls for improvement are:
- CPU jobs: CPU utilisation below 80%
- GPU jobs: GPU load below 40%
Other desirable actions are:
- Memory utilisation tells whether fewer resources (fewer cores, fewer nodes or smaller nodes) could be used.
- If the elapsed time is small compared with the time limit, the latter can and should be lowered. Shorter time limits increase the probability of finding a free slot for a job, which improves the overall throughput (cf. backfill scheduling); see the sketch below.
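A sketch of how the time limit could be adjusted in the job script (assuming a measured elapsed time of about 0.2 h, as in the CPU job summary above; the 30-minute margin is an arbitrary choice):
# request a time limit with a modest margin above the measured elapsed time
# instead of the 3 h limit used in the example above
#SBATCH --time=00:30:00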