RRZ tools
rrz-batch-jobreport
The command rrz-batch-jobreport displays information for a batch job with a given job ID:
rrz-batch-jobreport jobID
Resources shown are: CPU, memory, disk, network and possibly GPU and GPU-memory.
rrz-batch-jobreport can be called for completed as well as for running batch jobs. At the end of every batch job a report is written to the job’s stderr file automatically. Users should at least check the job report summaries regularly.
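A minimal usage sketch (the job ID below is a placeholder, and the stderr file name assumes SLURM’s default output naming):
# inspect a running or completed job by its job ID
rrz-batch-jobreport 1234567
# the automatic report of a finished job ends up in the job’s stderr file,
# e.g. with the default naming scheme:
less slurm-1234567.out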
Overview
- Motivation
- Limitations
- Shared and non-shared resources
- Taking advantage of job reports
- The job report explained
Motivation
Motivations for checking resource usage are
- to improve resource utilisation (in particular of CPUs, GPUs and RAM),
- to find performance bottlenecks (in particular for I/O).
Limitations
Job reports must not be confused with performance reports: they only check utilisation, not performance. The goal is to achieve good performance. Good utilisation is a necessary but not sufficient condition for performance: performance can be low even if CPU or GPU utilisation looks good. For completeness it should be mentioned that even performance is not the ultimate criterion. The latter is the shortest possible execution time (and today also low energy consumption). This requires considering algorithms in the first place: shorter execution times might be achieved with a better algorithm even though its performance (measured in operations per second) might be lower.
It is important to know that parallel performance (good scaling) is not guaranteed if all CPU cores are busy. Therefore, explicit scaling tests are mandatory.
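A minimal sketch of such a scaling test (script name, core counts and job names are placeholders; adapt them to your application):
# submit the same job with increasing core counts and compare the elapsed times
for n in 1 2 4 8 16; do
    sbatch --cpus-per-task=$n --job-name=scale_$n my_job.sh
done
# good scaling: doubling the core count roughly halves the elapsed time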
Another limitation is the reproducibility of time measurements. Time measurements are only reasonably reproducible if the hardware is used exclusively, which, in general, is not the case on a cluster where hardware is shared with other users. Nonetheless, parts of the hardware can be considered non-shared.
Shared and non-shared resources
rrz-batch-jobreport collects data from the operating system on a per-node basis (in contrast to per-job). For jobs that use(d) full nodes, all resources of the nodes used are non-shared, i.e. all data displayed by rrz-batch-jobreport is data of the corresponding job. For smaller jobs, which share(d) a node with other jobs, it is important to keep in mind that some numbers shown by rrz-batch-jobreport do not apply to the job but rather to the whole node.
Shared resources
These resources are always shared (unless the whole cluster is used exclusively):
- network
- disks of distributed file systems
Non-shared resources
Typically, a node of a cluster is a non-shared resource, i.e. non-shared resources can be:
- CPU and memory (RAM)
- GPU and GPU-memory
- local disks (not available on Hummel-2)
Hummel-2 is configured in such a way that CPU, GPU and memory can in practice be considered to be non-shared resources. Technically this is achieved by not sharing the largest data caches and by putting batch jobs into cgroups.
Taking advantage of job reports
Job reports can help to improve machine utilisation in the following ways:
CPU and GPU utilisation. CPUs and GPUs are the most expensive components of an HPC system. Therefore, one should strive for high CPU or GPU utilisation, respectively. There are three main reasons for under-utilisation (a sketch for the first case follows the list):
- some CPUs/GPUs are not used at all (typically as a consequence of a bad job specification)
- CPUs/GPUs are waiting for disk I/O operations (reasons can be that disks are heavily used by other users or that the program itself has an I/O bottleneck)
- there is not enough parallel work to fully utilise the CPU/GPU (recall that both are parallel computing devices by themselves)
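The first case, a bad job specification, is often a mismatch between the cores requested from SLURM and the threads or processes the program actually starts. A minimal sketch for an OpenMP program (the script contents and program name are placeholders, not the recommended template for Hummel-2):
#!/bin/bash
#SBATCH --cpus-per-task=8                      # request 8 physical cores ...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # ... and actually start 8 threads
srun ./my_openmp_program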
Memory high water mark. Peak memory usage determines whether a program fits into the memory of a compute node or a GPU. If this value is known, smaller compute nodes can be used, if available, or more load could be put onto a GPU, if its compute capability is under-utilised, too. The value can also be used to estimate the maximal problem size that would fit into a given type of node or GPU.
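A rough back-of-the-envelope sketch, assuming memory grows roughly linearly with the problem size (this linear scaling is an assumption; check how your code actually behaves). Taking the numbers from the CPU job example further below (0.9 GiB peak, 31.3 GiB available per node share) and a hypothetical problem size of 1e6 grid points:
awk 'BEGIN { printf "%.1e grid points\n", 1e6 * 31.3 / 0.9 }'
# -> about 3.5e+07 grid points would fit into the same allocation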
Disk usage. The reports show how much data is read from and written to the disk systems. Unfortunately, the disk usage part is harder to understand than CPU, GPU and memory. Nonetheless, it should be possible to figure out whether a job is slowed down by heavy I/O. Note that I/O to the virtual file systems of the compute nodes (i.e. /tmp and /dev/shm) is not counted as I/O by the operating system (because it is memory traffic) and hence cannot appear in the report.
Communication network traffic influences the compute performance of multi-node jobs and I/O to distributed file systems.
The job report explained
In the following, the job report is explained section by section. Differences for GPU and multi-node job reports are explained as well.
Header
The header contains the following information: SLURM partition (std or gpu in the examples below), job ID, CPU and GPU type, elapsed (wall clock) time, and machine usage in node-hours.
- CPU job
══════════════════════════════════════════════════
RRZ: Overview of std job xxxxxx resources (v39.30)
──────────────────────────────────────────────────
CPU: AMD EPYC 9654 96-Core Processor
──────────────────────────────────────────────────
Elapsed runtime: 1158 s (0.3 node-hours)
- GPU job
══════════════════════════════════════════════════
RRZ: Overview of gpu job xxxxxx resources (v39.30)
──────────────────────────────────────────────────
CPU: AMD EPYC 9334 32-Core Processor
GPU: NVIDIA H100 80GB HBM3
──────────────────────────────────────────────────
Elapsed runtime: 1277 s (0.4 node-hours)
Data amounts
The report shows how much data is moved from and to the file systems, and the amount of data that was transferred over the data communication network.
I/O data refers to whole nodes, but CPU, GPU, and memory to the job.
Total node filesystem I/O: 4.4 GiB
Total reads from NFS: 5.0 MiB
Total reads from /beegfs: 4.2 GiB
Total writes to /beegfs: 214.4 MiB
Total inter-node communication: 230.7 MiB
I/O explained by this job's tasks: 5.8 MiB
- The first line is a summary of the above paragraph Shared and non-shared resources.
- The second line gives the sum of the three subsequent lines.
- Lines 3 to 6 are self-explanatory.
- The last line is an exception: it refers to the job reported on (and not to whole nodes).
If the data amount printed in the last line exceeds the size of the job’s input and output files considerably, the job has a lot of scratch I/O. In this case it is worthwhile to check where scratch files are created. $TMPDIR or $SSD should be used when processing many small files.
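A sketch of how scratch I/O could be redirected to $TMPDIR (the program name, its options and the file names are placeholders; whether a scratch directory can be chosen depends on the application):
#!/bin/bash
#SBATCH --partition=std
# write the many small scratch files to the node-local $TMPDIR
# instead of the shared /beegfs file system
cd "$TMPDIR"
srun /path/to/my_program --scratch-dir "$TMPDIR" --input "$HOME/input.dat"
# copy only the final result back to the shared file system
cp result.dat "$SLURM_SUBMIT_DIR/"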
Resources per node
Until recently a node was the typical scheduling unit on HPC clusters. Because Hummel-2 has huge nodes (16 cores per node on the predecessor versus 192 now), most jobs are expected to fit into a single node, i.e. resources per node translates to resources per job in many cases.
The following resources are listed in a table, one line per node:
- cpu: the number of CPU cores effectively used
- gpu: ratio Usage/Cap of the power Usage and the power capability Cap printed by nvidia-smi, averaged over the whole runtime
- mem: high water mark of memory used
- gmem: high water mark of GPU memory used
- swap: high water mark of swap space used (does not apply to Hummel-2)
- disk: average I/O bandwidth to local disks (does not apply to Hummel-2)
- iops: average IOPS rate to local disks (does not apply to Hummel-2)
- /beegfs: bandwidth of I/O to and from /beegfs, averaged over the whole runtime
- comm: data communication bandwidth, averaged over the whole runtime
Examples:
- CPU job
Average gross compute, I/O, communication load, maximum memory use:
 node   │ cpu   mem  swap  disk  iops  /beegfs  comm
        │       GiB   GiB  MiB/s  1/s    MiB/s  MiB/s
═══════╪════════════════════════════════════════════
 n121   │ 8.0   0.9   0.0     0     0        0   0.1
───────┼────────────────────────────────────────────
 sum: 1 │ 8.0   0.9   0.0     0     0        0   0.1
- GPU job
Average gross compute, I/O, communication load, maximum memory use:
 node   │ cpu   gpu   mem  gmem  swap  disk  iops  /beegfs  comm
        │             GiB   GiB   GiB  MiB/s  1/s    MiB/s  MiB/s
═══════╪═══════════════════════════════════════════════════════
 g003   │ 1.0   0.7   2.9   2.4   0.0     0     0        0   0.1
───────┼───────────────────────────────────────────────────────
 sum: 1 │ 1.0   0.7   2.9   2.4   0.0     0     0        0   0.1
CPU usage per core
This table contains these columns:
- socket: physical CPU socket (0 or 1)
- core: core number on that socket
- user: fraction of CPU time spent in user mode
- sys: fraction of CPU time spent in kernel mode
- iowait: fraction of time spent waiting for outstanding I/O requests
- idle: fraction of time the core was idle
Example:
Average CPU physical core usage, mean over all nodes:
 socket core │  user   sys  iowait  idle
═════════════╪═════════════════════════
    0     32 │  0.98  0.01    0.00  0.00
    0     33 │  0.98  0.01    0.00  0.00
    0     34 │  0.97  0.03    0.00  0.00
    0     35 │  0.99  0.00    0.00  0.00
    0     36 │  0.99  0.01    0.00  0.00
    0     37 │  0.98  0.02    0.00  0.00
    0     38 │  0.97  0.03    0.00  0.00
    0     39 │  0.99  0.00    0.00  0.00
─────────────┼─────────────────────────
 sum: 1    8 │  7.85  0.11    0.00  0.00
From the table one can learn:
- whether all cores were used
- whether the CPU utilisation is high
- how balanced the load was
- which cores were allocated
- how the cores were distributed over the sockets
Recall that high CPU utilisation is a necessary but not sufficient condition for good performance. In a multi-processor program, for example, processes or threads can be in busy-waiting states, where CPUs are busy but no work is done.
GPU usage per GPU
This table contains information from nvidia-smi:
- device: number of the GPU in the node
- power load: power usage divided by maximal power
- GPU load: fraction of time during which one or more kernels were executing on the GPU
- memory load: fraction of time during which global (device) memory was being read or written
- memory use: high water mark of GPU memory used
Good example:
Average GPU load over all nodes, maximum GPU memory use:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      7 │       0.59 │     0.93 │        0.76 │       30.4
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.59 │     0.93 │        0.76 │       30.4
The GPU load should be high. As with CPUs, this does not tell the whole story, because the GPU counts as utilised even if only one core is working, yet the GPUs of Hummel-2 have 16896 FP32 CUDA cores each. The power load gives a good indication of how much of the hardware was actually utilised, although it will be hard to reach 0.8 or higher. In the idle state the power load is roughly 10%. If the power load is not significantly higher than 0.1, one must conclude that the GPU is clearly under-utilised, see the following bad examples, where using CPUs instead of a GPU would have been more economical:
Bad example 1:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      1 │       0.10 │     0.00 │        0.32 │       60.1
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.10 │     0.00 │        0.32 │       60.1
Bad example 2:
 device │ power load │ GPU load │ memory load │ memory use
        │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
      1 │       0.10 │     0.13 │        0.00 │       60.5
───────┼────────────┼──────────┼─────────────┼───────────
 sum: 1 │       0.10 │     0.13 │        0.00 │       60.5
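To cross-check such numbers while a job is still running, the GPU can also be polled directly on the allocated node, for example with nvidia-smi (a sketch; the job ID is a placeholder and the 5-second sampling interval is arbitrary):
# open a shell on a node of the running job
srun --jobid=<jobID> --overlap --pty bash
# sample GPU utilisation, power draw, power limit and memory use every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit,memory.used --format=csv -l 5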
Resource usage per command
This table shows the commands that used the most resources and how often they were called. In many cases this is a single command, but there can be quite a few commands in the list. In any case one can check for commands generating overhead.
The meaning of the columns is:
- cpu, mem and io: percentage of the overall (job) resource usage by all calls of this command
- maxrss and maxvm: high water mark of real/virtual memory usage of all calls of this command
- tasks and procs: how often this command was called
- command: command name
Net resource usage per command:
 - CPU, memory, I/O relative to whole CPU, memory, task I/O
 - maximum RSS and virtual memory
 - task (thread) and process counts
  cpu   mem    io │ maxrss  maxvm │ tasks  procs │ command
    %     %     % │    GiB    GiB │              │
═════════════════╪═══════════════╪══════════════╪════════════════
 99.9  98.5  52.2 │    0.2    0.4 │    28      8 │ a.out
Summary
The summary gives a quick overview and provides hints for action. It appears at the end of the report because this position makes it easy to find.
Examples:
- CPU job
Summary:
Elapsed time:        7% (0.2 out of 3.0 h timelimit)
CPU:               100% (8.0 out of 8 physical CPU cores)
Max. main memory:    3% (0.9 out of 31.3 GiB min. available per node)
──────────────────
RRZ: End Of Report
- GPU job (GPU: gives the power load, see GPU usage per GPU)
Summary:
Elapsed time:       14% (1.2 out of 9.0 h timelimit)
GPU:                59% (0.6 out of 1 GPUs)
CPU:                12% (1.0 out of 8 physical CPU cores)
Max. main memory:    1% (1.3 out of 141.1 GiB min. available per node)
──────────────────
RRZ: End Of Report
Primary calls for improvement are:
- CPU jobs: CPU utilisation below 80%
- GPU jobs: GPU load below 40%
Other desirable actions are:
- Memory utilisation tells whether fewer resources (fewer cores, fewer nodes or smaller nodes) could be used.
- If the elapsed time is small compared with the time limit, the latter can and should be lowered. Shorter time limits increase the probability of finding a free slot for a job, which improves the overall throughput (cf. backfill scheduling); see the sketch below.
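A sketch of how the time limit could be adjusted in the job script (assuming a measured elapsed time of about 0.2 h, as in the CPU job summary above; the 30-minute margin is an arbitrary choice):
# request a time limit with a modest margin above the measured elapsed time
# instead of the 3 h limit used in the example above
#SBATCH --time=00:30:00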