RRZ Tools
rrz-batch-jobreport
After completion of a batch job the command
rrz-batch-jobreport job-ID
displays information on the resource usage of the specified job. Resources listed are: CPU, memory, disk, network, and possibly GPU and GPU-memory.
Users are asked to check at least job report summaries regularly.
Resource usage of running jobs can be checked with the rrz-batch-use command.
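A minimal invocation sketch (the job ID below is a placeholder, not a real job):

# display the job report for a completed job
rrz-batch-jobreport 1234567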
Contents
- Motivation
- Example reports
- Taking advantage of job reports
- Limitation
- The job report explained (job report sections)
- Header
- Data amounts
- Per node resource usage
- Per core resource usage
- Per command resource usage
- Summary
- GPU job reports
- Options
Motivation
Motivations for checking resource usage are
- to improve resource utilisation,
- to find performance bottlenecks.
Example reports
Job report examples can be seen at:
Taking advantage of job reports
Job reports can help to improve machine utilisation in the following ways:
CPU utilisation. CPUs are the most expensive components of an HPC system. Therefore one should strive for high CPU utilisation. There are two main reasons for under-utilisation of CPUs:
- some CPUs are not used at all (typically as a consequence of a bad job specification; see the sketch after this list)
- CPUs are waiting for disk I/O operations (reasons can be that disks are heavily used by other users as well or that the program itself has an I/O bottleneck)
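As an illustration of the first point, a minimal SLURM job script sketch; the script contents, time limit and program name are assumptions for illustration, not RRZ defaults:

#!/bin/bash
#SBATCH --nodes=1         # one compute node
#SBATCH --ntasks=16       # one MPI task per physical core (usually 16 on Hummel)
#SBATCH --time=01:00:00   # hypothetical time limit

# start the MPI program on all 16 cores; a plain "./a.out" here would
# run on a single core and leave the other 15 physical cores unused
srun ./a.out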
Memory high water mark. Peak memory usage determines whether a program fits into the memory of a compute node. If this value is known, smaller compute nodes can be used, if available. The value can also be used to estimate the maximal problem size that would fit into a given type of node.
GPU and GPU-memory. The considerations for CPUs apply here, too.
Disk usage. The reports show how much data is read from and written to the disk systems. Average bandwidths and input/output operations per second (IOPS) are shown as well. This information can be used for finding possible I/O bottlenecks on the various file systems:
- swap is used automatically if a program is larger than main memory. Swapping activity can easily become a major source of performance degradation. On Hummel this is unlikely (but not impossible) because swap is small (such that a program is more likely to run out of memory before heavy swapping occurs).
- /work is the largest file system and provides the highest bandwidth. /work works well for streaming I/O (i.e. for accessing data in large consecutive chunks). It is not well suited for random I/O or for working with very many open files (both of which effectively lead to accessing data in small chunks). iowait times can indicate that a job stresses /work. In that case one should check whether /scratch ($RRZ_LOCAL_TMPDIR) can be used.
- /home and /sw are the slowest file systems; they are intended to provide binaries and scripts only. However, it has happened that some software read other data from there very many times, such that this became a bottleneck. The solution was to copy that data to the /scratch file system at the beginning of the job (using $RRZ_LOCAL_TMPDIR); see the sketch after this list.
- /scratch is the fastest but also the smallest file system. One can check how much space was used there.
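A minimal sketch of this copy-to-scratch pattern in a job script; the data path and the program's input option are placeholders, while $RRZ_LOCAL_TMPDIR is provided by the batch system:

# copy data that is read very many times to the fast node-local /scratch once
cp -r "$HOME/mydata" "$RRZ_LOCAL_TMPDIR/"

# let the program read from the local copy instead of /home or /sw
./a.out --input "$RRZ_LOCAL_TMPDIR/mydata"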
Data network traffic is unlikely to become a bottleneck. For single-node jobs it gives the average bandwidth to the /work file system. For multi-node jobs it is the sum of data communication traffic between nodes and I/O traffic to /work.
Limitation
Job reports must not be confused with performance reports. Job reports only check utilisation, not performance. The goal is to achieve good performance. Good utilisation is a necessary but not a sufficient condition for good performance: performance can be low even if CPU or GPU utilisation looks good. For completeness it should be mentioned that even performance is not the ultimate criterion. The ultimate criterion is the shortest possible execution time (and, today, also the lowest energy consumption), which requires considering algorithms in the first place. There is a similar paradox as before: shorter execution times might be achieved with a better algorithm even though its performance (measured in operations per second) might be lower.
It is important to know that parallel performance (good scaling) is not guaranteed if all CPU cores are busy. Therefore, explicit scaling tests are mandatory.
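A simple way to run such tests is to submit the same job at several node counts and compare the elapsed times in the resulting job reports; a minimal sketch (the script name is a placeholder):

# submit the same job script at increasing node counts
for n in 1 2 4 8; do
    sbatch --nodes=$n job.sh
done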
The job report explained
In this subsection the single-node job report example is explained section by section. Differences to multi-node job reports are explained as well. Additional entries for GPU jobs are explained in the subsection GPU job reports.
Header
The header contains the following information: SLURM partition (which is "std" in the example below), job ID, CPU type, elapsed (wall clock) time, and the number of node-hours.
═══════════════════════════════════════════════════
RRZ: Overview of std job xxxxxxx resources (v29.16)
───────────────────────────────────────────────────
CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
───────────────────────────────────────────────────
Elapsed runtime: 17341 s (4.8 node-hours)
Data amounts
The report shows virtual memory allocation, how much data is moved from and to the file systems, and the amount of data that was transferred over the data communication network.
Maximum virtual memory allocation on a node:   14.8 GiB
Maximum use of local scratch disk:             32.3 MiB
Total reads from /home /sw:                    10.9 MiB
Total reads from /work:                       339.9 GiB
Total writes to /work:                        173.4 GiB
Total inter-node communication:                27.6 GiB
For virtual memory allocation and the local scratch disk, high water marks are shown. The maximal virtual memory allocation is interesting to know if virtual memory is limited (in the past, allocating more memory than physically available was possible on Hummel). If the maximum use of the local scratch disk is close to the locally available disk space, this can hint at the cause of job failures.
The next three lines show how much data is read from and written to /home, /sw, and /work. It is expected that only binary programs or scripts are read from /home and /sw, which would amount to data volumes in the MiB range. Considerably larger data volumes can result in an I/O bottleneck. Data amounts on /work are the sum of regular in- and output data and scratch I/O. The amount of scratch I/O is typically unknown but can be calculated from the numbers given in the report.
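For example, with assumed numbers: if the regular input files read from /work are known to total about 40 GiB and the regular output about 10 GiB, the report above implies roughly 339.9 − 40 ≈ 300 GiB of scratch reads and 173.4 − 10 ≈ 163 GiB of scratch writes.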
The raw inter-node communication data aggregates all data traffic, including data communication (message passing) and I/O to /work. The total inter-node communication shown in the report is the raw data amount with I/O to /work subtracted. By definition, a single-node job does not communicate data to other nodes; as a consequence, the total inter-node communication should be zero. As can be seen in the example above, this is not the case in general. (At this point it should be mentioned that it is not known how accurate the numbers in the reports are. However, they are expected to be accurate enough to detect sources of inefficient resource utilisation.) For multi-node jobs that run real parallel programs, the total inter-node communication is at least in the TiB range.
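Expressed as a formula: total inter-node communication = raw network traffic − I/O traffic to /work. Applied to the example above (assuming both reads from and writes to /work travel over the data network), the reported 27.6 GiB would correspond to a raw amount of about 27.6 + 339.9 + 173.4 ≈ 540.9 GiB.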
Per node resource usage
The table that follows lists per node: CPU utilisation, high water marks of memory and swap usage, bandwidth and IOPS of the local disk (which contains swap and /scratch), and average bandwidths to the /work file system and of data communication to other nodes. Below, a table for a single-node job is shown; for a multi-node job see the multi-node job report example. For multi-node jobs the table can indicate load imbalances.
Average gross compute, I/O, communication load, maximum memory use:
   node │  cpu   mem  swap   disk  iops  /work   comm
        │        GiB   GiB  MiB/s   1/s  MiB/s  MiB/s
════════╪════════════════════════════════════════════
node033 │ 15.3  14.1   0.0      0     0     30    1.6
────────┼────────────────────────────────────────────
sum:  1 │ 15.3  14.1   0.0      0     0     30    1.6
The value for CPU utilisation includes hyper-thread utilisation. The maximal value on Hummel, where hyper-threading is enabled, is twice the number of physical cores. An indicator of good CPU utilisation is a value that is close to the number of physical cores (i.e. usually 16, and 40 in partition spc). Note that a value of 16 would also result if only half of the cores were used with both hyper-threads busy on those cores. A priori, best performance is expected when all physical cores are used. Utilisation of individual cores can be checked with the table explained in the next subsection.
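As a small worked example: on a standard node with 16 physical cores and hyper-threading the maximum is 2 × 16 = 32, so the 15.3 in the table above corresponds to roughly 96% of the physical-core count, while a value near 32 would mean that the hyper-threads were busy as well.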
The third column displays the memory high water mark (per node).
The data in columns 4 to 8 can be used to gain some understanding of the I/O and inter-node communication behaviour.
Per core resource usage
This table shows CPU utilisation per physical core. Because hyper-threading is enabled, the values in each row sum up to 2 (up to rounding errors; e.g. in the first row of the example below: 1.11 + 0.04 + 0.00 + 0.84 = 1.99). For multi-node jobs each line contains averages over the nodes for the physical core specified in columns 1 and 2. The table can be used to check that most CPU time is spent in user mode, and whether the load is balanced at the core level.
Columns 3 to 6 contain the well known Unix/Linux CPU statistics, i.e. CPU utilisation in user and system modes, waiting times for outstanding I/O requests, and idle time. Recall that one cannot conclude that there is no I/O bottleneck if iowait is small.
Average CPU physical core usage, mean over all nodes:
socket  core │  user   sys  iowait   idle
═════════════╪═══════════════════════════
     0     0 │  1.11  0.04    0.00   0.84
     0     1 │  1.02  0.02    0.00   0.95
     0     2 │  0.98  0.02    0.00   1.00
     0     3 │  0.95  0.02    0.00   1.04
     0     4 │  0.91  0.02    0.00   1.07
     0     5 │  0.89  0.01    0.00   1.10
     0     6 │  0.90  0.02    0.00   1.10
     0     7 │  0.89  0.02    0.00   1.10
     1     0 │  0.94  0.02    0.00   1.05
     1     1 │  0.93  0.01    0.00   1.05
     1     2 │  0.91  0.01    0.00   1.08
     1     3 │  0.92  0.01    0.00   1.07
     1     4 │  0.90  0.01    0.00   1.09
     1     5 │  0.91  0.01    0.00   1.08
     1     6 │  0.93  0.01    0.00   1.05
     1     7 │  0.94  0.01    0.00   1.04
─────────────┼───────────────────────────
sum:   2  16 │ 15.05  0.23    0.00  16.69
Per command resource usage
This table shows the commands that used the most resources and how often they were called (in the columns tasks and procs). In many cases this will be just a single command.
Net resource usage per command:
- CPU, memory, filesystem I/O relative to whole-job CPU, memory, disk+net I/O
- maximum RSS and virtual memory
- task (thread) and process counts
  cpu   mem    io │ maxrss  maxvm │ tasks  procs │ command
    %     %     % │    GiB    GiB │              │
══════════════════╪═══════════════╪══════════════╪════════
 95.4  84.0  32.3 │   11.0   16.6 │ 14211      1 │ a.out
  4.1   6.1  67.2 │   13.5   15.3 │   237      5 │ b.out
Summary
The summary gives a quick overview and provides hints for action. It appears at the end of the report because this position makes it easy to find, even when the report appears at the end of a log file.
Summary:
Elapsed time:      40% (4.8 out of 12.0 h timelimit)
CPU:               94% (15.1 out of 16 physical CPU cores)
Hyperthreads:       1% (0.2 out of 16 CPU hyperthreads)
Max. main memory:  23% (14.1 out of 62.0 GiB min. available per node)
Max. swap:          0% (0.0 out of 2.0 GiB min. available per node)
──────────────────
RRZ: End Of Report
Possible actions are:
- If the elapsed time is small compared with the time limit, the latter can and should be lowered. Shorter time limits increase the probability of finding a free slot for a job, which improves the overall throughput (see the sketch after this list).
- Low CPU utilisation is a primary call for improvement.
- Memory utilisation can show that smaller nodes, or fewer nodes, would suffice.
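For example, based on the summary above (4.8 h elapsed against a 12 h limit), the time limit in the job script could be tightened; the value below is an assumption that leaves some safety margin:

#SBATCH --time=06:00:00   # was 12:00:00; the job needed only 4.8 h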
GPU job reports
Reports of jobs that ran in the gpu partition contain additional information on GPU usage. The primary goal is to check that both GPUs (CUDA devices) were used.
- In the header section of the report the GPU model is printed.
- Data amounts are unchanged.
- The per node table contains GPU and GPU-memory utilisation.
- The per core table is unchanged.
- A per CUDA device table follows
where power load is the fraction of the maximal power usage, GPU load is the fraction of time during which kernels were executing on the GPU, memory load is the fraction of time during which device memory was being read or written, and memory use is the GPU-memory high water mark.
device │ power load │ GPU load │ memory load │ memory use
       │            │          │             │        GiB
═══════╪════════════╪══════════╪═════════════╪═══════════
     0 │       0.77 │     0.99 │        0.58 │       10.1
     1 │       0.79 │     1.00 │        0.22 │       10.1
───────┼────────────┼──────────┼─────────────┼───────────
sum: 2 │       1.56 │     1.99 │        0.79 │       20.2
- The per command table is unchanged.
- In the summary, GPU utilisation and the GPU-memory high water mark are repeated. Here, GPU utilisation is the average power load from the table above.
Options
The rrz-batch-jobreport command has one option: --regen. It regenerates the report from the raw data. Normally, the command shows the report that was generated automatically when the job ended. In particular for older jobs, the report generator may have been improved in the meantime; without --regen one gets the original report without those improvements.
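For illustration (the job ID is a placeholder and the option order is an assumption):

# regenerate the report from raw data instead of showing the stored one
rrz-batch-jobreport --regen 1234567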