RRZ tools
rrz-batch-use
Note from the outset that the data shown are collected at a sampling interval, so values can be expected to change within minutes rather than seconds.
The rrz-batch-use command shows resource usage data for all of the user's batch jobs that are currently executing. In the following example, the columns are explained in the section Table columns; GPU jobs have two additional columns, for GPU usage and GPU memory usage:
    job     │ jobname │ part │ node    │ cpu  │ mem │ net    │ runtime │ endtime
            │         │      │         │      │ %   │ MiB/s  │ h       │ h
    ════════╪═════════╪══════╪═════════╪══════╪═════╪════════╪═════════╪════════
    1444257 │ test    │ all  │ node320 │ 16.0 │ 4   │ 1033.7 │ 0.0     │ 0.2
    1444257 │ test    │ all  │ node321 │ 16.0 │ 4   │ 1071.9 │ 0.0     │ 0.2
    1444257 │ test    │ all  │ node322 │ 16.0 │ 4   │ 1090.3 │ 0.0     │ 0.2
    1444257 │ test    │ all  │ node323 │ 15.9 │ 4   │ 1100.8 │ 0.0     │ 0.2
Technically, rrz-batch-use is a wrapper for rrz-cluster-batchstate. The latter was written as a system administration tool for inspecting many parameters (see section Table columns). As a user tool, rrz-batch-use is intended for quick checks of resource usage. Resource usage is explained in more detail in the description of the rrz-batch-jobreport command.
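For jobs spanning several nodes it can be convenient to post-process the table. The following is a minimal Python sketch, not part of the RRZ tools, that parses text laid out like the example above and sums the net column per job. The function names parse_batch_use and net_per_job are hypothetical, and the sketch assumes that the real output uses the same `│` column separators and `═`/`╪` rule line as the example.

```python
# Sample text copied from the example output above (layout is an assumption).
SAMPLE = """\
job     │ jobname │ part │ node    │ cpu  │ mem │ net    │ runtime │ endtime
        │         │      │         │      │ %   │ MiB/s  │ h       │ h
════════╪═════════╪══════╪═════════╪══════╪═════╪════════╪═════════╪════════
1444257 │ test    │ all  │ node320 │ 16.0 │ 4   │ 1033.7 │ 0.0     │ 0.2
1444257 │ test    │ all  │ node321 │ 16.0 │ 4   │ 1071.9 │ 0.0     │ 0.2
1444257 │ test    │ all  │ node322 │ 16.0 │ 4   │ 1090.3 │ 0.0     │ 0.2
1444257 │ test    │ all  │ node323 │ 15.9 │ 4   │ 1100.8 │ 0.0     │ 0.2
"""


def parse_batch_use(text):
    """Parse rrz-batch-use style output into a list of dicts, one per row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The rule line consisting only of '═' and '╪' ends the header rows.
    rule = next(i for i, ln in enumerate(lines)
                if set(ln.strip()) <= {"═", "╪"})
    # The first header row holds the column names; the units row is skipped.
    names = [cell.strip() for cell in lines[0].split("│")]
    rows = []
    for ln in lines[rule + 1:]:
        cells = [cell.strip() for cell in ln.split("│")]
        rows.append(dict(zip(names, cells)))
    return rows


def net_per_job(rows):
    """Sum the 'net' column (MiB/s) over all nodes of each job."""
    totals = {}
    for row in rows:
        totals[row["job"]] = totals.get(row["job"], 0.0) + float(row["net"])
    return totals


rows = parse_batch_use(SAMPLE)
for job, total in net_per_job(rows).items():
    print(f"job {job}: {total:.1f} MiB/s over {len(rows)} nodes")
```

In practice the text would come from running the command itself, for example by capturing its standard output, rather than from a hard-coded string.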
Table columns
Table columns can be added with the --fields option, which can be abbreviated as -F. Example (adding disk operations):
    rrz-batch-use --fields=dsk.ops
    rrz-batch-use -F=dsk.ops
    rrz-batch-use -F dsk.ops
The following fields can be selected with --fields:
Field | Meaning |
---|---|
cpu | accumulated CPU activity (should be close to the number of CPU cores; can be up to twice the number of CPU cores if hyper-threads are used) |
cpu.ctx | number of context switches |
cpu.frk | number of forks |
cpu.idl | fraction of time during which CPUs are idle |
cpu.iow | fraction of time during which CPUs are waiting for outstanding I/O requests |
cpu.phy | number of CPU cores |
cpu.sys | fraction of time during which CPUs are running in system (kernel) mode |
cpu.thr | number of CPU hyperthreads |
cpu.usr | fraction of time during which CPUs are working in user mode |
dsk.dat | disk bandwidth |
dsk.ops | disk operations per second |
dsk.rdat | disk read bandwidth |
dsk.rops | disk read operations per second |
dsk.wdat | disk write bandwidth |
dsk.wops | disk write operations per second |
endtime | time left until time limit is reached (in hours) |
endtime_fancy | time left until time limit is reached (date) |
gpu.gpu_load | fraction of time during which kernels are executing |
gpu.mem | fraction of device memory used |
gpu.mem_load | fraction of time during which device memory is being read or written |
gpu.phy | number of GPUs (CUDA devices) |
gpu.pow_load | power consumption as a fraction of the power maximum |
gpu.pow_watt | power consumption in watts |
hog.bin | binary using most CPU time |
hog.cmd | complete path to hog binary |
hog.cnt | number of instances of hog binary |
hog.cpu | fraction of CPU time used by hog binary |
hog.usr | user name running hog binary |
hst.uptime | host uptime |
hst.walltime | current time (system clock) on host |
job | job ID |
jobname | job name |
jobpart | partition name |
mem.phy | fraction of main memory used |
mem.swap | fraction of swap space used |
net.dat | data communication network bandwidth |
net.ops | data communication network packets per second |
net.rdat | data communication network receive bandwidth |
net.rops | data communication network received packets per second |
net.wdat | data communication network send bandwidth |
net.wops | data communication network sent packets per second |
node | host name |
res | reservation name |
runtime | wall clock time used (in hours) |
runtime_fancy | wall clock time used (date) |
state | B(usy), D(own), F(ree), O(ffline) or R(eserved) |
user | user name |
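
The description of the cpu field suggests a simple plausibility check: compare the accumulated CPU activity with the number of CPU cores (cpu.phy). The Python sketch below illustrates this. The function name classify_cpu, the 10 % tolerance and the 16 cores per node used in the example calls are assumptions for illustration and are not taken from the RRZ tools.

```python
def classify_cpu(cpu, cores, tolerance=0.1):
    """Compare accumulated CPU activity with the number of CPU cores.

    cpu       -- value of the 'cpu' field
    cores     -- value of the 'cpu.phy' field
    tolerance -- assumed relative margin for "close to" (hypothetical)
    """
    if cpu < cores * (1.0 - tolerance):
        return "below"   # fewer busy CPUs than cores: cores may be idle
    if cpu <= cores * (1.0 + tolerance):
        return "near"    # close to the number of cores, as expected
    return "above"       # more than the cores: hyper-threads are in use


# Values from the example table, assuming 16 cores per node.
for cpu in (16.0, 15.9):
    print(cpu, classify_cpu(cpu, cores=16))
```

A result of "below" is only a hint that a job may not be using all allocated cores; whether that is a problem depends on the application.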