Working with data
This page explains the disk and file systems available on Hummel-2 and their usage, as well as general topics related to data handling:
- Hummel-2 storage concept
- Data security and safety
- I/O (input/output) performance basics
- Which file system to use when?
- Temporary/scratch files
- Disk quotas
- Data transfer
- Sharing files
Hummel-2 storage concept
On Linux PCs the /home directory usually is the only place where users keep their data. Because of the total amount of data involved, this would not be economical, or even technically feasible, on an HPC cluster. A technical limitation is the amount of data that can be backed up on a daily basis. There are sophisticated disk systems that combine all desirable properties
- data safety
- high performance
- large capacity
in a single system, but these are expensive. Hummel-2 is instead equipped with disk systems that each have essentially only one of these three properties.
Data security and safety
Data security
On Hummel-2 there are no security measures beyond standard Unix-like/POSIX file access permissions.
Data safety
Data can get lost as a consequence of hardware failures or human mistakes. Data backup protects against both. However, data can still get lost in the time between data creation and the next backup run. Typically, RAID technology is employed to protect against disk failures. However, if too many disks in a RAID group fail at the same time, data can get lost, too.
It is important to know the data safety properties of the available file systems:
- Backups are only made of files in /home. The backup frequency is nightly.
- /home and /usw have redundant disks at the RAID-1 level.
- /beegfs has redundant disks in a Declustered RAID (dRAID).
I/O (input/output) performance basics
Software characteristics
Sequential access (accessing data in a contiguous manner, also called streaming I/O) is characterized by the measured bandwidth, i.e. the amount of data read or written per second. For achieving high bandwidth it is advantageous to read/write data in large chunks (e.g. 1 MB).
Random access is characterized by I/O operations per second (IOPS) for small data chunks (e.g. 4096 bytes).
Metadata performance is characterized by the number of file operations per second, e.g. the number of file creations or deletions per second.
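For illustration, streaming write bandwidth can be estimated with a simple dd run (a minimal sketch; the target path under $BEEGFS is an assumption, and not every file system supports O_DIRECT):

```bash
# Write 1 GiB in 1 MiB chunks; dd reports the achieved bandwidth at the end.
# oflag=direct bypasses the page cache so the disk system itself is measured
# (drop the flag if the file system does not support O_DIRECT).
dd if=/dev/zero of=$BEEGFS/ddtest bs=1M count=1024 oflag=direct
rm $BEEGFS/ddtest   # remove the test file afterwards
```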
Hardware characteristics
Spinning/Hard disk drives (HDD) are considerably cheaper than SSDs. Systems of HDDs are powerful enough for streaming I/O which is the I/O pattern adopted in classic computer simulation software. However, their IOPS are fairly limited and they are not well suited for accessing very many small files.
Individual solid-state drives (SSD/NVMe) are much more powerful than individual HDDs. They are ideal for processing data in small chunks and for working with small files. When attached via NVMe extreme IOPS can be obtained.
I/O does not scale
Scalability is an important property of a system, like an HPC cluster, that is built from many components. Scalability means that N times more hardware delivers roughly N times more performance. While this is the case for some components, in particular for compute nodes, the situation is not so simple for I/O hardware. First, the performance of an individual disk does not scale with its capacity: typically a large disk has approximately the same performance as a smaller one. Second, only the bandwidth scales with the number of disks employed, while IOPS do not scale well; e.g. two disks working together in a single file system will approximately double the bandwidth, but IOPS will remain roughly constant (in our experience).
Which file system to use when?
/home ($HOME) and /usw ($USW)
/home and /usw are actually two partitions on the same RAID-1 pair of SSDs. They are the smallest and slowest file systems of Hummel-2. Data should not be stored here but rather on one of the other file systems.
- The /home file system is backed up every night.
- /home should be used for files like shell scripts or program code that you write yourself. The idea is to provide a reasonable level of safety for results from your personal time (in contrast to time the computer is working for you).
- /usw ("User-installed SoftWare") is not backed up. It is a separate file system to avoid backing up large software packages that can easily be re-installed; examples of such packages are all variants of conda. Please install software packages in /usw rather than in /home, e.g. as sketched below.
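A minimal sketch of such an installation, using Miniconda as an example (the installer file name and the target directory miniconda3 are assumptions; check for the current release first):

```bash
# Download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# -b: batch (non-interactive) mode, -p: install prefix under /usw instead of /home
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$USW/miniconda3"
```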
Note: /home and /usw are mounted read-only on the compute nodes, i.e. they are not writable in batch jobs.
/beegfs ($BEEGFS)
/beegfs is a classic parallel file system that is based on spinning disks. On the predecessor system a BeeGFS file system was the only large file system and was used for all kinds of data processing. On Hummel-2 /beegfs is the largest file system. It is well-suited for the classic computer simulation workload. However, it should not be used for storing and processing many small files. Small files should be packed (e.g. with tar, as sketched below). Also, data to be processed with mostly random access should be put on SSD (see below). /beegfs can also be used for backing up files stored on SSD. Every user can use /beegfs. The /beegfs directory name is kept in the environment variable $BEEGFS.
Note: Disk quotas enforce an average file size of 1 MB.
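For example, a directory tree of many small files can be packed into a single archive before it is placed on /beegfs (a sketch; directory names are placeholders):

```bash
# Pack many small files into one archive on /beegfs
tar -cf "$BEEGFS/dataset.tar" small_files_dir/
# Unpack onto fast storage (e.g. $TMPDIR or an SSD) when the files are processed
tar -xf "$BEEGFS/dataset.tar" -C "$TMPDIR"
```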
/nfs/ssdX.Y ($SSD) and /nvmeof/ssdX.Y
ssdX.Y stands for an SSD from the SSD/NVMe pool. SSDs from the pool can either be exported as an NFS file system or as block devices via NVMe-oF. The export mechanism is reflected in the first part of the file system name: /nfs or /nvmeof.
These file systems are well-suited for data-intensive computing tasks. Because there is no protection against disk failures, care should be taken to avoid data loss: it is assumed that
- input data is always a copy of data that is stored elsewhere,
- output data is copied elsewhere in a timely manner (a staging sketch follows below).
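A minimal staging sketch under these assumptions (directory names are placeholders):

```bash
# Stage input data onto the fast but unprotected SSD space
cp -r "$BEEGFS/project/input" "$SSD/"
# ... run the data-intensive processing in $SSD ...
# Copy results back promptly; the SSD pool has no redundancy
cp -r "$SSD/output" "$BEEGFS/project/"
```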
The differences between /nfs and /nvmeof are:
- /nfs/ssdX.Y will be available on all nodes.
- /nvmeof/ssdX.Y can only be available on one node. It will only be provided for very high performance demands.
At the beginning every user has a 100 GB quota in $SSD. This space can be used for working with small files or as scratch space. Please contact the HPC team if you need more SSD space.
Note: There is no protection against disk failures, i.e. if a disk fails all data stored on it is lost.
Temporary/scratch files
See also: Temporary directories ($TMPDIR).
In our understanding the term scratch implies automatic deletion. On Hummel-2 temporary directories for batch jobs will be deleted at job end.
/tmp and /dev/shm
On each Unix-like system the /tmp directory must exist for storing temporary files. Under Linux many applications use the RAM disk /dev/shm.
In batch jobs running on Hummel-2, /tmp and /dev/shm are virtual file systems that are both kept in memory. Each batch job has its private /tmp and /dev/shm, which are automatically removed at the end of the job.
Note: Usage of /tmp or /dev/shm counts as memory usage of the batch job, i.e. the job can run out of memory if too much data is written there.
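A hedged sketch of a batch job making use of its private in-memory /tmp (scheduler directives are omitted; program and file names are placeholders):

```bash
#!/bin/bash
# Unpack input onto the job-private, in-memory /tmp for fast small-file access
tar -xf "$BEEGFS/dataset.tar" -C /tmp
# Run the processing; everything written to /tmp counts as job memory usage
./my_program /tmp/dataset -o /tmp/result
# Save results before the job ends, because /tmp is removed automatically
cp /tmp/result "$BEEGFS/project/"
```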
Disk quotas
Disk quota is a means to limit file system usage:
- block quota limits disk space
- inode quota limits the number of files
On Hummel-2 disk quota is enabled on all disk systems. Users can check their disk usage with the RRZ tool rrz-quota.
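For example (invocation without arguments is assumed here; the tool's options and output format are site-specific):

```bash
# Show current disk usage and quota limits
rrz-quota
```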
Data transfer
Files can be copied to and from Hummel-2, for example with standard tools such as scp or rsync.
The server that needs to be specified in these commands is one of the login gateway nodes.
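A hedged sketch (replace <user> with your account, <gateway> with the name of one of the login gateway nodes, and <remote-dir> with a directory on Hummel-2):

```bash
# Copy a single file to Hummel-2
scp results.tar <user>@<gateway>:<remote-dir>/
# Synchronize a directory from Hummel-2 back to the local machine
rsync -av <user>@<gateway>:<remote-dir>/output/ ./output/
```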
If you need to transfer data to or from another HPC cluster, please contact the HPC team.
Sharing files
Files can be shared by setting appropriate access permission. This is explained on the page Sharing files.
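As a minimal illustration of such permissions (the group name is a placeholder; see the Sharing files page for the recommended procedure):

```bash
# Grant a project group read access to a directory and a file in it;
# note that all parent directories must also be searchable by the group
chgrp <project-group> "$HOME/shared" "$HOME/shared/data.txt"
chmod g+rx "$HOME/shared"           # group may enter and list the directory
chmod g+r  "$HOME/shared/data.txt"  # group may read the file
```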