As you may have guessed, there are several file systems serving the cluster.
Each of these file systems has its own characteristics, making it the right choice for a particular kind of workload.
Under Unix/Linux, file systems are simply mounted as “subdirectories” somewhere under the root of the file system (or “the root of the directory tree”), which is symbolized by the / (slash character). So, if you are coming from a Windows environment, the first thing you'll notice is that there is no such thing as drive letters (D:, etc.). Logically, everything appears to be in the same file system. You use a different file system by simply changing into its directory. File systems may be mounted almost anywhere in the directory tree, not just at the top. If you are not yet familiar with how Unix organizes files, use your favorite search engine to learn more. You'll need it.
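To see which mounted file system a given directory belongs to, you can ask df for its mount point. The following is only a minimal sketch: the helper name mount_point_of is made up for this example, and it assumes mount points without spaces in their names.

```shell
# mount_point_of is a hypothetical helper name, not a cluster command.
# It prints the mount point of the file system that contains the given path.
mount_point_of() {
    # df -P prints one header line and one data line in POSIX format;
    # the last column of the data line is the mount point.
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# The root of the directory tree
mount_point_of /

# The file system that holds the current directory
mount_point_of .
```

On the cluster, calling it with $BIGWORK or $HOME as argument should show the different places where these file systems are mounted.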
For ease of use and consistency of access, we provide automatically set variables called HOME, BIGWORK, SOFTWARE, PROJECT and TMPDIR, which either point to your own directories in the file systems that are mounted under /home, /bigwork, /software and /project, or to the scratch space local to each compute node (/scratch). You can access the content of a variable by putting a '$' sign in front of its name, as in the commands echo $HOME or cd $BIGWORK. To make it clear that those are variables, we'll refer to them including the dollar sign.
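If you want to check all of these variables at once, a small loop can do it. This is a sketch only; the function name show_storage_vars is an assumption made for this example. Note that $TMPDIR will typically only point to the node-local scratch space from within a job.

```shell
# show_storage_vars is a hypothetical helper name used only for this sketch.
# It lists the storage variables named above and the directory each points to.
show_storage_vars() {
    for name in HOME BIGWORK SOFTWARE PROJECT TMPDIR; do
        # printenv prints the value of an exported variable and fails if it is unset
        if value=$(printenv "$name"); then
            printf '%-8s -> %s\n' "$name" "$value"
        else
            printf '%-8s is not set in this shell\n' "$name"
        fi
    done
}

show_storage_vars
```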
The following sections describe the characteristics and best utilization of each of the cluster's filesystems.
$BIGWORK is your main working directory. It provides a comparatively high amount of storage space. Because large amounts of data change quickly in this storage, it is impossible to provide a backup. All computations except for those in $TMPDIR should be done here. $BIGWORK is connected via the InfiniBand network interface and thus inherently much faster than $HOME. It is also provided by a whole group of servers to increase performance. Both the amount of data and the number of files that you can store are limited (see below, Quota and grace time).
$BIGWORK is mounted on all nodes in the cluster (all nodes see the same $BIGWORK file system).
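One way to keep computations in $BIGWORK tidy is to give every run its own directory there. The following is a sketch under assumptions: the helper name make_run_dir and the naming pattern run.XXXXXX are arbitrary choices for this example, and the fallback directory only exists so the snippet can be tried outside the cluster.

```shell
# make_run_dir is a hypothetical helper; "run.XXXXXX" is an arbitrary naming choice.
# It creates a fresh, uniquely named directory below the given base directory
# and prints its path.
make_run_dir() {
    base=$1
    mkdir -p "$base" && mktemp -d "$base/run.XXXXXX"
}

# On the cluster the base is $BIGWORK; the fallback to a temporary
# directory is only there so the sketch can be tried elsewhere.
workdir=$(make_run_dir "${BIGWORK:-${TMPDIR:-/tmp}/bigwork-demo}")
cd "$workdir" || exit 1
echo "running in $workdir"
```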
$PROJECT is your (and your colleagues') project directory. Each project gets assigned a project directory to which all members of a project have read and write access at /project/<your-cluster-groupname> (the environment variable $PROJECT points to this location). To store individual user accounts’ data in this directory in such a manner that everyone can keep track of what belongs to whom, we suggest each account create their own subdirectory named $PROJECT/$USER and set suitable access rights (like mkdir -m 0700 $PROJECT/$USER). How you best organize your data within your group is left up to you, however. The group’s quota for the project directory is usually set to 10 TB.
$PROJECT is ONLY mounted on the login and the transfer nodes, but NOT on the compute nodes, since the hardware serving it is quite modest. So you CANNOT USE this file system IN YOUR JOBS (use $BIGWORK instead).
Copying between $BIGWORK and $PROJECT is thus only possible on the transfer or the login nodes, where a fast connection between both file systems is available. $PROJECT is intended for separately storing files that are important for the work of an institute on the cluster, i.e. for preparation, curation and long-term retention of input data, results and setups. It is configured in such a way that all data are physically written in two copies. Due to the amount of data, however, no additional backup is provided.
We regard data in $PROJECT as belonging to the institute or the project, respectively. If you want to make sure that the project management at your institute can still access your data after you have left the LUH, put a copy of your project results here, and we can easily give access to the person in charge of the project.
Please note: Backing up your data regularly from $BIGWORK to $PROJECT storage and/or to your institute’s servers is considered essential, since $BIGWORK is designed as a scratch file system. $PROJECT should NOT be considered a real backup either, since it is also disk based. Disks may fail, and computing centers may also suddenly burn to the ground when you least need it.
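A regular copy from $BIGWORK to $PROJECT can be as simple as packing a result directory into one dated archive. The sketch below is an illustration, not a prescribed procedure: the function name archive_to, the archive naming scheme and the directory name mydir are assumptions. Since $PROJECT is not mounted on the compute nodes, this would have to run on a login or transfer node.

```shell
# archive_to is a hypothetical helper name for this sketch.
# usage: archive_to <source directory> <destination directory>
# Packs the source directory into one dated, compressed tar file in the
# destination directory and prints the path of the archive.
archive_to() {
    src=$1
    dest=$2
    name=$(basename "$src")
    archive="$dest/$name-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" || return 1
    # -C makes the paths inside the archive relative to the parent of src
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    # listing the archive is a cheap check that it can be read back
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}

# On a login or transfer node (mydir is an example directory name):
# archive_to "$BIGWORK/mydir" "$PROJECT/$USER"
```

Storing one archive instead of many individual files also keeps the number of files in $PROJECT low.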
Your home directory is the directory in which you are normally placed directly after a login from the command line. You should only place small and particularly important files here that would be especially time-consuming to create anew, like templates for job setups, important scripts, configuration files for software etc. $HOME is provided by only one server, which intentionally is connected via relatively slow Gigabit Ethernet to the remainder of the system (to protect the server from becoming overloaded). In comparison to $BIGWORK, this file system is very slow and particularly unsuitable for data-intensive tasks. If you succeed in overloading $HOME, all users will notice.
Please note: If you overstep your quota (see following section) in this directory – which can happen very easily and quickly if you disregard or did not read our advice and let your compute jobs run there – you will make yourself unable to log in graphically with tools like X2Go, and you will then need to delete files before you can continue to work. In case you run into that situation, do not make matters worse by simply deleting “anything”, because there are important files here that e.g. give you access to parts of the system. If you indiscriminately delete everything here, you'll definitely embarrass yourself and need to ask for administrative assistance next, which takes much longer than if you had instead taken some time to check which files were created by a job of yours. On the other hand, if you did not read this at all, we will surely point you to this section.
So you should only place important files in $HOME that would be particularly difficult or laborious to re-create after a data loss. If possible, you should not access data in $HOME at all from within your compute jobs, since, due to the technical properties described above, this would quite probably result in a job that takes much longer to complete and would also impede other users’ work. In addition, you should carefully point all directories that may have been set automatically by application software (quite often those are temporary directories) to $BIGWORK. Do NOT use a symbolic link between $BIGWORK and $HOME in a compute job to creatively cheat around the circumstances. That does not help either, and the impact is almost the same as if you were directly doing the wrong thing. Use the environment variable $BIGWORK for convenient access in a computation, and other than that, avoid anything that is in your $HOME. Do NOT place pip, conda, CRAN/R or similar installations here; use your $SOFTWARE directory and, if possible, share installations with colleagues. Also have a look at the exercise.
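How to redirect such automatically chosen directories depends on the application, so the following is only a sketch under assumptions. XDG_CACHE_HOME and PIP_CACHE_DIR are existing environment variables read by many tools and by pip, respectively; the function name use_bigwork_for_caches and the variable APP_TMP are made up and stand in for whatever your own software expects.

```shell
# use_bigwork_for_caches is a hypothetical helper name.
# It points common cache and temporary locations below the given directory
# instead of the home directory.
use_bigwork_for_caches() {
    base=$1
    mkdir -p "$base/.cache" "$base/tmp" || return 1
    # XDG_CACHE_HOME is honoured by many tools that would otherwise use ~/.cache
    export XDG_CACHE_HOME="$base/.cache"
    # PIP_CACHE_DIR is pip's environment variable for its --cache-dir option
    export PIP_CACHE_DIR="$base/.cache/pip"
    # APP_TMP is a made-up placeholder for a variable your application reads
    export APP_TMP="$base/tmp"
}

# In a job script on the cluster:
# use_bigwork_for_caches "$BIGWORK"
```

Check the documentation of your application for the variable or option it actually uses.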
Because the amount of data in $HOME is limited and, by implication, changes only slowly, the system can provide a daily backup of data in $HOME.
Memory Aid: “Never WORK at HOME”. It simply is not built for that, and others will feel the impact of what you are doing. Do not abuse $HOME as a backup directory for your compute results, that's a very bad idea.
$HOME is mounted on all nodes in the cluster (all nodes see the same $HOME file system).
Within jobs, the variable $TMPDIR points to local storage available directly on each node. Whenever local storage is needed, $TMPDIR should be used.
Do not simply assume $TMPDIR to be faster than $BIGWORK – test it. $TMPDIR can be used for temporary files for applications that imperatively require a dedicated temporary directory. There will probably be a performance improvement if you have many very small files and use $TMPDIR, particularly if you are using a node that is equipped with an SSD scratch disk. Lustre (BIGWORK) is not good at handling large numbers of small files, which is the reason that the number of files is strictly limited on BIGWORK. So the only option is to pack datasets containing millions of files into a large tar file and then “locally” extract them on the fast SSD scratch disk before the job starts.
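The pack-and-extract approach described above could look roughly like this. It is a sketch, not a fixed recipe: the function names pack_dataset and unpack_dataset and the names dataset and dataset.tar are assumptions for illustration.

```shell
# pack_dataset and unpack_dataset are hypothetical helper names.

# usage: pack_dataset <dataset directory> <tar file>
# Run once, e.g. on a login node: turns many small files into a single file.
pack_dataset() {
    tar -cf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# usage: unpack_dataset <tar file> <target directory>
# Run at the start of the job: extracts the files to node-local storage.
unpack_dataset() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}

# In a job script (dataset and dataset.tar are example names):
# unpack_dataset "$BIGWORK/dataset.tar" "$TMPDIR"
# ls "$TMPDIR/dataset"
```

This way only one large file lives on $BIGWORK, and the many small files exist only on the local scratch disk for the duration of the job.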
$TMPDIR is strictly local to each node. This means that if you write data to the local $TMPDIR on one node, it will NOT be visible on any other node in the cluster. If your job occupies two or more nodes, the tasks on each node will see a different local directory.
On the Dumbo partition, we have an intentionally large $TMPDIR partition (16 TB) made up of classic (mechanical) hard disk drives. Most of the other nodes only have about 100 GB of local storage. To find out exactly how much is available on each node type, in case that is important, run the command df -h $TMPDIR in your job script. Some FCH partitions have explicitly been equipped with large scratch SSDs by their respective institutes due to their particular workload (many small files that cannot reasonably be stored in Lustre).
Please note: As soon as a job finishes, all data stored under $TMPDIR will be deleted automatically.
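Anything you want to keep therefore has to be copied to shared storage as the last step of the job itself. A minimal sketch, assuming a results directory inside $TMPDIR; the function name save_results and the names results and myjob are made up for this example.

```shell
# save_results is a hypothetical helper name.
# usage: save_results <local results directory> <destination directory>
# Packs the results directory into a single tar file on shared storage.
# A non-zero exit status tells the job script that the results were not saved.
save_results() {
    mkdir -p "$2" || return 1
    tar -cf "$2/$(basename "$1").tar" -C "$(dirname "$1")" "$(basename "$1")"
}

# In a job script, as the last step while the job is still running
# (results and myjob are example names):
# save_results "$TMPDIR/results" "$BIGWORK/myjob" || echo "saving results failed" >&2
```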
The $SOFTWARE directory enables a project to install its own software in such a way that the whole group can use it. You should also use this directory (or a personal subdirectory you create here) for installations done via pip, conda, R/CRAN and the like.
$SOFTWARE points to a cluster-wide shared directory located at /software/<your-cluster-groupname>. If you want to narrow down the list of usernames in your project that should have access (e.g. read access) to your software installation directory, we recommend the following: create a subdirectory that only your account can access using the command:
mkdir -m 0700 $SOFTWARE/mysoft
and thereafter grant additional usernames the access you desire using an access control list (ACL):
setfacl --mask --modify user:<username>:rx,default:user:<username>:rx $SOFTWARE/mysoft
Check the ACL you set using getfacl, and see man setfacl and man getfacl for more information about ACLs. Subdirectories created in $SOFTWARE can only be removed by the user who created them (the “sticky bit” is set for this directory) to prevent colleagues with no read/execute access to subdirectories from accidentally removing them.
In case you are the project manager of a project with us and your institute has special requirements (like, only a few persons should be able to install/maintain software in this directory), talk to us and we'll change ownership of the directory to the project manager's account so you can completely decide for yourself whom to grant which permissions. In this case, you should take extra care that file permissions are set properly to make sure only members of your group can write into or have access to the directory so the executables you'll find here stay your own. Whomever you grant access may change files here, and if you grant permissions to “others” here and only one account on the cluster gets hacked, that may have disastrous consequences for your group.
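To check whether anything in a software directory is writable by “others”, a find call on the permission bits is usually enough. This is a sketch; the function name list_world_writable is an assumption, and it only looks at classic permission bits, not at ACL entries.

```shell
# list_world_writable is a hypothetical helper name.
# It prints every file and directory below the given path that anyone on the
# system ("others") may write to. Symbolic links are skipped because their
# own permission bits are not meaningful.
list_world_writable() {
    find "$1" ! -type l -perm -0002
}

# On the cluster:
# list_world_writable "$SOFTWARE"
# To remove write permission for others from everything below a directory:
# chmod -R o-w "$SOFTWARE/mysoft"
```

An empty result means that no entry below the directory grants write permission to others.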
No quota (yet) is set on this directory. The whole file system provides about 2 TiB of space shared by all users of the cluster. There's also no backup at this time. And it would be an exceptionally bad idea to abuse $SOFTWARE to store job results.
On the shared storage systems (i.e. $HOME, $BIGWORK and $PROJECT), only a fraction of the whole disk space is made available to you or your account, respectively. This amount is designated as quota. The purpose is to ensure everybody gets a share of the resources, and also to limit the effects of accidents that otherwise could easily fill up the whole system. In case your current quota on $BIGWORK or $PROJECT would severely impede your work, let us know what you would need, stating account name, amount needed and the time frame (i.e. for how many months you would need an increased quota).
There is a soft quota and a hard quota. For each of these two quota types, two independent parameters are set, namely block quota (amount of gigabytes you can store) and inode quota (number of files you can create). So there are four values you need to consider.
The hard quota is the absolute upper bound, which you cannot exceed unless an administrator raises the limit. The soft quota, on the other hand, may be exceeded for some time, the so-called grace time.
Exceeding any of the two parameters (block or inodes) of your soft quota starts the respective grace timer.
During the grace time, you are allowed to exceed your soft quota up to the limits of your respective hard quota. After the grace timer has run out, you will not be able to store any new data, unless you reduce disk space usage below the soft quota. Since we get asked from time to time: no, the system will NOT delete random files of yours after the grace period has expired. What is on disk stays on disk. You will just not be able to write new data as long as you are over the soft limit when the grace period is over.
As soon as your disk consumption falls below the soft quota again, the grace time counter for that file system and that parameter is reset. On $HOME and $BIGWORK, quotas are valid for each individual user, whereas on $PROJECT, quotas are accounted for the whole group (the project you are a member of, usually a four-letter code followed by a five-digit number; use the id command to check).
For both $BIGWORK and $PROJECT, the block grace time is 3 weeks, whereas the inode grace time is 5 weeks. On $HOME, both grace periods are set to 1 week.
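The interplay of soft quota, hard quota and grace time can be summarized as a small decision function. This only models the rules described above for one parameter (block or inode); the real decision is made by the file system. The function name write_allowed and all numbers are made up for illustration.

```shell
# write_allowed is a hypothetical helper that only models the rules above;
# the real decision is made by the file system, not by a script.
# usage: write_allowed <used> <soft> <hard> <grace_expired: yes|no>
# All amounts use the same unit (e.g. GB for block quota, files for inodes).
write_allowed() {
    used=$1 soft=$2 hard=$3 grace_expired=$4
    if [ "$used" -ge "$hard" ]; then
        return 1    # hard quota reached: nothing more can be stored
    elif [ "$used" -le "$soft" ]; then
        return 0    # at or below the soft quota: no restriction
    elif [ "$grace_expired" = "no" ]; then
        return 0    # over the soft quota, but the grace timer is still running
    else
        return 1    # over the soft quota and the grace time is over
    fi
}

# Example numbers are made up and are not the cluster's real quota values.
write_allowed 80 100 150 no   && echo "80 of 100/150: ok"
write_allowed 120 100 150 no  && echo "120 of 100/150, grace running: ok"
write_allowed 120 100 150 yes || echo "120 of 100/150, grace expired: blocked"
write_allowed 150 100 150 no  || echo "150 of 100/150: blocked by hard quota"
```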
The quota mechanism protects users and system against possible errors of others, limits the maximum disk space available to an individual user, and keeps the system performance as high as possible by avoiding unnecessary clutter. In general, we ask you to delete files which are no longer needed to keep the file systems fast. Low disk space consumption is especially helpful on $BIGWORK in order to optimise system performance. You can query your disk space usage and quota with the command checkquota (see also exercise). In case the standard system quota values are too low for you, talk to us and tell us which account needs how many GB for how long (and why); we try to accommodate requests within reason. The quota values are set such that usually 90 % of our users should hardly notice the limits, except for the usual need to periodically clean up and back up results.
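When checkquota shows that you are close to a limit, it helps to know which directories hold the most data or the most files. The following sketch uses only standard tools; the function name usage_report is an assumption, and the counts are an approximation of what the block and inode quotas account for.

```shell
# usage_report is a hypothetical helper name; checkquota remains the
# authoritative source for your quota.
# It prints the disk usage (in KiB) and the number of files and directories
# below the given path.
usage_report() {
    kib=$(du -sk "$1" | awk '{ print $1 }')
    entries=$(find "$1" | wc -l)
    # $((...)) strips the padding some wc implementations add
    echo "$1: $kib KiB in $((entries)) files and directories"
}

# On the cluster, to get one line per subdirectory of your bigwork:
# cd $BIGWORK && for dir in */; do usage_report "$dir"; done
```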
Please note: If your quota is exhausted on $HOME, you will not be able to login graphically using X2Go any more. Connecting using ssh (without -X) will still be possible.
Please note: All statements made in this section also apply to $PROJECT storage.
On the technical level, $BIGWORK is composed of multiple components which make up the storage system. Generally speaking, using $BIGWORK without changing any default values should usually work well. However, it may be useful under certain circumstances to change the so-called stripe count, in particular if you need to handle files of 1 GB or more, or if you access different parts of the same file in highly parallel applications running on several nodes. Balancing I/O, investing some time in understanding Lustre and testing different setups may result in a higher performance and a better-balanced use of the overall system, which in turn is beneficial for all users.
Data on Lustre-based file systems such as $BIGWORK is saved on OSTs, Object Storage Targets. The location of the data (which OST(s)?) is registered with a metadata server (“MDS”) and stored on an MDT, the Meta Data Target. Each OST, and the MDT as well, looks like one large hard drive (a “block device”) to the system. In reality, each target consists of several hard drives bundled together in the background by a storage controller to ensure e.g. some fault tolerance and higher performance, but this is not visible to the computers using the file system. The first takeaway message here is that Lustre stores files on several things that look like separate hard drives, which may be taken into consideration when you need more performance, but together the whole thing just looks like another file system in a subdirectory ($BIGWORK in this case).
By default, files are written to a single OST each, regardless of their size. This corresponds to a stripe count of one. The stripe count determines how many OSTs will be used to store data. Figuratively speaking, it determines in how many stripes a file is being split (like a Zebra, but possibly with more than just two colors…). Splitting data over multiple OSTs can increase access speeds, since the read and write speeds of several OSTs and thus a higher number of hard drives are used in parallel. At the same time, one should only distribute large files in this way, because access times can also increase if you have too many small requests to the file system. Depending on your personal use case, you may need to experiment to find the best setting. When doing this, consider that the current workload of the whole system may influence your results.
Please note: If you are working with files larger than 1 GB, and for which access times e.g. from within parallel computations could significantly contribute to the total duration of a compute job, please consider setting stripe count manually according to section.
Stripe count is set as an integer value representing the number of OSTs to use, with -1 indicating all available OSTs. It is advised to create a directory below $BIGWORK and set a stripe count of -1 for it. This directory can then be used e.g. to store all files that are larger than 100 MB. For files significantly smaller than 100 MB, the default stripe count of one is both better and sufficient.
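The rule of thumb above can be written down as a small helper that suggests a stripe count from a file size. This is a sketch only: the function name suggest_stripe_count is made up, and reading 100 MB as 100 * 1024 * 1024 bytes is an assumption. As described above, measuring your own workload may lead to a different choice.

```shell
# suggest_stripe_count is a hypothetical helper name.
# usage: suggest_stripe_count <file size in bytes>
# Prints -1 (all OSTs) for files larger than 100 MB and 1 otherwise.
# 100 MB is taken as 100 * 1024 * 1024 bytes here, which is an assumption.
suggest_stripe_count() {
    threshold=$((100 * 1024 * 1024))
    if [ "$1" -gt "$threshold" ]; then
        printf '%s\n' -1
    else
        printf '%s\n' 1
    fi
}

# wc -c prints the size of a file in bytes (results.dat is an example name):
# suggest_stripe_count "$(wc -c < results.dat)"
```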
Please note: In order to alter the stripe count of existing files, these need to be copied, see section. Simply moving files with mv is not sufficient in this case.
In this section we go through several examples of working with the cluster storage systems.
    # where are you? lost? print working directory!
    pwd

    # change directory to your bigwork/project/home directory
    cd $BIGWORK
    cd $PROJECT
    cd $HOME

    # display your home, bigwork & project quota
    checkquota

    # make personal directory in your group's project storage
    # set permissions (-m) so only your account can access
    # the files in it (0700)
    mkdir -m 0700 $PROJECT/$USER

    # copy the directory mydir from bigwork to project
    cp -r $BIGWORK/mydir $PROJECT/$USER
    # get overall bigwork usage, note different fill levels
    lfs df -h

    # get current stripe settings for your bigwork
    lfs getstripe $BIGWORK

    # change directory to your bigwork
    cd $BIGWORK

    # create a directory for large files (anything over 100 MB)
    mkdir LargeFiles

    # get current stripe settings for that directory
    lfs getstripe LargeFiles

    # set stripe count to -1 (all available OSTs)
    lfs setstripe -c -1 LargeFiles

    # check current stripe settings for LargeFiles directory
    lfs getstripe LargeFiles

    # create a directory for small files
    mkdir SmallFiles

    # check stripe information for SmallFiles directory
    lfs getstripe SmallFiles
Use the newly created LargeFiles directory to store large files.
Sometimes you might not know beforehand how large the files created by your simulations will turn out. In this case you can set the stripe count after a file has been created in two ways. Let us create a 100 MB file first.
    # enter the directory for small files
    cd SmallFiles

    # create a 100 MB file
    dd if=/dev/zero of=100mb.file bs=10M count=10

    # check filesize by listing directory contents
    ls -lh

    # check stripe information on 100mb.file
    lfs getstripe 100mb.file

    # move the file into the large files directory
    mv 100mb.file ../LargeFiles/

    # check if stripe information of 100mb.file changed
    lfs getstripe ../LargeFiles/100mb.file

    # remove the file
    rm ../LargeFiles/100mb.file
In order to change the stripe count, the file has to be copied (cp). Simply moving (mv) the file will not affect the stripe count.
    # from within the small files directory
    cd $BIGWORK/SmallFiles

    # create a 100 MB file
    dd if=/dev/zero of=100mb.file bs=10M count=10

    # copy file into the LargeFiles directory
    cp 100mb.file ../LargeFiles/

    # check stripe in the new location
    lfs getstripe ../LargeFiles/100mb.file
    # create empty file with appropriate stripe count
    lfs setstripe -c -1 empty.file

    # check stripe information of empty file
    lfs getstripe empty.file

    # copy file "in place"
    cp 100mb.file empty.file

    # check that empty.file now has a size of 100 MB
    ls -lh

    # remove the original 100mb.file and work with empty.file
    rm 100mb.file
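If you restripe files by copying them regularly, you may want to verify the copy before deleting the original. The following is a sketch under assumptions: the function name restripe_into is made up, and it relies on the behaviour described above, namely that a new copy picks up the stripe settings of the directory it is created in.

```shell
# restripe_into is a hypothetical helper name.
# usage: restripe_into <file> <target directory>
# Copies the file into the target directory, where the new copy picks up the
# directory's stripe settings, and removes the original only after the copy
# has been verified byte by byte.
restripe_into() {
    src=$1
    target="$2/$(basename "$1")"
    cp "$src" "$target" || return 1
    if cmp -s "$src" "$target"; then
        rm "$src"
    else
        echo "copy of $src differs from the original, keeping both" >&2
        return 1
    fi
}

# On the cluster, from within $BIGWORK:
# restripe_into SmallFiles/100mb.file LargeFiles
# lfs getstripe LargeFiles/100mb.file
```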