FAQ: Frequently asked questions


I cannot connect to the cluster! OR uploading a file fails

→ First check whether you get any error messages at all or whether the system stays completely silent. If it stays silent, check whether you are possibly “outside” of the LUH network; in that case you first need to connect via VPN.

→ If you can connect to the cluster from a shell (command line) via ssh, but cannot log in graphically via X2Go, chances are high that you are over your quota (maximum disk space allocation) in the HOME directory and/or your grace period has expired. Use the command “checkquota” on the command line to verify this. Delete or move files from within the ssh shell to get below your quota limits in HOME and then try again.


I get kicked out of my connection when trying to load a larger model or to compute something

→ You are probably trying to run something on the login nodes, which are unsuitable for computations and anything larger. For that reason, there is a limit of 1800 CPU seconds, after which processes get terminated. 1800 CPU seconds are enough for months of shell access, but they get used up quickly when you try to calculate something on multiple CPU cores. Read the docs on how the cluster is intended to be used. As a first step, check out our OOD portal at https://login.cluster.uni-hannover.de and submit an interactive job (see the sketch below). It is never a good idea to try anything more than small tests, editing or a (small) compilation on a login node, since login nodes are shared between all users (simply run the command w at the shell prompt to see how many others are working on the same machine).
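
A minimal sketch of that workflow; the resource values (one hour, four cores, 8 GB of memory) are placeholders to illustrate the syntax, not recommendations:

  # see how many other users share this login node
  w
  # request an interactive job from Slurm instead of computing on the login node
  salloc --time=01:00:00 --nodes=1 --ntasks=4 --mem=8G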


When trying to connect to the cluster, I get: "Missing font family helvetica, OpenGL might have problems on this machine"

→ You are probably using X2Go and did not select the “Full Installation” option during setup. Please reinstall X2Go and make sure all fonts are installed.


The command checkquota tells me that I am over my quota, but where actually //are// these files?

→ Use the command du -mad1 | sort -n in the file system where checkquota reports the problem. You can quickly change to the respective file system using the environment variables we automatically set for you: a plain cd moves to your $HOME, while cd $BIGWORK and cd $PROJECT move to the other two major file systems.

Explanation: du = “check disk usage”, -m = “show megabyte units” (to enable numerical sorting afterwards), -a = “show all files, including those starting with a dot” (these are otherwise “hidden”, and some applications like e.g. Ansys use dot-directories e.g. for temporary files), -d1 = “only show entries up to a depth of one (this level)” (otherwise the output would list every level of the directory tree). The pipe symbol | passes the output of du to sort, which sorts what it gets numerically (-n).
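
Put together, hunting down the space hogs on $BIGWORK could look like the sketch below; the directory name is a placeholder:

  # change to the file system checkquota complained about
  cd $BIGWORK
  # list the size of every entry at this level in megabytes, largest last
  du -mad1 | sort -n
  # repeat inside the largest directory until you have found the culprit
  cd <large_directory>
  du -mad1 | sort -n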


My jobs do not run - is there something wrong with the cluster?

→ try these two commands:

showq

squeue

If you see several of your jobs with showq but none with squeue, or if you submitted your jobs using qsub instead of sbatch/salloc, you are trying to use the old PBS/Torque/Maui system, which has been gradually phased out over the past year (2021/2022). Since June 2022, the last useful resources have been migrated to Slurm, so now it is definitely time to migrate your jobs. It is usually a simple matter of replacing your #PBS directives with their #SBATCH counterparts.
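
A sketch of such a translation; the job name and resource values below are examples only, not recommendations for this cluster:

  # old PBS/Torque header (obsolete, submitted with qsub)
  #PBS -N myjob
  #PBS -l nodes=1:ppn=4
  #PBS -l walltime=02:00:00
  #PBS -l mem=8gb

  # equivalent Slurm header (submitted with sbatch)
  #SBATCH --job-name=myjob
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=4
  #SBATCH --time=02:00:00
  #SBATCH --mem=8G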

The same goes for the Open OnDemand portal: weblogin.cluster.uni-hannover.de points to the obsolete batch system, whereas login.cluster.uni-hannover.de directs you to the current scheduler, Slurm.


I need millions of files in BIGWORK for my project, please increase my quota!

→ No.

Unfortunately, we cannot do that. Lustre in its current configuration is good at delivering large bandwidths, but very weak with many very small files, since those are something the metadata server (MDS) of the file system has to handle alone. This file system is simply not suitable for many small files (much file system hardware is severely bottlenecked when it comes to IOPS, but we hope the situation will improve with the next file system). The result would be both slow performance and decreased overall system stability. A possible workaround is to pack all the small files into one larger dataset, either in a suitable format like HDF5 or with good old tar, creating one large and possibly striped file on BIGWORK, and then quickly extracting that file within your job to a (hopefully fast) local scratch SSD directory; see the sketch below.
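
A minimal sketch of the tar workaround; the archive name is a placeholder, and $TMPDIR is assumed to point to node-local scratch space, so adjust it to whatever path your compute nodes actually provide:

  # pack the many small files once, e.g. on a login node
  cd $BIGWORK
  tar -cf dataset.tar many_small_files/

  # inside the job script: extract to node-local scratch and work from there
  tar -xf $BIGWORK/dataset.tar -C $TMPDIR
  cd $TMPDIR/many_small_files
  # ... run your computation here ...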

I get a message about the (external) email address I use to receive status mails about my jobs

→ Sending lots of similar emails to external freemail providers leads to the uni-hannover.de domain being blocked by those sites, which in turn means that almost nobody external receives mail from LUH accounts or machines any more. Therefore, we ask you to use your institute's email account (the one ending in @xxxx.uni-hannover.de), and we also ask you NOT to set up an automatic forward to a Google, GMX or similar external freemailer account.


What is the meaning of "Script is written in DOS/Windows text format"?

→ Your job script was saved with DOS/Windows line endings (CR+LF) instead of Unix line endings (LF), which the shell and the batch system cannot handle. Convert the script to Unix format (e.g. with dos2unix) or save it accordingly in your editor, then submit it again.
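
A quick way to check and fix this, assuming your script is called myjob.sh (a placeholder name):

  # "with CRLF line terminators" in the output indicates DOS/Windows format
  file myjob.sh
  # convert the script to Unix line endings in place
  dos2unix myjob.sh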

My jobs have crashed, and the job status says something about "out of memory"

→ Use the command

sacct -j <myjobid> -o JobID,NodeList,ReqMem,AveRSS,MaxRSS,MaxVMSize,State

to find out how much memory you requested in your job and how much memory your job tried to use. If the number under MaxRSS is near or even above ReqMem, you should request more memory in your job, plus a little reserve.

VM = virtual memory. This is usually the memory that is allocated (but not necessarily used).

RSS = resident set size, i.e. memory that is allocated AND actually used in some way, usually in pieces of 4 kB, which is the standard memory page size on Linux/x64.

When requesting memory, always try to leave at least 4 GB of memory to the operating system on each node your job will run on; you will also benefit if the system has memory left for buffers etc. The table of available resources shows the physical memory, from which you need to subtract these 4 GB.
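
To illustrate the headroom rule with an assumed example: on a node with 256 GB of physical memory (check the hardware table for the real values) you would request at most about 252 GB, for instance:

  #SBATCH --nodes=1
  #SBATCH --mem=250G   # stays a few GB below the assumed 256 GB of physical memory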

Attention: the sacct command above only tells you about the system memory usage of your job. If you run into OOM on a graphics card, you will still have to adapt your usage, but that will not show up in sacct.

If some of your jobs survive despite demanding too much memory, that is because the limits are not enforced every few microseconds, since doing so would cost performance as well. So be kind to your colleagues on the same system: because everything runs in real time, they may very well be the first victims if you occupy more than your share. Please take OOM seriously; such an event may even lead to a system crashing quite unexpectedly. See https://lwn.net/Articles/104185/ (do not read if you are afraid of flying, and rest assured that things do not work like that in airplanes).


My jobs sometimes crash, depending on which partition they run on, and I see "illegal instruction" messages in my job logs

→ This may be due to you trying to run software that was compiled to use newer CPU features (like AVX512) than are available on the partition the job ran on; for example, the software was compiled for Skylake, but the job ran on Haswell. Use the command lcpuarchs -vv on a login node of the cluster to check which CPU generations are available.

When compiling software yourself, keep in mind that we have several login nodes from various CPU generations; see Modules & Application Software, Build Software from Source Code. The CPU architecture of the login node you compile on may automatically determine which platform your installer/make/… compiles for: some packages assume you want to run the software on the same architecture you installed it on, and this will of course fail if the login node you compile/install on is newer than the partition your jobs later run on. hostname shows you the name of the node you are on, lscpu shows you details about the CPUs in place.

Use #SBATCH --partition= to restrict your jobs to certain partitions, and use the table under Hardware specifications of cluster compute nodes to find out which CPU generation the major partitions run on; a minimal sketch follows below.
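
The partition names in this sketch are placeholders, take the real ones from the hardware table:

  # run only on partitions whose CPUs support the instructions your binary needs
  #SBATCH --partition=partition_a,partition_b

  # commands to check CPU generations and the current node's CPU
  lcpuarchs -vv
  lscpu | grep "Model name"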


My jobs do not start immediately, even when I know the resources should be available (e.g. using an FCH partition) OR "slurmstepd: error: container_p_join: open failed [...]"

As a contribution to the urgent power saving request issued by the president of the LUH, we have switched on an energy saving option in Slurm which shuts down nodes that have been idle for more than a few minutes. Slurm automatically boots these nodes when they are needed again. While booting, these nodes show the status “POWERING_UP” and jobs waiting for them show “CF/CONFIGURING”.

The mechanism works well for batch jobs. However, when you request an interactive job via the salloc command that needs at least one powered-off node, you will experience a relatively long waiting time (5 minutes) while the nodes boot, and due to a possible bug regarding the timing of the requirements in the background, you may also get the above-mentioned error message when a node has just booted. As a workaround, simply repeat the command line in that case; we are trying to find a better solution.

An alternative to salloc is the command srun [<resource requests>] --pty $SHELL -l, which should not show this behaviour. Please note that the --pty $SHELL -l part must come last, otherwise you will get a different error.
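
For instance, a minimal interactive session could be requested like this; the resource values are placeholders to illustrate the syntax:

  # interactive shell on a compute node; note that --pty $SHELL -l comes last
  srun --time=01:00:00 --ntasks=4 --mem=8G --pty $SHELL -l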


When compiling, I get an error "unsupported option --add-needed"

That is usually a problem of using the “Gold” linker instead of the “BFD” linker (both flavours are usually present on a system). With gcc, the option to use is -fuse-ld=bfd.
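
For example, the flag can be added to a single gcc call or, for build systems that honour LDFLAGS, to the environment; the file names here are placeholders:

  # select the BFD linker explicitly for this compilation
  gcc -fuse-ld=bfd -o myprog myprog.c
  # or set it for an entire build
  export LDFLAGS="-fuse-ld=bfd $LDFLAGS"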
