FAQ: Frequently asked questions
Formal issues
I forgot my password!
→ Go to your project management (usually people at your own institute), who will help you with that. We cannot and will not reset your password; that is entirely the project management's responsibility.
I get mail from the cluster, but I am not working at the LUH any more - please remove me from the list!
→ Talk to the person at your former institute who provided you with your account credentials.
Background: accounts on the cluster are not created by us, but by the person responsible for managing the institute's compute project with us. They should have deactivated your account when you left the institute, which also removes you from our mailing lists. In addition, the email addresses that these removal requests come from are usually NOT on any of our lists. The senders only get our email because messages addressed to their former institute's mail account are still being forwarded and that account has not been properly deactivated/removed.
Really frequently asked questions, connecting to the cluster, disk quota problems
I can not connect to the cluster!
→ First check whether you get any messages at all or whether the system appears totally silent. If it is silent, check whether you are possibly “outside” of the LUH network. You'd then first need to connect via VPN.
→ If you can connect to the cluster from a shell (command line) via ssh, but are denied login graphically via X2Go, chances are high that you are over your quota (maximum disk space allocation) in the HOME directory and/or your grace period has expired. Use the command “checkquota” on the command line to verify this, and/or see the corresponding FAQ about “Disk Quota Exceeded”.
Disk Quota Exceeded errors OR Upload of file fails OR Can not create file / directory
→ Use the command “checkquota” on the command line to find out which filesystem is affected. Read the documentation about the file systems provided by the cluster and how to use them to understand which one should be used for what purpose. Then delete/move files to get below your quota limits and try again.
DO NOT simply delete files or directories in your HOME before understanding what they do; in particular, DO NOT simply delete files or directories like .ssh, .x2go or .bashrc. Otherwise you will just shoot yourself in the other foot, too, and you will get stuck again asking for help, taking MUCH longer to solve your problem. Sit back and THINK first.
Have a look in other directories like .cache/sessions and directories that evidently look like temporary/scratch locations. You will need to analyze how these files got there. The initial setup of a new account only needs a few kilobytes, and it is very easy to stay below a few megabytes in your HOME and run perfectly fine; so large amounts of data will have been put there by YOU (usually using conda or pip). To learn how to change the location Conda installs to, read the corresponding section in Conda tips. Also have a look at the next FAQ, “where actually are these files?”.
The command checkquota tells me that I am over my quota, but where actually are these files?
→ Use the command du -mad1 | sort -n in the file system where checkquota reports the problem. You can quickly change to a file system using the environment variables we automatically set for you: cd moves to your $HOME, cd $BIGWORK and cd $PROJECT move to the other two major file systems.
Explanation:
du = “check disk usage”
-m = “show megabyte units” (to enable numerical sorting afterwards)
-a = “list all files, not just directories”. Entries starting with a dot (which are otherwise “hidden”) are included by du in any case; some applications like e.g. Ansys use dot-directories e.g. for temporary files
-d1 = “only show entries up to a depth of one (this level)” (otherwise the command would print every level of the directory tree)
| = pass the output of du to sort, which sorts what it gets numerically (-n)
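If the full listing is too long, you can cut it down to the largest entries. The following is only a sketch: biggest is a made-up helper name (not a command installed on the cluster), and the default of ten entries is an arbitrary choice.

```shell
# List the N largest entries (in MB) directly below a directory.
# "biggest" is a hypothetical helper name, not a cluster command.
biggest() {
    dir=${1:-.}        # directory to inspect, defaults to the current one
    count=${2:-10}     # number of entries to show, defaults to 10
    # -m: megabytes, -a: files as well as directories, -d1: this level only
    du -mad1 "$dir" 2>/dev/null | sort -n | tail -n "$count"
}

# Example: the ten largest entries in your HOME
# biggest "$HOME" 10
```

The last line of the output is the directory itself, i.e. the total; the lines above it are the largest files and subdirectories.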
I need more space on BIGWORK, please increase my quota!
Please state how much storage you need, why you need it, for how much time and, of course, the account name that needs the space. We will be happy to consider reasonable requests.
The default quota is a compromise between the needs of individual users and the available disk space. The limits are set so that about 90 % of our users should hopefully never encounter any difficulties. The quota also exists
- to prevent job scripts or users from accidentally filling up the BIGWORK directory due to a mistake,
- to ensure that people clean up their scratch files from time to time, since disk storage is a limited resource shared by all,
- as a reminder that one always needs to make backup copies of important results and data on one's own storage, since it is neither possible to provide a backup for BIGWORK (due to the fast-changing nature of that file system and due to performance considerations), nor is there a guarantee that this file system will not become corrupted and lose data one day,
- to reserve some space enabling us to consider requests for more space for those users who really need it for some period of time.
I need millions of files in BIGWORK for my project, please increase my inode quota!
→ No.
Unfortunately, we can't do that. Lustre in the current configuration is good at scaling to large bandwidths, but very weak with many very small files, since that is something the metadata server (MDS) of the file system has to carry alone. This file system is just not suitable for many small files (much “classic” file system hardware using mechanical hard disk drives is severely bottlenecked when it comes to IOPS; we hope the situation will get better with the next filesystem in a few years, which may consist of SSDs that can handle orders of magnitude more IOPS). The result would be both slow performance and decreased overall system stability. A possible workaround is to pack all the small files into one larger dataset, either with a suitable format like HDF5, or using good old tar to create a large and possibly striped file on BIGWORK. Then quickly extract that file at the start of your job, ideally to a directory on a fast local scratch SSD. For this, you'll of course need to know on which nodes/partition you will run.
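The tar workaround could look roughly like this. It is a sketch under assumptions: pack_dataset and unpack_dataset are made-up helper names, and using $TMPDIR as the node-local scratch directory is an assumption; check the file system documentation for the actual location on the partition you use.

```shell
# Pack a directory of many small files into ONE archive file,
# so that BIGWORK only has to hold a single large file.
# pack_dataset / unpack_dataset are hypothetical helper names.
pack_dataset() {
    src_dir=$1      # directory containing the small files
    archive=$2      # archive to create, e.g. "$BIGWORK/dataset.tar"
    tar -cf "$archive" -C "$src_dir" .
}

# Unpack the archive into a job-local scratch directory at job start.
unpack_dataset() {
    archive=$1
    scratch=$2      # assumption: a fast node-local directory, e.g. "$TMPDIR/dataset"
    mkdir -p "$scratch" && tar -xf "$archive" -C "$scratch"
}

# Once, before submitting:    pack_dataset ./many_small_files "$BIGWORK/dataset.tar"
# At the start of each job:   unpack_dataset "$BIGWORK/dataset.tar" "$TMPDIR/dataset"
```

Create the archive outside the directory you are packing, otherwise tar would try to include the archive in itself.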
My job starts, but then suddenly aborts
Use the checkquota command to verify that you are not over your soft limit (“Quota”). If you are, check that the grace period has not yet expired. Issues are usually marked in red instead of a green line. In case the grace period (the time you may overstep your soft limit) is over, you'll see a simple dash “-”; otherwise, the remaining time is displayed. The dash is only relevant if you are over quota. We recommend also checking the other quota-related FAQs.
I get kicked out of my connection when trying to load a larger model or to compute something
→ You are probably trying to run something on the login nodes, which are unsuitable for computations and anything larger. Therefore, there is a limit of 1800 cpu seconds, after which processes get terminated. 1800 cpu seconds should be enough for months of shell access, but will get used up quickly when you try to calculate something on multiple cpu cores. Read the documentation on how the cluster is intended to be used. As a first step, check out our OOD-Portal on https://login.cluster.uni-hannover.de and submit an interactive job. It is never a good idea to try anything more than small tests, editing or a (small) compilation on a login node, since they are shared between all users (simply try the command w on the shell prompt to see how many others are working on the same machine).
When trying to connect to the cluster, I get: "Missing font family helvetica, OpenGL might have problems on this machine"
→ You are probably using X2Go, and you did not check the “Full Installation” checkmark. Please reinstall X2Go and make sure all fonts are installed.
Optimizing your work
My jobs do not run, even though some nodes appear to be free - there must be something wrong with the cluster!
→ Check your resource requests and try to submit a very small and very short (!) job (e.g. 1 cpu core, 2 GB mem, 10 minutes, like salloc --nodes=1 --ntasks=1 --mem=2GB --time=10:00) to check whether the “problem” could be that the resources of the cluster simply are not infinitely large or whether there is a future reservation preventing your job from starting.
The cluster is a shared tool for hundreds of users. If you submit a job that requests e.g. 20 cores for four days, it will wait until the resources are available AND until its priority is sufficient so it becomes the next job for the scheduler to start. If you submit 20 jobs each requesting 20 cores, they will ALL wait until the resources for at least one of the jobs are free. Depending on total system load, that may take anything from a few minutes to several days; a job like the one above (20 cores) usually takes several hours until it can start. We have no influence on how many jobs our users submit, but the scheduler will always try to run at least one job of every user before turning to the next job of the same user.
That a particular node matching the requirements of your job appears to be free at a certain time does NOT automatically mean that the scheduler will pick or even consider your job as the next to run. As always, there are many factors to consider. One example would be that another job has a higher priority than yours (e.g. due to a longer wait time, or because you already run other jobs and that user does not yet have a running job). If that other job e.g. needs some more cores than yours, it will need to wait a little longer, until the additional resources become free.
So it may very well appear that your job could start immediately. But if it could not complete before the scheduler has gathered the additional resources needed to start the other, higher-priority job, it will not start.
The scheduler will only start jobs it knows are able to complete, and if your job does not fit or has a lower priority than someone else's, it will not get started. Jobs that fit immediately and tell the scheduler they can complete in less time than another job still has to wait for its resources may get started out of the usual order in a mechanism called backfill, but that also works according to the same priority rules as for any other job.
What can I do to get more jobs to run or to decrease wait times in the job queue?
- check whether you request an adequate wall time in your jobs (--time=…). The longer the time limit, the longer you will wait until the scheduler finds a window for your job (in general).
- if you are able to submit shorter jobs, you may fit into gaps between longer-running jobs.
- if you can submit jobs that need less than 12 hours, you may benefit from the so-called FCH clusters that we run for various institutes. They usually are reserved during office times (08.00-20.00 on work days), so they become available during the night, adding a lot of extra resources to the cluster. The same goes for jobs requesting less than 60 hours during weekends.
- if you are unhappy with the performance of a job, try running it on a specific partition and adapt it to the nodes in that partition. Check how many cores there are on a socket (lscpu or cat /proc/cpuinfo on that node) or how many NUMA domains there are, respectively, and try to first “fill sockets”, then fill nodes, and only thereafter request multiple nodes.
- usually a job that runs on one single node will run faster than a job using the same number of cpus distributed over several nodes due to less complexity in message passing. We see jobs that request several nodes, but only a few cpu cores on each node. That is usually both inefficient and also somewhat impolite to other users, who have optimized their jobs to request one full node. Imagine a bowl of apples, and you are taking one bite out of each apple instead of just eating one complete apple yourself while leaving the others to the other guests.
- requesting fewer cpus per node while requesting more nodes may possibly get your job started quicker, but it may then need more time to complete in total than if you were to run a job on a single node, since data transfers between nodes (inter-node) are usually slower than those within one node (intra-node). So that's usually inefficient for all.
- use lcpuarchs [-v[v]] to find out what cpu architectures the nodes have. Depending on which cpu features your computation can use, you may or may not benefit from using a newer cpu (some keywords: AVX, cache size, the flags section of lscpu). Some jobs use about the same amount of time running on an older cpu, so why not use the shorter waiting times of an older partition that is less used? Some simple tests will easily tell you whether you benefit at all from a newer cpu.
- the command scontrol show partition [<partitionname>] will show you which partitions are available in general or which nodes belong to a specific partition. scontrol show node [<nodename>] shows the resources available on all nodes (warning: long) or on a specific node. Of course, you can also have a look at our table.
- think about what bottlenecks your job has and how many cores will still result in a speed-up. Painting a small room using three workers may speed up the process, but 1000 workers in a small room will just step on each other's toes. Check Amdahl's law.
- check whether the bottleneck could be I/O (or memory) instead of cpu. “A cluster is a tool to turn a cpu-bound problem into an I/O-bound problem”, so check where your jobs take the most time.
- in case you need a GPU to run, check whether you accidentally request a specific type of GPU (like “v100”), when you just need “any” GPU. The more relaxed your constraints are, the easier it becomes for the scheduler to satisfy them, and since there are many types of GPUs in the cluster, specifying exactly one type will make you wait. much. longer.
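To put a number on the painters example from the list above: Amdahl's law estimates the speed-up on n cores as S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the runtime that can run in parallel. A small sketch using awk follows; amdahl is a made-up helper name, and p = 0.9 is an assumed value for illustration, not a measurement of any real job.

```shell
# Amdahl's law: speed-up S(n) = 1 / ((1 - p) + p / n)
# p = parallel fraction of the runtime (0..1), n = number of cores.
# "amdahl" is a hypothetical helper name.
amdahl() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

amdahl 0.9 3       # prints 2.50
amdahl 0.9 16      # prints 6.40
amdahl 0.9 1000    # prints 9.91
```

With 90 % parallel code, the speed-up can never exceed 10, no matter how many cores you request. To find p for your own job, time a few short test runs with different core counts.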
I have access to a partition of the FCH service (nodes my institute has bought and which are reserved exclusively for agreed times), so my jobs should run. But I still have to wait. There really IS something wrong with the cluster!
→ First check whether you are working in the middle of the night. FCH reservations usually cover working days; a typical example would be 08.00-20.00 Mon-Fri. That means that in the middle of the night or on weekends, you have no direct (absolute) priority. You still have a very high indirect priority because other users can only use your institute's partition when it's not exclusively reserved and thus, during the week, cannot run jobs there that would take longer than 12 hours to complete. Since your priority is absolute during daytime, and since you do not have that limit, you could, of course, have a job on your partition that runs well into the night, thus shortening the time available to others, against which they can do absolutely nothing. That means that your maximum waiting time will still usually just be a few hours (that is, if none of your colleagues with the same priority submits jobs to the same partition).
→ Check whether there could be a reservation for a scheduled lecture that prevents you from accessing the reservation. The command scontrol show reservations will help.
I get a message about the (external) email address I use to get status mails about my jobs
→ Sending lots of similar emails to external freemail providers leads to the uni-hannover.de domain being blocked by those sites, which in turn leads to almost nobody external receiving mail from LUH accounts or machines any more. Therefore, we ask you to use your institute's email account (the one ending in @xxxx.uni-hannover.de), and we also ask you NOT to set up an automatic forward to a Google, GMX or similar external freemailer account.
Problems with jobs or specific software requirements
What is the meaning of "Script is written in DOS/Windows text format"?
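→ A message like this usually means that the job script was saved with DOS/Windows line endings (CRLF instead of LF), typically because it was edited on Windows and then uploaded. A minimal sketch for checking and converting a file follows; has_crlf and to_unix are made-up helper names, and a tool like dos2unix does the same job where it is installed.

```shell
# Detect and remove DOS/Windows (CRLF) line endings in a job script.
# has_crlf / to_unix are hypothetical helper names.

# Succeeds (exit status 0) if the file contains a carriage return character.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of the file with all carriage return characters removed.
to_unix() {
    tr -d '\r' < "$1" > "$2"
}

# Example:
# has_crlf myjob.sh && to_unix myjob.sh myjob_unix.sh
```

Most editors on Windows can also save files with Unix line endings directly, which avoids the problem in the first place.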
Something in this very complex and quite expensive software-suite (that is installed on the cluster and typically used by experts) does not work as I want it, I fail to mention any details and I expect YOU to solve MY problem NOW
→ we will invariably point you to our How to get Support page.
My jobs have crashed, and the job status says something about "out of memory"
→ Use the command
sacct -j <myjobid> -o JobID,NodeList,ReqMem,AveRSS,MaxRSS,MaxVMSize,State
to find out how much memory you requested in your job and how much memory your job tried to use. If the number under MaxRSS is near or even above ReqMem, you should request more memory in your job, plus a little reserve.
VM = virtual memory. That usually is the memory that is allocated (but not necessarily used).
RSS = resident set size, that is memory that is allocated AND at least somehow used — usually in pieces of 4 kB, which is the standard memory page on Linux/x64.
When requesting memory, try to always leave at least 4 GB of memory to the operating system on each node your job will run on. You will also benefit if the system has memory for buffers etc. The table of available resources shows the physical memory, from which you'll need to subtract these 4 GB.
Attention: the sacct command above only tells you about the system memory usage of your job. If you run into OOM on a graphics card, you'll still have to adapt your usage, but that will not show using sacct.
If some of your jobs survive despite demanding too much memory, it is due to the limits not being enforced every few microseconds, since that would cost performance as well. So be kind to your colleagues on the same system. Due to the fact that things run in real-time, they may very well be the first victims at times if you occupy more than your share. So please take OOM seriously, such an event may even lead to a system crashing quite unexpectedly. See https://lwn.net/Articles/104185/ (do not read if you are afraid of flying, and rest assured that things do not work like that in airplanes).
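The comparison of MaxRSS against ReqMem described above can be scripted. The following is only a sketch: flag_tight_memory is a made-up helper name, the 90 % threshold is an arbitrary choice, and it assumes that ReqMem is a per-node value (it does not handle per-CPU memory requests). It expects lines of the form JobID|ReqMem|MaxRSS with values in megabytes, as sacct prints them with the options shown in the comment at the end.

```shell
# Flag job steps whose peak memory (MaxRSS) comes close to the request (ReqMem).
# Reads lines of the form JobID|ReqMem|MaxRSS on stdin, values in megabytes.
# flag_tight_memory is a hypothetical helper; the 90 % threshold is an arbitrary choice.
flag_tight_memory() {
    awk -F'|' '$3 != "" {
        req = $2 + 0; rss = $3 + 0      # "+ 0" drops unit suffixes such as "M"
        if (req > 0 && rss >= 0.9 * req)
            printf "%s: MaxRSS %d MB of %d MB requested, request more memory\n", $1, rss, req
    }'
}

# On the cluster (not run here):
# sacct -j <myjobid> -n -P --units=M -o JobID,ReqMem,MaxRSS | flag_tight_memory
```

Lines without a MaxRSS value (such as the summary line of a job) are skipped; no output means that no step came within 10 % of its request.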
My jobs sometimes crash, depending on which partition they run on, and I see "illegal instruction" messages in my job logs
→ This may be due to you trying to run software that was compiled to use newer cpu features (like AVX512) than are available on the partition the job ran on, e.g. the software was compiled for skylake, but the job ran on haswell. Use the command lcpuarchs -vv on a login node of the cluster to check the cpu generations available.
When compiling software yourself, keep in mind that we have several login nodes from various cpu generations. See Modules & Application Software, section Build Software from Source Code. The cpu architecture of the login node you compile on may automatically determine which platform your installer/make/… compiles for: some packages assume you want to install the software on the same architecture you want to run on, and this may of course fail if you use a newer login node to compile/install than the partition your jobs later run on. hostname shows you the name of the node you are on, lscpu shows you details about the cpus in place.
Use #SBATCH --partition= to only allow certain partitions, and use the table under Hardware specifications of cluster compute nodes to find out which cpu generation each of the major partitions runs.
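To check directly whether a node offers a particular cpu feature, you can look for the flag in /proc/cpuinfo (the same flags that lscpu lists). A minimal sketch follows; has_cpu_flag is a made-up helper name, and avx512f is used as an example because it is the base flag of the AVX-512 family.

```shell
# Check whether a CPU feature flag (e.g. avx512f) is listed in a cpuinfo file.
# has_cpu_flag is a hypothetical helper; the file defaults to /proc/cpuinfo.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # Take the first "flags" line and look for the flag as a whole word.
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw -- "$flag"
}

if has_cpu_flag avx512f; then
    echo "this node supports AVX-512 (avx512f)"
else
    echo "no avx512f here: code built for AVX-512 may die with 'illegal instruction'"
fi
```

Run it on the login node you compile on and, inside a short test job, on a node of the partition you want to run on; if the results differ, either compile for the older architecture or restrict your jobs to matching partitions.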
My jobs do not start immediately, even when I know the resources should be available (e.g. using an FCH partition)
As a contribution to power saving requests issued by the president of the LUH, we use an energy saving option in Slurm that automatically shuts down nodes that would be idle for more than a few minutes. Slurm boots these nodes once they are needed. During boot, these nodes show the status “POWERING_UP” and jobs waiting for the nodes receive the state “CF/CONFIGURING”.
When compiling, I get an error "unsupported option --add-needed"
That usually is a problem with using the “Gold” instead of the “BFD” linker (there's usually both flavours on a system). With gcc, the option to use is -fuse-ld=bfd.
A software package I use requires a newer libc (or another software component central to the running operating system), but when I want to install it as root, I only get "you are not in sudoers". Please give me root access/Please install that library!
→ No.
We have a highly modular software stack, and it is absolutely impossible to accommodate particular needs for particular versions of operating system libraries. To avoid everybody constantly changing the systems, we limit what you can do. Installing a different version of a very common library would break quite a lot of things, so that is also completely out of the question. But you usually can use a container to get your software to run; for details, please refer to Modules & Application Software, section Singularity Containers.