→ Go to your project management (usually people at your own institute), who will help you with that. We cannot and will not reset your password; that is entirely the project management's responsibility.
→ Talk to the person at your former institute who provided you with your account credentials.
Background: accounts on the cluster are not created by us, but by the person responsible for managing the institute's compute project with us. They should have deactivated your account when you left the institute, which also removes you from our mailing lists. In addition, the email addresses that these removal requests come from are usually NOT on any of our lists. The senders only get our email because messages addressed to their former institute's mail accounts are still being forwarded and the account has not been properly deactivated/removed.
→ First check whether you get any messages at all or if the system appears totally silent. If that is the case, check whether you are possibly “outside” of the LUH network. You'd then first need to connect via VPN.
→ If you can connect to the cluster from a shell (command line) via ssh, but are denied login graphically via X2Go, chances are high that you are over your quota (maximum disk space allocation) in the HOME directory and/or your grace period has expired. Use the command “checkquota” on the command line to verify this and/or see the corresponding FAQ about “Disk Quota Exceeded”.
→ Use the command “checkquota” on the command line to find out which file system is affected. Read the documentation about the file systems provided by the cluster and how to use them to understand which one should be used for what purpose. Then delete/move files to get below your quota limits and try again. DO NOT simply delete files or directories in your HOME before understanding what they do; in particular, do not simply delete entries like .ssh, .x2go or .bashrc. Take care in other directories like .cache/sessions. You will need to analyze how these files got there (the initial setup of a new account only needs a few kilobytes, and it is very easy to stay below a few megabytes in your HOME and run perfectly fine, so large amounts of data will have been put there by you, usually using conda or pip). Otherwise you will shoot yourself in the other foot, too, and you will get stuck again asking for help. See also the next FAQ, “where actually are these files?”.
→ Use the command
du -mad1 | sort -n in the file system for which checkquota reports the problem. You can quickly change to the file system using the environment variables we automatically set for you,
cd moves to your $HOME,
cd $BIGWORK and
cd $PROJECT to the other two major file systems.
du = “check disk usage”,
-m = “show megabyte units” (to enable numerical sorting afterwards),
-a = “show all files, including those starting with a dot” (those are otherwise “hidden”, and some applications like e.g. Ansys use dot-directories e.g. for temporary files),
-d1 = “only show files up to a depth of one (this level)” (otherwise the command would descend down the directory tree).
| = “pass the output of du to sort”, which sorts what it gets numerically (-n).
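If the listing gets long, the pipeline above can be wrapped in a small shell function that only shows the largest entries. This is just a sketch: the function name biggest_entries and the default of five entries are made up for illustration and are not part of the cluster setup.

```shell
# Hypothetical helper: list the N largest entries (in MB) directly below a directory.
biggest_entries() {
    dir="${1:-.}"      # directory to inspect, defaults to the current one
    count="${2:-5}"    # number of entries to show, defaults to 5 (arbitrary choice)
    # Same pipeline as above; tail keeps only the largest entries.
    # The total for the directory itself (".") is usually the last line.
    (cd "$dir" && du -mad1 | sort -n | tail -n "$count")
}
```

For example, biggest_entries "$HOME" 10 would show the ten largest entries in your HOME, smallest first.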
→ You are probably trying to run something on the login nodes, which are unsuitable for computations and anything larger. Therefore, there is a limit of 1800 cpu seconds, after which processes get terminated. 1800 cpu seconds should be enough for months of shell access, but will get used up quickly when you try to calculate something on multiple cpu cores. Read the documentation on how the cluster is intended to be used. As a first start, check out our OOD portal at https://login.cluster.uni-hannover.de and submit an interactive job. It is never a good idea to try anything more than small tests, editing or a (small) compilation on a login node, since they are shared between all users (simply try the command
w on the shell prompt to see how many others are working on the same machine).
→ You are probably using X2Go, and you did not tick the “Full Installation” checkbox. Please reinstall X2Go and make sure all fonts are installed.
→ Check your resource requests and try to submit a very small and very short (!) job (e.g. 1 cpu core, 2 GB mem, 10 minutes, like
salloc --nodes=1 --ntasks=1 --mem=2GB --time=10:00) to check whether the “problem” could be that the resources of the cluster simply are not infinitely large or whether there is a future reservation preventing your job from starting.
The cluster is a shared tool for hundreds of users. If you submit a job that requests e.g. 20 cores for four days, it will wait until the resources are available AND until its priority is sufficient so it becomes the next job for the scheduler to start. If you submit 20 jobs each requesting 20 cores, they will ALL wait until the resources for at least one of the jobs are free. Depending on total system load, that may take anything from a few minutes to several days; usually a job like the one above (20 cores) may take several hours until it can start. We have no influence on how many jobs our users submit, but the scheduler will always try to run at least one job of every user before turning to the next job of the same user.
cat /proc/cpuinfo on that node) or how many NUMA domains there are, respectively, and try to first “fill sockets”, then fill nodes, only thereafter request multiple nodes.
scontrol show partition [<partitionname>] will show you which partitions are available in general or which nodes belong to a specific partition.
scontrol show node [<nodename>] shows resources available on all nodes (warning: long) or on a specific node. Of course, you can also have a look at our table.
→ First check whether you are working in the middle of the night. FCH reservations usually cover working days; a typical example would be 08:00-20:00, Mon-Fri. That means that in the middle of the night or on weekends, you have no direct (absolute) priority. You still have a very high indirect priority, because other users can only use your institute's partition when it is not exclusively reserved and thus, during the week, cannot run jobs there that would take longer than 12 hours to complete. Since your priority is absolute during daytime, and since you do not have that limit, you could of course have a job on your partition that runs well into the night, thus shortening the time available to others, against which they can do absolutely nothing. That means that your maximum waiting time will still usually be just a few hours (that is, if none of your colleagues with the same priority submits jobs to the same partition).
→ Check whether there could be a reservation for a scheduled lecture that prevents you from accessing the reservation. The command
scontrol show reservations will help.
Unfortunately, we can't do that. Lustre in the current configuration is good at scaling to large bandwidths, but very weak with many very small files, since that is a load the metadata server (MDS) of the file system has to carry alone. This file system is just not suitable for many small files (much file system hardware is severely bottlenecked when it comes to IOPS, but we hope the situation will get better with the next file system). The result would be both slow performance and decreased overall system stability. A possible workaround is to pack all the small files into one larger dataset, either with a suitable format like HDF5 or using good old tar to create a large and possibly striped file on BIGWORK, and then to quickly extract that file within your job to a (hopefully fast) local scratch SSD directory.
→ Sending lots of similar emails to external freemail providers leads to the uni-hannover.de domain being blocked by those sites, which in turn leads to almost nobody external receiving mail from LUH accounts or machines any more. Therefore, we ask you to use your institute's email account (the one ending in @xxxx.uni-hannover.de), and we also ask you NOT to set up an automatic forward to a Google, GMX or similar external freemailer account.
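The tar workaround could look roughly like the following sketch. The function names pack_dataset and unpack_dataset and all paths are made up for illustration; striping the archive on Lustre is not shown here.

```shell
# Hypothetical helper: pack a directory of many small files into a single tar archive.
pack_dataset() {
    src_dir="$1"    # directory containing the small files
    archive="$2"    # target archive, e.g. somewhere below $BIGWORK
    # -C changes into the parent directory, so the archive only holds relative paths.
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Hypothetical helper: extract the archive into a scratch directory from within a job.
unpack_dataset() {
    archive="$1"
    scratch_dir="$2"   # assumed to be a fast node-local directory
    mkdir -p "$scratch_dir" && tar -xf "$archive" -C "$scratch_dir"
}
```

You would pack once, e.g. pack_dataset "$BIGWORK/many_small_files" "$BIGWORK/dataset.tar", and then call something like unpack_dataset "$BIGWORK/dataset.tar" "$TMPDIR" at the start of each job. That $TMPDIR points to a node-local scratch directory is an assumption here; check the file system documentation for the correct location.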
→ Use the command
sacct -j <myjobid> -o JobID,NodeList,ReqMem,AveRSS,MaxRSS,MaxVMSize,State
to find out how much memory you requested in your job and how much memory your job tried to use. If the number under MaxRSS is near or even above ReqMem, you should request more memory in your job, plus a little reserve.
VM = virtual memory. That usually is the memory that is allocated (but not necessarily used).
RSS = resident set size, that is memory that is allocated AND at least somehow used — usually in pieces of 4 kB, which is the standard memory page on Linux/x64.
When requesting memory, try to always leave at least 4 GB of memory to the operating system on each node your job will run on. You will also benefit if the system has memory for buffers etc. The table of available resources shows the physical memory, from which you'll need to subtract these 4 GB.
Attention: the sacct command above only tells you about the system memory usage of your job. If you run into OOM on a graphics card, you'll still have to adapt your usage, but that will not show using sacct.
If some of your jobs survive despite demanding too much memory, it is due to the limits not being enforced every few microseconds, since that would cost performance as well. So be kind to your colleagues on the same system. Due to the fact that things run in real-time, they may very well be the first victims at times if you occupy more than your share. So please take OOM seriously, such an event may even lead to a system crashing quite unexpectedly. See https://lwn.net/Articles/104185/ (do not read if you are afraid of flying, and rest assured that things do not work like that in airplanes).
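To turn a MaxRSS value into a new memory request “plus a little reserve”, a small helper like the following sketch may be useful. The function name suggest_mem_mb and the default reserve of 20 % are arbitrary choices for illustration. The sketch assumes that the value carries a K, M or G suffix and treats a value without suffix as bytes; please compare with what sacct actually prints for your job.

```shell
# Hypothetical helper: turn a MaxRSS value such as 1536000K, 1500M or 2G into a
# suggested memory request in MB, adding a reserve (20 % by default, an arbitrary choice).
suggest_mem_mb() {
    maxrss="$1"
    reserve_percent="${2:-20}"
    awk -v v="$maxrss" -v r="$reserve_percent" 'BEGIN {
        unit = substr(v, length(v), 1)    # last character, possibly a unit suffix
        num = v + 0                       # awk uses the leading number of the string
        if (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else mb = num / (1024 * 1024)     # assumption: no suffix means bytes
        mb = mb * (100 + r) / 100         # add the reserve
        printf "%d\n", (mb == int(mb)) ? mb : int(mb) + 1   # round up to whole MB
    }'
}
```

For example, suggest_mem_mb 1536000K prints 1800, which you could then request as 1800 MB in your next job.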
→ This may be due to you trying to run software that was compiled to use newer cpu features (like AVX512) than are available on the partition the job ran on, e.g. the software was compiled for skylake, but the job ran on haswell. Use the command
lcpuarchs -vv on a login node of the cluster to check the cpu generations available.
When compiling software yourself, keep in mind that we have several login nodes from various cpu generations. See Modules & Application Software, Build Software from Source Code. The cpu architecture of the login node you compile on may automatically determine which platform your installer/make/… compiles for: some packages assume you want to install the software on the same architecture you want to run on, and this may of course fail if you use a newer login node to compile/install than the partition your jobs later run on.
hostname shows you the name of the node you are on,
lscpu shows you details about the cpus in place.
#SBATCH --partition= can be used to only allow certain partitions; use the table under Hardware specifications of cluster compute nodes to find out which cpu generation the major partitions run on.
As a contribution to the urgent power saving request issued by the president of the LUH, we have switched on an energy saving option in Slurm which shuts down nodes that would be idle for more than a few minutes. Slurm automatically boots these nodes when they are needed again. During boot, these nodes show the status “POWERING_UP”, and jobs waiting for the nodes show “CF/CONFIGURING”.
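To check directly on a node whether a specific instruction set is available, you can look at the cpu flags the Linux kernel reports in /proc/cpuinfo. The following is only a sketch; the function name has_cpu_flag is made up. The flag avx512f is how Linux lists the AVX-512 Foundation instructions.

```shell
# Hypothetical helper: succeed if the given cpu flag is listed in /proc/cpuinfo
# (or in another file of the same format passed as second argument).
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Look at the first "flags" line only; -w matches whole words,
    # so asking for "avx" does not match "avx2".
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}
```

For example, has_cpu_flag avx512f && echo "AVX512 available" could be run on a login node or at the beginning of a job script to see what the node offers.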
That usually is a problem with using the “Gold” instead of the “BFD” linker (there's usually both flavours on a system). With gcc, the option to use is -fuse-ld=bfd.