Please note: SLURM currently only manages GPU and Amo sub-cluster nodes. If you want to use other sub-clusters, please refer to PBS.
The scientific computing team at the Leibniz Universität IT-Services (LUIS) is currently preparing to switch all cluster computing systems from the software package used for the past 15 years, Torque/Maui, to a more modern system, the SLURM computing resource manager. The transition will take place by gradually moving parts of the existing system to the new scheduler. In a first step, SLURM will manage only the GPGPU components of the cluster. However, the complete cluster, including the Forschungscluster-Housing nodes, has to be migrated to SLURM by the end of this year (2020). Over the coming months, we will integrate information about the usage and concepts of the new system into the usual introductory presentations. News will, as usual, be announced via the cluster news mailing list.
SLURM (Simple Linux Utility for Resource Management) is a free open-source batch scheduler and resource manager that allows users to run their jobs on the LUIS compute cluster. It is a modern, extensible batch system that is installed around the world on many clusters of various sizes. This chapter describes the basic tasks necessary for submitting, running and monitoring jobs under the SLURM Workload Manager on the LUIS cluster. More detailed information about SLURM is provided by the official SLURM website.
The following commands are useful to interact with SLURM:

| Command | Purpose |
|---|---|
| sbatch | Submit a job script for batch execution |
| srun | Run a parallel application or job step |
| salloc | Allocate resources for an interactive job |
| squeue | Display the status of jobs and job steps |
| sinfo | Display information about nodes and partitions |
| scontrol | View or modify the state of jobs, nodes and partitions |
| scancel | Cancel jobs or send signals to them |
| sacct | Display accounting data for jobs |
| sstat | Display status information of running jobs |
| sacctmgr | Display account and limit information |
Some usage examples for these commands are provided below. For more information on each command, refer to the corresponding manual page (e.g., man squeue) or, of course, to the SLURM manual's website.
In SLURM, compute nodes are grouped into partitions. Each partition can be regarded as an independent queue, even though a job may be submitted to multiple partitions, and a compute node may belong to several partitions simultaneously. A job is an allocation of resources within a single partition for executing tasks on the cluster for a specified period of time. The concept of "job steps" is used to execute several tasks simultaneously or sequentially via the srun command within a job.
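As a minimal sketch of this concept (the program names are placeholders, not part of the official examples), a job script could run two job steps one after the other:

```bash
#!/bin/bash -l
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# each srun call below launches one job step on the allocated resources
srun ./preprocess_data    # job step 0
srun ./run_simulation     # job step 1
```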
The table below lists the currently defined partitions and their parameter constraints. The limits shown cannot be overruled by users.
| Partition name | Max Job Runtime | Max Nodes Per Job | Max CPUs per User | Default Runtime | Default Memory per CPU | Shared Node Usage |
|---|---|---|---|---|---|---|
| gpu | 24 hours | 1 | – | 1 hour | 1600 MB | yes |
| amo | 200 hours | – | 800 | 24 hours | 4000 MB | yes |
To control the job workload on the cluster and keep SLURM responsive, we enforce the following restrictions regarding the number of jobs:
| SLURM limits | Max Number of Running Jobs | Max Number of Submitted Jobs |
|---|---|---|
In case you need custom limits for a certain time, please submit a request containing a short explanation to email@example.com. Based on available resources and in keeping with maintaining a fair balance between all users, we may be able to accommodate special needs for a limited time.
To list the job limits relevant for you, use the sacctmgr command. For example:

```
sacctmgr -s show user
sacctmgr -s show user format=user,account,maxjobs,maxsubmit,maxwall,qos
```
Up-to-date information on all available nodes may be obtained using the following commands:
```
sinfo -Nl
scontrol show nodes
```
Information on available partitions and their configuration:

```
sinfo -s
scontrol show partitions
```
Batch submission is the most common and most efficient way to use the computing cluster. Interactive jobs are also possible and may be useful, for example, for testing and debugging your code before submitting longer production runs.
You can start an interactive session on a compute node with srun. The following example submits an interactive job that requests two tasks (corresponding to two CPU cores) and 4 GB of memory per core for one hour:
```
[user@login02 ~]$ srun --time=1:00:00 --ntasks=2 --mem-per-cpu=4G --x11 --pty $SHELL -l
srun: job 222 queued and waiting for resources
srun: job 222 has been allocated resources
[user@euklid-n001 ~]$
```
Once the job starts, you will get an interactive shell on the first compute node assigned to the job (euklid-n001 in the example above). The option --x11 sets up X11 forwarding on this first node, enabling the use of graphical applications. The interactive session is terminated by exiting the shell.
An interactive session with GPU resources has to be started using the salloc command. The following example allocates two GPUs per node for 2 hours:
```
[user@login02 ~]$ salloc --time=2:00:00 --gres=gpu:2
salloc: Granted job allocation 228
salloc: Waiting for resource configuration
salloc: Nodes euklid-n002 are ready for job
```
Once an allocation has been made, the salloc command starts a shell on the login node where the submission was done. To start your application on the assigned compute nodes (euklid-n002 in this example), you can either execute the srun command in the login shell:
```
[user@login02 ~]$ module load my_module
[user@login02 ~]$ srun ./my_program
```
… or connect to the allocated compute nodes via ssh and start the application there:
```
[user@login02 ~]$ echo $SLURM_NODELIST   # assigned compute node(s)
euklid-n002
[user@login02 ~]$ ssh euklid-n002
[user@euklid-n002 ~]$ module load my_module
[user@euklid-n002 ~]$ ./my_program
```
To terminate the session, type exit in the login shell:
```
[user@login02 ~]$ exit
exit
salloc: Relinquishing job allocation 228
salloc: Job allocation 228 has been revoked.
```
An appropriate SLURM job submission file for your job is a shell script with a set of directives at the beginning of the file. These directives are issued by starting a line with the string #SBATCH. A suitable batch script is then submitted to the batch system using the sbatch command.
The following is an example of a simple serial job script (save the lines to a file, e.g. test_serial.sh; the name is arbitrary). Note: adapt the #SBATCH directives to your use case where applicable.
```bash
#!/bin/bash -l
#SBATCH --job-name=test_serial
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:20:00
#SBATCH --constraint=[skylake|haswell]
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output test_serial-job_%j.out
#SBATCH --error test_serial-job_%j.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Load modules
module load my_module

# Start my serial app
srun ./my_serial_app
```
To submit the batch job, use the sbatch command.
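For example, if you saved the script to the file test_serial.sh suggested above:

$ sbatch test_serial.sh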
Note: as soon as compute nodes are allocated to your job, you can establish an
ssh connection from the login machines to these nodes.
Note: if your job uses more resources than defined with the
#SBATCH directives, the job will automatically be killed by the SLURM server.
Note: we recommend that you submit sbatch jobs with the
#SBATCH --export=NONE option to establish a clean environment; otherwise SLURM will propagate your current environment variables to the job.
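A minimal sketch of both variants (script name as chosen above):

```
# as a directive in the job script
#SBATCH --export=NONE

# or on the sbatch command line
$ sbatch --export=NONE test_serial.sh
```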
The table below shows frequently used sbatch options that can either be specified in your job script with the
#SBATCH directive or on the command line. Command line options override options in the script. The commands
srun and salloc accept the same set of options. Both long and short options are listed.
| Option | Default | Description |
|---|---|---|
| --nodes=<N> or -N <N> | 1 | Number of compute nodes |
| --ntasks=<N> or -n <N> | 1 | Number of tasks to run |
| --cpus-per-task=<N> or -c <N> | 1 | Number of CPU cores per task |
| --ntasks-per-node=<N> | 1 | Number of tasks per node |
| --ntasks-per-core=<N> | 1 | Number of tasks per CPU core |
| --mem-per-cpu=<mem> | partition dependent | Memory per CPU core in MB |
| --mem=<mem> | partition dependent | Memory per node in MB |
| --gres=gpu:<type>:<N> | - | Request nodes with GPUs |
| --time=<time> or -t <time> | partition dependent | Walltime limit for the job |
| --partition=<name> or -p <name> | none | Partition to run the job |
| --constraint=<list> or -C <list> | none | Node features to request |
| --job-name=<name> or -J <name> | job script's name | Name of the job |
| --output=<path> or -o <path> | slurm-%j.out | Standard output file |
| --error=<path> or -e <path> | slurm-%j.err | Standard error file |
| --mail-user=<mail> | your account mail | User's email address |
| --mail-type=<mode> | - | Event types for notifications |
| --exclusive | nodes are shared | Exclusive access to node |
To obtain a complete list of parameters, refer to the sbatch man page (man sbatch).
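Since command line options override the corresponding #SBATCH directives, the same script can, for example, be resubmitted with a different walltime (script name as chosen above):

$ sbatch --time=01:00:00 test_serial.sh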
Note: if you submit a job with
--mem=0, it gets access to the complete memory of each allocated node.
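For example (script name as above):

$ sbatch --mem=0 test_serial.sh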
By default, the stdout and stderr streams of batch jobs are directed to the files slurm-%j.out and slurm-%j.err, where %j is set to the SLURM batch job ID number of your job. Both files will be found in the directory in which you launched the job. You can use the options --output and --error to specify a different name or location. The output files are created as soon as your job starts, and output is redirected as the job runs so that you can monitor your job's progress. However, since SLURM performs file buffering, the output of your job will not appear in the output files immediately. To override this behaviour (not recommended in general, especially when the job output is large), you may use --unbuffered either as an #SBATCH directive or directly on the sbatch command line.
If the option --error is not specified, both stdout and stderr will be directed to the file specified by --output.
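For example, the following directives, with file names chosen purely for illustration, write both streams to custom files and use %j to keep different jobs apart:

```
#SBATCH --output my_analysis_%j.out
#SBATCH --error my_analysis_%j.err
```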
For OpenMP jobs, you will need to set --cpus-per-task to a value larger than one and explicitly define the OMP_NUM_THREADS variable. The example script below launches eight threads, each with 2 GiB of memory, for a maximum run time of 30 minutes.
```bash
#!/bin/bash -l
#SBATCH --job-name=test_openmp
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00
#SBATCH --constraint=[skylake|haswell]
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output test_openmp-job_%j.out
#SBATCH --error test_openmp-job_%j.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Bind your OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Intel specific environment variables
export KMP_AFFINITY=verbose,granularity=core,compact,1
export KMP_STACKSIZE=64m

# Load modules
module load my_module

# Start my application
srun ./my_openmp_app
```
The srun command in the script above sets up a parallel runtime environment to launch an application on multiple CPU cores, but on a single node. For MPI jobs, you may want to use CPU cores on multiple nodes; to achieve this, have a look at the following example of an MPI job. Note that srun should be used in place of "traditional" MPI launchers such as mpirun or mpiexec. This example requests 10 compute nodes on the lena cluster with 16 cores each and 320 GiB of memory in total for a maximum duration of 2 hours.
```bash
#!/bin/bash -l
#SBATCH --job-name=test_mpi
#SBATCH --partition=lena
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2G
#SBATCH --time=02:00:00
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output test_mpi-job_%j.out
#SBATCH --error test_mpi-job_%j.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Load modules
module load foss/2018b

# Start my MPI application
srun --cpu_bind=cores --distribution=block:cyclic ./my_mpi_app
```
As mentioned above, you should use the
srun command instead of
mpiexec in order to launch your parallel application.
Within the same MPI job, you can use srun to start several parallel applications, each utilizing only a subset of the allocated resources. However, the preferred way to do this is a job array (see section ). The following example script runs 3 MPI applications simultaneously, each using 64 tasks (4 nodes with 16 cores each), for a total of 192 tasks:
```bash
#!/bin/bash -l
#SBATCH --job-name=test_mpi
#SBATCH --partition=lena
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:02:00
#SBATCH --constraint=[skylake|haswell]
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output test_mpi-job_%j.out
#SBATCH --error test_mpi-job_%j.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Load modules
module load foss/2018b

# Start my MPI applications, each on its own subset of nodes
srun --cpu_bind=cores --distribution=block:cyclic -N 4 --ntasks-per-node=16 ./my_mpi_app_1 &
srun --cpu_bind=cores --distribution=block:cyclic -N 4 --ntasks-per-node=16 ./my_mpi_app_2 &
srun --cpu_bind=cores --distribution=block:cyclic -N 4 --ntasks-per-node=16 ./my_mpi_app_3 &
wait
```
Note the wait command at the end of the script: it makes the script wait for all commands previously started in the background with & to finish before the job can complete. Please take care that the times needed to complete the individual subjobs do not differ too much, so that valuable CPU time is not wasted.
Job arrays can be used to submit a number of jobs with the same resource requirements (some of these requirements can still be changed after the job has been submitted). To create a job array, specify the directive #SBATCH --array in your job script or use the option -a on the sbatch command line. For example, the following script will create 12 jobs with array indices 1 to 10, 15 and 18:
```bash
#!/bin/bash -l
#SBATCH --job-name=test_job_array
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:20:00
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --array=1-10,15,18
#SBATCH --output test_array-job_%A_%a.out
#SBATCH --error test_array-job_%A_%a.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Load modules
module load my_module

# Start my app
srun ./my_app $SLURM_ARRAY_TASK_ID
```
Within a job script like the example above, the job array indices can be accessed via the variable $SLURM_ARRAY_TASK_ID, whereas the variable $SLURM_ARRAY_JOB_ID refers to the job array's master job ID. If you need to limit the maximum number of simultaneously running jobs in a job array (e.g. due to heavy I/O on the BIGWORK file system), use the % separator. For example, the directive #SBATCH --array=1-50%5 will create 50 jobs, with only 5 jobs active at any given time.
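A common pattern, sketched here with hypothetical file names, is to use the array index to select a separate input file for each array element:

```bash
# inside the job script: each array element processes its own input file
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
srun ./my_app $INPUT
```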
Note: the maximum number of jobs in a job array is limited to 100.
SLURM sets many variables in the environment of the running job on the allocated compute nodes. The table below shows commonly used environment variables that might be useful in your job scripts. For a complete list, see the "OUTPUT ENVIRONMENT VARIABLES" section in the sbatch man page.
SLURM environment variables
| Variable | Description |
|---|---|
| $SLURM_JOB_NUM_NODES | Number of nodes assigned to the job |
| $SLURM_JOB_NODELIST | List of nodes assigned to the job |
| $SLURM_NTASKS | Number of tasks in the job |
| $SLURM_NTASKS_PER_CORE | Number of tasks per allocated CPU |
| $SLURM_NTASKS_PER_NODE | Number of tasks per assigned node |
| $SLURM_CPUS_PER_TASK | Number of CPUs per task |
| $SLURM_CPUS_ON_NODE | Number of CPUs per assigned node |
| $SLURM_SUBMIT_DIR | Directory the job was submitted from |
| $SLURM_ARRAY_JOB_ID | Job ID of the array |
| $SLURM_ARRAY_TASK_ID | Job array index value |
| $SLURM_ARRAY_TASK_COUNT | Number of jobs in a job array |
| $SLURM_GPUS | Number of GPUs requested |
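For illustration, a job script can echo some of these variables to its output file to record what was actually allocated; a minimal sketch:

```bash
# record the allocation in the job output
echo "Job $SLURM_JOB_ID running on: $SLURM_JOB_NODELIST"
echo "Tasks: $SLURM_NTASKS, CPUs per task: $SLURM_CPUS_PER_TASK"
```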
The LUIS cluster has a number of nodes that are equipped with NVIDIA Tesla GPU Cards.
Currently, 4 Dell nodes, each containing 2 NVIDIA Tesla V100 cards, are available for general use in the gpu partition. Use the following command to display the current status of all nodes in the gpu partition and the computing resources they provide, including the type and number of installed GPUs:
```
$ sinfo -p gpu -NO nodelist:15,memory:8,disk:10,cpusstate:15,gres:20,gresused:20
NODELIST        MEMORY  TMP_DISK  CPUS(A/I/O/T)  GRES                 GRES_USED
euklid-n001     128000  291840    32/8/0/40      gpu:v100:2(S:0-1)    gpu:v100:2(IDX:0-1)
euklid-n002     128000  291840    16/24/0/40     gpu:v100:2(S:0-1)    gpu:v100:1(IDX:0)
euklid-n003     128000  291840    0/40/0/40      gpu:v100:2(S:0-1)    gpu:v100:0(IDX:N/A)
euklid-n004     128000  291840    0/40/0/40      gpu:v100:2(S:0-1)    gpu:v100:0(IDX:N/A)
```
To request a GPU resource, you need to add the directive #SBATCH --gres=gpu:<type>:n to your job script or pass the corresponding option on the command line. Here, "n" is the number of GPUs you want to request, and the GPU type (<type>) may be omitted. The following job script requests 2 Tesla V100 GPUs and 8 CPUs in the gpu partition for 30 minutes of wall time:
```bash
#!/bin/bash -l
#SBATCH --job-name=test_gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:v100:2
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --output test_gpu-job_%j.out
#SBATCH --error test_gpu-job_%j.err

# Change to my work dir
cd $SLURM_SUBMIT_DIR

# Load modules
module load fosscuda/2018b

# Run GPU application
srun ./my_gpu_app
```
When submitting a job to the gpu partition, you must specify the number of GPUs; otherwise, your job will be rejected at submission time.
Note: on the Tesla V100 nodes, you may currently only request up to 20 CPU cores for each requested GPU.
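To check which GPUs a job actually sees, you can, for example, run nvidia-smi in a short interactive job on the gpu partition (a sketch; the resources are chosen to satisfy the rules above):

$ srun --partition=gpu --gres=gpu:1 --pty nvidia-smi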
This section provides an overview of commonly used SLURM commands that allow you to monitor and manage the status of your batch jobs.
The status of your jobs in the queue can be queried using

$ squeue

or, if you have array jobs and want to display one job array element per line,

$ squeue -r
Note that the symbol $ in the above commands and all other commands below represents the shell prompt; it is NOT part of the specified command, so do NOT type it yourself.
The squeue output should look more or less like the following:
```
$ squeue
  JOBID PARTITION  NAME      USER ST   TIME NODES NODELIST(REASON)
    412       gpu  test  username PD   0:00     1 (Resources)
    420       gpu  test  username PD   0:00     1 (Priority)
    422       gpu  test  username  R  17:45     1 euklid-n001
    431       gpu  test  username  R  11:45     1 euklid-n004
    433       gpu  test  username  R  12:45     1 euklid-n003
    434       gpu  test  username  R   1:08     1 euklid-n002
    436       gpu  test  username  R  16:45     1 euklid-n002
```
The column ST shows the status of your job, and JOBID is the number the system uses to keep track of your job. NODELIST shows the nodes allocated to the job and NODES the number of nodes requested; for jobs in the pending state (PD), NODELIST(REASON) gives the reason why the job is waiting. TIME shows the time used by the job. Typical job states are RUNNING (R), PENDING (PD), COMPLETING (CG) and SUSPENDED (S). For a complete list, see the "JOB STATE CODES" section in the squeue man page.
You can change the default output format and display other job specifications using the option -o. For example, if you want to additionally view the number of CPUs, the requested GPUs and memory, and the walltime:
```
$ squeue --format="%.7i %.9P %.5D %.5C %.13b %.2t %.11m %.8M %.10l %R"
  JOBID PARTITION NODES  CPUS TRES_PER_NODE ST  MIN_MEMORY     TIME TIME_LIMIT NODELIST(REASON)
    489       gpu     1    32         gpu:2 PD          2G     0:00      20:00 (Resources)
    488       gpu     1     8         gpu:1 PD          2G     0:00      20:00 (Priority)
    484       gpu     1    40         gpu:2  R          1G    16:45      20:00 euklid-n001
    487       gpu     1    32         gpu:2  R          2G    11:09      20:00 euklid-n004
    486       gpu     1    32         gpu:2  R          2G    12:01      20:00 euklid-n003
    485       gpu     1    16         gpu:2  R          1G    16:06      20:00 euklid-n002
```
Note that you can make the squeue output format permanent by assigning the format string to the environment variable SQUEUE_FORMAT in your ~/.bashrc file:
$ echo 'export SQUEUE_FORMAT="%.7i %.9P %.5D %.5C %.13b %.2t %.19S %.8M %.10l %R"'>> ~/.bashrc
The format specification %.13b in the variable assignment for SQUEUE_FORMAT above adds the column TRES_PER_NODE to the squeue output, which shows the number of GPUs requested by each job.
The following command displays all job steps (processes started using srun) of your jobs:
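$ squeue -s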
To display estimated start times and compute nodes to be allocated for your pending jobs, type
```
$ squeue --start
  JOBID PARTITION  NAME      USER ST           START_TIME NODES SCHEDNODES  NODELIST(REASON)
    489       gpu  test  username PD  2020-03-20T11:50:09     1 euklid-n001 (Resources)
    488       gpu  test  username PD  2020-03-20T11:50:48     1 euklid-n002 (Priority)
```
A job may be waiting for execution in the pending state for a number of reasons. If there are multiple reasons for the job to remain pending, only one is displayed.
For the complete list, refer to the section "JOB REASON CODES" in the squeue man page.
If you want to view more detailed information about each job, use
$ scontrol -d show job
If you are interested in the detailed status of one specific job, use
$ scontrol -d show job <job-id>
Replace <job-id> with the ID of your job.
Note that the command
scontrol show job will display the status of jobs for up to 5 minutes after their completion. For batch jobs that finished more than 5 minutes ago, you need to use the
sacct command to retrieve their status information from the SLURM database (see section ).
The sstat command provides real-time status information (e.g. CPU time, virtual memory (VM) usage, resident set size (RSS), disk I/O, etc.) for running jobs:
```
# show all status fields
sstat -j <job-id>

# show selected status fields
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <job-id>
```
Note: the above commands only display your own jobs in the SLURM job queue.
The following command cancels the job with ID number <job-id>:
$ scancel <job-id>
Remove all of your jobs from the queue at once using
$ scancel -u $USER
If you want to cancel only the array element <array_id> of the job array <job_id>, type
$ scancel <job_id>_<array_id>
If only the job array ID is specified in the above command, all elements of the job array will be canceled.
The commands above first send a SIGTERM signal, then wait 30 seconds, and if processes from the job continue to run, issue a SIGKILL signal. The -s option allows you to send an arbitrary signal to a running job, which means you can directly communicate with the job from the command line, provided that it has been prepared for this:
$ scancel -s <signal> <job-id>
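For example, to send the signal USR1 to a job that is prepared to catch it:

$ scancel -s USR1 <job-id>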
A job in the pending state can be held (prevented from being scheduled) using
$ scontrol hold <job-id>
To release a previously held job, type
$ scontrol release <job-id>
After submitting a batch job and while the job is still in the pending state, many of its specifications can be changed. Typical fields that can be modified include job size (amount of memory, number of nodes, cores, tasks and GPUs), partition, dependencies and wall clock limit. Here are a few examples:
```
# modify time limit
scontrol update JobId=279 TimeLimit=12:0:0

# change number of tasks
scontrol update JobId=279 NumTasks=80

# change node number
scontrol update JobId=279 NumNodes=2

# change the number of GPUs per node
scontrol update JobId=279 Gres=gpu:2

# change memory per allocated CPU
scontrol update JobId=279 MinMemoryCPU=4G

# change the number of simultaneously running jobs of array job 280
scontrol update JobId=280 ArrayTaskThrottle=8
```
For a complete list of job specifications that can be modified, see section “SPECIFICATIONS FOR UPDATE COMMAND, JOBS” in the
scontrol man page.
The sacct command displays the accounting data for active and completed jobs, which is stored in the SLURM database. Here are a few usage examples:
```
# list IDs of all your jobs since January 2019
sacct -S 2019-01-01 -o jobid

# show brief accounting data of the job with <job-id>
sacct -j <job-id>

# display all job accounting fields
sacct -j <job-id> -o ALL
```
The complete list of job accounting fields can be found in the section "Job Accounting Fields" of the sacct man page. You can also print it with the command sacct -e.