The SLURM Batch Scheduler
The SLURM Workload Manager is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Additional plugins are used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
SLURM is used in the majority of the Top500 world supercomputers and also in many smaller academic centers. Due to its flexibility, speed and constant improvement, it has been chosen as the default batch scheduler on the new clusters of the UL HPC platform, replacing OAR.
The official documentation and a helpful printable cheatsheet are referenced at the bottom of this page.
We maintain pages dedicated to examples specific for the iris cluster, linked below:
SLURM manages user jobs which have the following key characteristics:
- set of requested resources:
- number of computing resources: nodes (including all their CPUs and cores) or CPUs (including all their cores) or cores
- amount of memory: either per node or per (logical) CPU
- (wall)time needed for the user’s tasks to complete their work
- a requested node partition (job queue)
- a requested quality of service (QoS) level which grants users specific access rights
- a requested account for accounting purposes
By default users submit jobs to a particular partition (marked as such for all users), and under a particular account (pre-set per user). However, users are always required to explicitly request the desired QOS.
For a thorough description of job specification options check out man sbatch on the cluster or browse its manual online.
All SLURM tools are fully documented within their individual manuals (man $command) and you are encouraged to read them for advanced usage options.
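As a rough illustration of the job characteristics listed above, a batch script might request them through #SBATCH directives as sketched below. The job name, the resource amounts and the srun hostname step are placeholders chosen for this example, not recommended settings; the batch partition and the qos-batch QOS are the ones described later on this page.

```shell
#!/bin/bash
#SBATCH -J example-job            # job name (placeholder)
#SBATCH -N 2                      # number of nodes
#SBATCH --ntasks-per-node=4       # tasks per node
#SBATCH -c 1                      # cores per task
#SBATCH --mem-per-cpu=2G          # memory per (logical) CPU
#SBATCH --time=0-00:30:00         # walltime (days-hours:minutes:seconds)
#SBATCH -p batch                  # partition (job queue)
#SBATCH --qos=qos-batch           # QOS, requested explicitly

# Number of tasks implied by a nodes x tasks-per-node request
total_tasks() {
    echo $(( $1 * $2 ))
}

# Fall back to the values requested above when run outside of SLURM
nodes="${SLURM_JOB_NUM_NODES:-2}"
per_node="${SLURM_NTASKS_PER_NODE:-4}"
echo "Expecting $(total_tasks "$nodes" "$per_node") tasks"

# Launch a parallel job step only when actually running inside a SLURM job
if [ -n "${SLURM_JOB_ID:-}" ]; then
    srun hostname
fi
```

Saved as a file, such a script would be submitted with sbatch; the #SBATCH lines are plain comments to the shell, so the same file can also be run directly for a quick syntax check.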
Basic usage commands
- squeue: view queued jobs information
- sinfo: view queue, partition and node / node features information
- sbatch: submit job for batch (scripted) execution
- srun: submit interactive job, run (parallel) job step
- scancel: cancel queued jobs
- scontrol: detailed control and information on jobs, queues, partitions
- sstat: view system-level utilization (memory, I/O, energy) for running jobs / job steps
- sacct: view system-level utilization for completed jobs / job steps (accounting database)
|Submit passive/batch job||
|Start interactive job||
|Attach to running job||
|User (own) jobs status||
|Specific job status (detailed)||
|Job accounting status (detailed)||
|Job efficiency report||
|Delete (running/waiting) job||
|Resume held job||
|Node list and their properties||
|Partition list, status and limits||
|License list and used/free status||
- Basic options for
|Request job walltime||
|Request to start job at specific
|Specify specific node
|Specify job name as
|Specify job partition||
|Specify email address||
|Request email on event||
|Use the above actions in a batch script||
Note: The elements given above as
$name are meant to be replaced by you with the appropriate
vplugaru instead of
#SBATCH -N 10 instead of
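The sketch below shows one way to assemble common sbatch options on the command line. The helper name sbatch_options, the job name, the e-mail address and launcher.sh are hypothetical; the flags themselves (-J, --time, -p, --qos, --mail-user, --mail-type) are standard sbatch options, and the qos-$partition naming follows the QOS scheme used on this page.

```shell
#!/bin/bash
# Build an option string for sbatch.
# $1 = job name, $2 = walltime, $3 = partition, $4 = e-mail address
sbatch_options() {
    echo "-J $1 --time=$2 -p $3 --qos=qos-$3 --mail-user=$4 --mail-type=END,FAIL"
}

# Dry run: print the command instead of submitting it
echo "sbatch $(sbatch_options myjob 0-02:00:00 batch user@example.org) launcher.sh"
```

The same options can be placed in a batch script by prefixing each of them with #SBATCH on its own line.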
- SLURM environment variables you can use within your job:
|Name of account under which job runs||
|Job submission directory||
|Number of nodes assigned to the job||
|Name of nodes assigned to the job||
|Number of cores of the job||
|Number of cores per node for the job||
|Task ID assigned within a job array||
Note: The complete list of environment variables can be found in the dedicated section of
man sbatch or online.
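A job script can report these values at start-up, which helps when debugging. The sketch below is only an illustration: the variable names are standard SLURM ones, but their pairing with the descriptions in the table above is an assumption, and the function name job_summary is made up for this example.

```shell
#!/bin/bash
# Print the main SLURM job variables, or "unset" when run outside of a job
job_summary() {
    echo "account=${SLURM_JOB_ACCOUNT:-unset}"
    echo "submit_dir=${SLURM_SUBMIT_DIR:-unset}"
    echo "nodes=${SLURM_JOB_NUM_NODES:-unset}"
    echo "nodelist=${SLURM_JOB_NODELIST:-unset}"
    echo "tasks=${SLURM_NTASKS:-unset}"
    echo "cpus_per_node=${SLURM_JOB_CPUS_PER_NODE:-unset}"
    echo "array_task=${SLURM_ARRAY_TASK_ID:-unset}"
}

job_summary
```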
Deciding where and how to run your job
- The following factors will influence where and how you should run your job:
- execution type: if you are developing code, testing out things (interactive job) OR starting a production run (batch job)
- parallelism type: if your application is serial (no inherent parallelism, although parametric executions are possible) or parallel (shared-memory or distributed-memory)
- the required processing time: use the queue which allows a job walltime long enough for your task to finish (however we always encourage short walltime jobs and use of checkpoint-restart capabilities)
SLURM has many advanced features and this page only briefly discusses the most common ones. For a complete overview, there’s nothing better than consulting the references.
Job arrays make parametric executions easy by enabling the user to submit a single job which spawns several (to many) individual jobs containing minor variations, altering their processing flow based on a unique index.
In order to take advantage of this facility, the user needs to:
- submit the job using the --array=$start-$end sbatch option (e.g.
- direct the standard output/error to array index-based files using %A (job ID) and %a (array index) in the batch script header (e.g. #SBATCH -o myarrayjob_%A_%a.out)
- direct the processing flow in the batch script based on the SLURM_ARRAY_TASK_ID environment variable (e.g.
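The three steps above can be combined as in the following minimal sketch. The parameter values, the array range and the helper name param_for_task are invented for illustration; outside of SLURM the script falls back to array index 1 so that it can be tried locally.

```shell
#!/bin/bash
#SBATCH -J myarrayjob
#SBATCH --array=1-4
#SBATCH -o myarrayjob_%A_%a.out
#SBATCH --time=0-00:10:00
#SBATCH -p batch
#SBATCH --qos=qos-batch

# Pick the parameter matching a 1-based array index.
# $1 = array index, remaining arguments = list of parameters
param_for_task() {
    idx="$1"
    shift
    shift $((idx - 1))
    echo "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array element; default to 1 otherwise
task_id="${SLURM_ARRAY_TASK_ID:-1}"
param="$(param_for_task "$task_id" 0.1 0.5 1.0 2.0)"
echo "Array task ${task_id} uses parameter ${param}"
```

Each array element runs the same script, and only the selected parameter differs between them.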
Many scientific workflows require setting up a sequence of processing steps (pipeline), and SLURM enables this through job dependencies.
|Start job when…||sbatch/srun option|
|these other jobs have started||
|these other jobs have ended||
|these other jobs have ended with no errors||
|these other jobs have ended with errors||
|all other jobs with the same name have ended||
-d after* options support one or more job IDs; the table above shows how to use two as an example.
-d afterok:$jobid together with -d singleton are arguably the most useful dependency options.
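A small helper can build the dependency option from a list of job IDs, as in the sketch below. The function name dependency_option, the job IDs 1001 and 1002, and the scripts step1.sh and step2.sh are hypothetical; the example only prints the resulting command rather than submitting anything.

```shell
#!/bin/bash
# Build a "-d <kind>:<id>[:<id>...]" option.
# $1 = dependency kind (e.g. afterok), remaining arguments = job IDs
dependency_option() {
    kind="$1"
    shift
    opt="-d ${kind}"
    for id in "$@"; do
        opt="${opt}:${id}"
    done
    echo "$opt"
}

# Dry run: show the submission command for a step depending on two jobs
echo "sbatch $(dependency_option afterok 1001 1002) step2.sh"

# Possible use on a cluster, capturing the job ID with sbatch --parsable:
#   jid=$(sbatch --parsable step1.sh)
#   sbatch $(dependency_option afterok "$jid") step2.sh
```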
Best-effort (preemptible) jobs allow an efficient usage of the platform by filling available computing nodes until regular jobs are submitted.
As a general rule users should ensure that they track successful completion of best-effort jobs (which may be interrupted by other jobs at any time) and use them in combination with mechanisms such as Checkpoint-Restart (described below) that allow applications to stop and resume safely.
The way job preemption is handled on the iris cluster is described in the dedicated section.
Checkpoint-restart (C-R) is the technique where the applications’ state is stored in the filesystem, allowing the user to restart computation from this saved state in order to minimize loss of computation time, e.g. when the job reaches its allowed walltime, when software/hardware faults occur, etc.
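The idea can be illustrated with a toy application-level checkpoint, sketched below under simplifying assumptions: the "state" is just a step counter stored in a file, and the function name run_with_checkpoint and its arguments are invented for this example. Real applications save far richer state, and this is not how DMTCP works internally.

```shell
#!/bin/bash
# Advance a counter towards a target, saving progress after every step.
# $1 = checkpoint file, $2 = total steps, $3 = max steps in this run (optional)
run_with_checkpoint() {
    ckpt="$1"
    total="$2"
    budget="${3:-$2}"
    step=0
    # Resume from the saved state if a checkpoint exists
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi
    done_now=0
    while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        step=$((step + 1))
        done_now=$((done_now + 1))
        # Write to a temporary file first so an interruption never leaves
        # a half-written checkpoint behind
        echo "$step" > "${ckpt}.tmp" && mv "${ckpt}.tmp" "$ckpt"
    done
    echo "$step"
}
```

Calling the function repeatedly with the same checkpoint file and a small per-run budget mimics a job that is interrupted at its walltime and resubmitted: each run continues where the previous one stopped.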
DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You will find its reference quick-start documentation here.
DMTCP scripts tailored for SLURM can be found in our SLURM launchers page.
- C-R (application or system-level) imposes a heavy load on the filesystem, and thus may not be suitable for applications with high memory utilization, or for large experiments where the total amount of memory used across all the nodes taking part in the computation is high.
- Not all applications are supported in DMTCP but the most common are: “MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages”. You should check the issue tracker of DMTCP where known incompatibilities and problems are tracked.
Iris cluster particularities
The iris cluster is the first UL HPC cluster to use SLURM.
It has been configured with a set of partitions and QOS that enable advanced workflows and accounting, detailed in the following sections.
|Partition||# Nodes||Default time||Max time||Max nodes/user|
The batch partition is the default partition for user jobs not specifying otherwise (-p option to sbatch/srun).
The iris cluster is heterogeneous and contains nodes with different features, most importantly the processor generation. Use sinfo -l -N to discover the features which enable you to select specific sets of nodes using the -C option of sbatch/srun.
As of January 2019, the following features are available on the regular computing nodes:
|Compute nodes||Feature||SBATCH option||sbatch/srun command line|
As of January 2019 the iris cluster also features GPU-accelerated and large-memory nodes. They are all based on Skylake-generation CPUs, and are divided into separate partitions, with homogeneous nodes within each partition:
|Compute nodes||Features||SBATCH option||sbatch/srun command line|
The accelerated compute nodes feature GPUs with different on-board memory sizes.
Selecting only nodes with 32GB HBM2 on-board memory is possible by using a constraint on the
volta32 feature, e.g. with:
Quality of Service (QOS)
|QoS||User group||Max cores||Max run. jobs/user||Max submit jobs/user||Description|
|qos-besteffort||ALL||no limit||400||QOS for preemptible jobs, requeued on preemption|
|qos-batch||ALL||2344||100||400||QOS for normal usage of the
|qos-bigmem||ALL||ALL||100||40||QOS for normal usage of the
|qos-gpu||ALL||ALL||100||100||QOS for normal usage of the
|qos-interactive||ALL||168||10||20||QOS for normal usage of the
|qos-long||ALL||168||10||20||QOS for normal usage of the
|qos-batch-###||reserved||100||400||QOS for reserved group usage of the
|qos-interactive-###||reserved||10||20||QOS for reserved group usage of the
|qos-long-###||reserved||10||20||QOS for reserved group usage of the
--qos option to sbatch/srun.
Numbered qos-$name-$number QOS rules are dedicated to specific projects and user groups.
Once the limits enforced by a particular QOS are reached (e.g. on #cores), jobs will wait within the queue with the
QOS*Limit reason set (e.g.
For the complete, descriptive list of possible reasons why a job is waiting in the queues see the SLURM job reason codes official documentation.
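To spot jobs held back by a QOS limit, the reason column of squeue can be filtered, as in the sketch below. The function name qos_blocked_jobs is made up for this example; it expects one job per line in the form "jobid state reason", which is what squeue prints with the -h -o "%i %T %r" options.

```shell
#!/bin/bash
# Read "jobid state reason" lines on stdin and print the IDs of jobs
# whose reason matches the QOS*Limit pattern
qos_blocked_jobs() {
    awk '$3 ~ /^QOS.*Limit$/ { print $1 }'
}

# Query the scheduler only where squeue is available
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}" -h -o "%i %T %r" | qos_blocked_jobs
fi
```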
We highly encourage the use of best-effort jobs (described below) to ensure maximum resource utilization.
Accounts and permissions
Every user job runs under a group account (defined within SLURM’s accounting database), which grants access to specific QOS levels.
Each user is linked to an account named after the user’s line manager (UL Professor or Group head). External researchers and students collaborating with UL groups on common research are also linked to a UL manager account.
At the top level, there is a group account for each of the UL faculties and Inter-disciplinary Centres, inherited by all manager accounts.
The following table describes this setup:
|Group head $G||$FACULTY/$IC|
|Researcher $R||Professor $X|
|Researcher $R||Group head $G|
|Student $S||Professor $X|
|Student $S||Group head $G|
|External collaborator $E||Professor $X|
|External collaborator $E||Group head $G|
All accounts inherit the normal QOS settings defined above.
Special QOS are defined for contributors (qos-$partition-$qosnumber) to the HPC platform, granting additional/exclusive access to the nodes. Please contact the HPC Team if you want to become a contributor.
To check the default account your jobs will use:
sacctmgr show user $USER format=user%20s,defaultaccount%30s.
To see all account associations for your user and the QOS they grant access to:
sacctmgr list association where users=$USER format=account%30s,user%20s,qos%120s.
All user jobs should specify the desired QOS and the partition to submit to.
If no partition is set, the batch partition is used by default, but the corresponding qos-batch must still be specified explicitly.
Also by default, a user’s line manager’s account is used for accounting purposes, if not otherwise set. However, as inter-group collaborations are possible, additional relationships can be put in place allowing users to submit jobs under a different account. This specific configuration should be discussed with the HPC Team.
Users part of an account with special QOS registered may not be able to access the
qos-$partition QOSs and should use their specific
Best-effort jobs on iris
To start jobs as best-effort, users simply need to use the dedicated QOS
in place of the QOS associated with the selected partition or their contributor QOS. Note that only the interactive and batch partitions accept best-effort jobs.
The qos-besteffort QOS is preemptible by all other QOS, but has the advantage of not having the limitations imposed on the other QOS, such as maximum number of nodes, walltime, etc.
Best-effort jobs can be set to be automatically requeued (use the --requeue parameter to sbatch) if preempted by regular jobs, but then users should ensure that repeated executions of their submitted workflow do not have unintended results (e.g. multiple copies of result files, removal of directories containing good results, etc.).
Users are encouraged to use the best-effort mechanism in order to take advantage of the HPC platform as much as possible. Many scientific applications natively support internal state saving and restart (check their documentation!), and there is also the option of system-level Checkpoint-Restart using DMTCP as described above.
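A requeued best-effort job restarts its script from the top, so the script should tolerate being run several times. The following sketch shows one possible pattern, assuming the work can be split into independent items: finished items leave a marker file and are skipped on later runs. The function name process_once, the item names and the results directory are hypothetical, and the "work" is a placeholder.

```shell
#!/bin/bash
#SBATCH -J besteffort-demo
#SBATCH -p batch
#SBATCH --qos=qos-besteffort
#SBATCH --requeue
#SBATCH --time=0-01:00:00

# Process one work item unless a previous run already completed it.
# $1 = item name, $2 = output directory
process_once() {
    item="$1"
    outdir="$2"
    result="${outdir}/${item}.done"
    if [ -e "$result" ]; then
        echo "skip ${item}"
        return 0
    fi
    # Placeholder for the real work; the result is written under a temporary
    # name and renamed last, so a preempted run never leaves a result that
    # looks complete
    echo "result for ${item}" > "${result}.partial"
    mv "${result}.partial" "$result"
    echo "ran ${item}"
}

# Run the work items only inside a SLURM job
if [ -n "${SLURM_JOB_ID:-}" ]; then
    outdir="${SLURM_SUBMIT_DIR:-.}/results"
    mkdir -p "$outdir"
    for item in a b c; do
        process_once "$item" "$outdir"
    done
fi
```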
Examples of launchers can be found here:
Software licenses on iris
The iris cluster has dedicated ARM Forge and Performance Reports licenses.
They have been configured within SLURM and you are required to request license tokens for them within your jobs that use these software tools.
Jobs asking for more tokens than are currently available will wait in the queue. To see the available tokens and license names you can simply use scontrol show lic.