The SLURM Batch Scheduler
SLURM overview
The SLURM Workload Manager is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Additional plugins are used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
SLURM is used on the majority of the Top500 supercomputers and also in many smaller academic centers. Due to its flexibility, speed and constant improvement, it has been chosen as the default batch scheduler on the new clusters of the UL HPC platform, replacing OAR.
The official documentation and a helpful printable cheatsheet are referenced at the bottom of this page.
We maintain pages dedicated to examples specific for the iris cluster, linked below:
Key concepts
SLURM manages user jobs which have the following key characteristics:
- set of requested resources:
- number of computing resources: nodes (including all their CPUs and cores) or CPUs (including all their cores) or cores
- amount of memory: either per node or per (logical) CPU
- (wall)time needed for the user’s tasks to complete their work
- a requested node partition (job queue)
- a requested quality of service (QoS) level which grants users specific accesses
- a requested account for accounting purposes
By default users submit jobs to a particular partition (marked as such for all users), and under a particular account (pre-set per user). However, users are always required to explicitly request the desired QOS.
For a thorough description of job specification options check out man sbatch on the cluster or browse its manual online.
All SLURM tools are fully documented within their individual manuals (man $command) and you are encouraged to read them for advanced usage options.
The particular configuration of the iris cluster is detailed below.
Basic usage commands
- squeue: view queued jobs information
- sinfo: view queue, partition and node / node features information
- sbatch: submit job for batch (scripted) execution
- srun: submit interactive job, run (parallel) job step
- scancel: cancel queued jobs
- scontrol: detailed control and information on jobs, queues, partitions
- sstat: view system-level utilization (memory, I/O, energy) for running jobs / job steps
- sacct: view system-level utilization for completed jobs / job steps (accounting database)
Action | SLURM command |
---|---|
Submit passive/batch job | sbatch $script |
Start interactive job | srun -p interactive --qos debug --pty bash -i |
Attach to running job | sjoin $jobid [$node] |
Queue status | squeue |
User (own) jobs status | squeue -u $USER |
Specific job status (detailed) | scontrol show job $jobid |
Job accounting status (detailed) | sacct --job $jobid -l |
Job efficiency report | seff $jobid |
Delete (running/waiting) job | scancel $jobid |
Hold job | scontrol hold $jobid |
Resume held job | scontrol release $jobid |
Node list and their properties | scontrol show nodes |
Partition list, status and limits | sinfo |
License list and used/free status | scontrol show licenses |
- Basic options for sbatch and srun commands:
Action | sbatch/srun option |
---|---|
Request $n distributed nodes | -N $n |
Request $m memory per node | --mem=$mGB |
Request $mc memory per core (logical cpu) | --mem-per-cpu=$mcGB |
Request job walltime | --time=d-hh:mm:ss |
Request $tn tasks per node | --ntasks-per-node=$tn |
Request $ct cores per task (multithreading) | -c $ct |
Request $nt total # of tasks | -n $nt |
Request $g # of GPUs per node | --gpus-per-node=$g |
Request to start job at specific $time | --begin $time |
Request $t tokens from specific $license | -L $license:$t |
Specify specific node $feature | -C $feature |
Specify job name as $name | -J $name |
Specify job partition | -p $partition |
Specify account | -A $account |
Specify email address | --mail-user=$email |
Request email on event | --mail-type=all[,begin,end,fail] |
Use the above actions in a batch script | #SBATCH $option |
Note: The elements given above as $name are meant to be replaced by you with the appropriate name (e.g. vplugaru instead of $user and #SBATCH -N 10 instead of #SBATCH $option).
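As a rough illustration of how the options above combine in a batch script, the sketch below requests two nodes with four tasks each for 30 minutes on the batch partition. The job name, the resource sizes and the echoed message are arbitrary choices made for this example, not recommended values, and the fallbacks after ":-" exist only so that the script also runs outside of a SLURM allocation.

```shell
#!/bin/bash -l
#SBATCH -J example_job
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 1
#SBATCH --time=0-00:30:00
#SBATCH -p batch
#SBATCH --qos=normal

# The #SBATCH lines are read by sbatch at submission time; to bash they are plain comments.
# Fallback values (example choices) are used when the script runs outside of SLURM.
job_name="${SLURM_JOB_NAME:-example_job}"
node_count="${SLURM_NNODES:-1}"
summary="job ${job_name} running on ${node_count} node(s)"
echo "${summary}"

# Launch one copy of hostname per task, but only inside a SLURM allocation.
if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    srun hostname
fi
```

Saved as, say, example_job.sh (a hypothetical file name), the script would be submitted with sbatch example_job.sh; options given on the sbatch command line take precedence over the #SBATCH lines.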
- SLURM environment variables you can use within your job:
Description | Environment variable |
---|---|
Job ID | $SLURM_JOBID |
Job name | $SLURM_JOB_NAME |
Name of account under which job runs | $SLURM_JOB_ACCOUNT |
Job submission directory | $SLURM_SUBMIT_DIR |
Number of nodes assigned to the job | $SLURM_NNODES |
Name of nodes assigned to the job | $SLURM_JOB_NODELIST |
Number of cores of the job | $SLURM_NPROCS |
Number of cores per node for the job | $SLURM_JOB_CPUS_PER_NODE |
Task ID assigned within a job array | $SLURM_ARRAY_TASK_ID |
Note: The complete list of environment variables can be found in the dedicated section of man sbatch or online.
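The variables in the table are typically used at the top of a batch script to record where and how the job runs. The following is a minimal sketch of that idea; the fallback values and the log layout are assumptions made so that the snippet can also be tried outside of a job.

```shell
#!/bin/bash -l
# Record where and how the job runs; each fallback after ":-" applies outside of SLURM.
job_id="${SLURM_JOBID:-none}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
node_list="${SLURM_JOB_NODELIST:-${HOSTNAME:-localhost}}"

echo "Job ID:          ${job_id}"
echo "Submitted from:  ${submit_dir}"
echo "Allocated nodes: ${node_list}"

# Work from the submission directory so that relative input/output paths resolve as expected.
cd "${submit_dir}" || echo "cannot enter ${submit_dir}" >&2
```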
Deciding where and how to run your job
- The following factors will influence where and how you should run your job:
- execution type: if you are developing code, testing out things (interactive job) OR starting a production run (batch job)
- parallelism type: if your application is serial (no inherent parallelism, although parametric executions are possible) or parallel (shared-memory or distributed-memory)
- the required processing time: use the queue which allows a job walltime long enough for your task to finish (however we always encourage short walltime jobs and use of checkpoint-restart capabilities)
Advanced usage
SLURM has many advanced features and this page only briefly discusses the most common ones. For a complete overview, there’s nothing better than consulting the references.
Job arrays
Job arrays make parametric executions easy by enabling the user to submit a single job which spawns several (to many) individual jobs containing minor variations, altering their processing flow based on a unique index.
In order to take advantage of this facility, the user needs to:
- submit the job using the --array=$start-$end sbatch option (e.g. --array=1-10); job arrays are only supported for batch jobs
- direct the standard output/error to array index-based files using %A (Job ID) and %a (Array ID) in the batch script header (e.g. #SBATCH -o myarrayjob_%A_%a.out)
- direct the processing flow in the batch script based on the SLURM_ARRAY_TASK_ID environment variable (e.g. myapp --input=file.${SLURM_ARRAY_TASK_ID})
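Put together, the three steps above could look like the following sketch. The application myapp and the file.N input naming come from the examples in the list; the job name, task count and walltime are illustrative assumptions, and the fallback index 1 only serves to make the script runnable outside of SLURM.

```shell
#!/bin/bash -l
#SBATCH -J myarrayjob
#SBATCH --array=1-10
#SBATCH -o myarrayjob_%A_%a.out
#SBATCH -n 1
#SBATCH --time=0-00:10:00

# Each array task receives its own index; default to 1 when run outside of SLURM.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="file.${task_id}"

echo "Array task ${task_id} would process ${input_file}"
# On the cluster, replace the echo above with the real call, for instance:
#   myapp --input="${input_file}"
```

Submitting this script once creates ten array tasks, each writing to its own output file and each selecting a different input through its index.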
Job dependencies
Many scientific workflows require setting up a sequence of processing steps (pipeline), and SLURM enables this through job dependencies.
Start job when… | sbatch/srun option |
---|---|
these other jobs have started | -d after:$jobid1:$jobid2 |
these other jobs have ended | -d afterany:$jobid1:$jobid2 |
these other jobs have ended with no errors | -d afterok:$jobid1:$jobid2 |
these other jobs have ended with errors | -d afternok:$jobid1:$jobid2 |
all other jobs with the same name have ended | -d singleton |
Notes:
- all -d after* options support one or more job IDs; the table above shows how to use two as an example
- -d afterok:$jobid together with -d singleton are arguably the most useful dependency options.
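When a pipeline is scripted, the value of the -d option usually has to be assembled from job IDs captured at submission time. The sketch below shows one possible way to do this; build_dependency is a helper invented for this example, and the job IDs and script names (preprocess.sh, postprocess.sh) are placeholders.

```shell
#!/bin/bash
# Build the value of the -d option from a dependency type and one or more job IDs.
build_dependency() {
    local kind="$1"; shift
    local IFS=":"          # "$*" joins the remaining arguments with the first character of IFS
    echo "${kind}:$*"
}

dep="$(build_dependency afterok 1001 1002)"
echo "sbatch -d ${dep} postprocess.sh"

# On the cluster, a two-stage pipeline could then be chained as follows.
# --parsable makes sbatch print only the job ID (optionally followed by ";cluster"):
#   first=$(sbatch --parsable preprocess.sh | cut -d ';' -f 1)
#   sbatch -d "$(build_dependency afterok "$first")" postprocess.sh
```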
Best-effort jobs
Best-effort (preemptible) jobs allow an efficient usage of the platform by filling available computing nodes until regular jobs are submitted.
As a general rule users should ensure that they track successful completion of best-effort jobs (which may be interrupted by other jobs at any time) and use them in combination with mechanisms such as Checkpoint-Restart (described below) that allow applications to stop and resume safely.
The way job preemption is handled on the iris cluster is described in the dedicated section.
Checkpoint-restart
Checkpoint-restart (C-R) is the technique where the application's state is stored in the filesystem, allowing the user to restart computation from this saved state in order to minimize loss of computation time, e.g. when the job reaches its allowed walltime, when software/hardware faults occur, etc.
DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You will find its reference quick-start documentation here.
DMTCP scripts tailored for SLURM can be found in our SLURM launchers page.
Notes:
- C-R (application or system-level) imposes a heavy load on the filesystem, and thus may not be suitable for applications with high memory utilization, or for large experiments where the total amount of memory used across all the nodes taking part in the computation is high.
- Not all applications are supported in DMTCP but the most common are: “MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages”. You should check the issue tracker of DMTCP where known incompatibilities and problems are tracked.
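To make the principle concrete, here is a small sketch of application-level checkpointing, in which a loop saves its progress after every step and resumes from the saved state when restarted. It does not use DMTCP; the checkpoint file name, the CKPT_FILE variable and the number of steps are assumptions chosen purely for illustration.

```shell
#!/bin/bash
# Application-level checkpointing: the loop counter is saved after every step,
# so a restarted run continues after the last completed step instead of from step 1.
ckpt_file="${CKPT_FILE:-./example.ckpt}"   # hypothetical checkpoint location
total_steps=5

run_with_checkpoint() {
    local step=0
    # Resume from the saved state when a checkpoint exists.
    if [ -f "${ckpt_file}" ]; then
        step="$(cat "${ckpt_file}")"
    fi
    while [ "${step}" -lt "${total_steps}" ]; do
        step=$((step + 1))
        echo "processing step ${step}/${total_steps}"
        # Write to a temporary file first so an interruption never leaves a half-written checkpoint.
        echo "${step}" > "${ckpt_file}.tmp" && mv "${ckpt_file}.tmp" "${ckpt_file}"
    done
}

run_with_checkpoint
```

If the script is interrupted after step 3 and started again, only steps 4 and 5 are executed. Real applications save far more state than a counter, which is where the filesystem load mentioned in the notes comes from.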
Iris cluster particularities
The iris cluster is the first UL HPC cluster to use SLURM.
It has been configured with a set of partitions and QOS that enable advanced workflows and accounting, detailed in the following sections.
Node partitions
Partition | # Nodes | Default time | Max time | Max nodes/user |
---|---|---|---|---|
interactive | 8 | 0-0:30:0 | 0-2:0:0 | 2 |
batch* | 168 | 0-2:0:0 | 2-0:0:0 | 64 |
gpu | 24 | 0-2:0:0 | 2-0:0:0 | 4 |
bigmem | 4 | 0-2:0:0 | 2-0:0:0 | 1 |
The batch partition is the default partition for user jobs not specifying otherwise (-p option to sbatch/srun).
The iris cluster is heterogeneous and contains nodes with different features, most importantly the processor generation. Use sinfo -l -N to discover the features which enable you to select specific sets of nodes using the -C option of sbatch/srun.
As of October 2020, the following features are available on the regular computing nodes:
Compute nodes | Feature | SBATCH option | sbatch/srun command line |
---|---|---|---|
iris-001..108 | broadwell | #SBATCH -C broadwell | sbatch -C broadwell [...] |
iris-109..168 | skylake | #SBATCH -C skylake | sbatch -C skylake [...] |
As of October 2020, the iris cluster also features GPU-accelerated and large-memory nodes. They are all based on Skylake-generation CPUs, and are divided into separate partitions, with homogeneous nodes within each partition:
Compute nodes | Features | SBATCH option | sbatch/srun command line |
---|---|---|---|
iris-169..186 | skylake,volta | #SBATCH -p gpu | sbatch -p gpu |
iris-187..190 | skylake | #SBATCH -p bigmem | sbatch -p bigmem |
iris-191..196 | skylake,volta,volta32 | #SBATCH -p gpu | sbatch -p gpu |
The accelerated compute nodes feature GPUs with different on-board memory sizes.
Selecting only nodes with 32GB HBM2 on-board memory is possible by using a constraint on the volta32 feature, i.e. by passing -C volta32 together with -p gpu to sbatch/srun.
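A batch script restricted to the 32GB GPU nodes might look like the sketch below. Only the partition and feature names come from the table above; the number of GPUs and the walltime are illustrative assumptions, and a QOS may additionally have to be requested depending on the cluster policy.

```shell
#!/bin/bash -l
#SBATCH -p gpu
#SBATCH -C volta32
#SBATCH --gpus-per-node=1
#SBATCH --time=0-01:00:00

# When GPUs are allocated, SLURM typically exports CUDA_VISIBLE_DEVICES for them;
# "none" is reported when the variable is unset (e.g. outside of a GPU job).
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"
```

The same constraint can be given on the command line, following the pattern used in the tables above: sbatch -p gpu -C volta32 [...].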
Quality of Service (QOS)
QoS | User group | Max run. jobs/user | Description |
---|---|---|---|
besteffort | ALL | 100 | QOS for preemptible jobs, requeued on preemption |
normal | ALL | 10 | QOS for normal usage |
debug | ALL | 10 | QOS for normal usage of the interactive partition |
long | ALL | 1 | QOS for long jobs (max walltime: 14 days) |
low | ALL | 2 | QOS for jobs with a lower priority |
high | ALL | 10 | QOS for high priority jobs (restricted) |
urgent | ALL | 100 | QOS for higher priority jobs (restricted) |
The QOS is selected using the --qos option to sbatch/srun. Once the limits enforced by a particular QOS are reached (e.g. on #cores), jobs will wait within the queue with the QOS*Limit reason set (e.g. QOSGrpCpuLimit).
For the complete, descriptive list of possible reasons why a job is waiting in the queues see the SLURM job reason codes official documentation.
We highly encourage the use of best-effort jobs (described below) to ensure maximum resource utilization.
Accounts and permissions
Every user job runs under a group account (defined within SLURM’s accounting database), which grants access to specific QOS levels.
Each user is linked to an account named after the user’s line manager (UL Professor or Group head). External researchers and students collaborating with UL groups on common research are also linked to a UL manager account.
At the top level, there is a group account for each of the UL faculties and Inter-disciplinary Centres, inherited by all manager accounts.
The following table describes this setup:
Account | Parent Account |
---|---|
UL | |
FSTC | UL |
FDEF | UL |
FLSHASE | UL |
LCSB | UL |
SNT | UL |
CCDH (C2DH) | UL |
Professor $X | $FACULTY/$IC |
Group head $G | $FACULTY/$IC |
Researcher $R | Professor $X |
Researcher $R | Group head $G |
Student $S | Professor $X |
Student $S | Group head $G |
External collaborator $E | Professor $X |
External collaborator $E | Group head $G |
All accounts inherit the normal QOS settings defined above.
To check the default account your jobs will use:
sacctmgr show user $USER format=user%20s,defaultaccount%30s
To see all account associations for your user and the QOS they grant access to:
sacctmgr list association where users=$USER format=account%30s,user%20s,qos%120s
Submitting jobs
Regular jobs
All user jobs should specify the desired QOS and the partition to submit to.
By default, the batch partition and the normal QOS are used if not set.
Also by default, a user’s line manager’s account is used for accounting purposes, if not otherwise set. However, as inter-group collaborations are possible, additional relationships can be put in place allowing users to submit jobs under a different account. This specific configuration should be discussed with the HPC Team.
Best-effort jobs on iris
To start jobs as best-effort, users simply need to use the dedicated QOS besteffort in place of the QOS associated with the selected partition or their contributor QOS. Note that only the interactive and batch partitions accept besteffort.
The special besteffort QOS is preemptible by all other QOS, but has the advantage of not having the limitations imposed on the other QOS, such as maximum number of nodes, walltime, etc.
Best-effort jobs can be set to be automatically requeued (use the --requeue parameter to sbatch) if preempted by regular jobs, but then users should ensure that repeated executions of their submitted workflow do not have unintended results (e.g. multiple copies of result files, removal of directories containing good results, etc.).
Users are encouraged to use the best-effort mechanism in order to take advantage of the HPC platform as much as possible. Many scientific applications natively support internal state saving and restart (check their documentation!), and there is also the option of system-level Checkpoint-Restart using DMTCP as described above.
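One way to keep a requeued best-effort job from repeating or damaging earlier work is to mark each finished unit of work and skip it on the next execution. The sketch below illustrates this under stated assumptions: the job name, the three work items, the results directory and the RESULT_DIR variable are invented for the example, while SLURM_RESTART_COUNT is the variable SLURM sets for jobs that have been restarted.

```shell
#!/bin/bash -l
#SBATCH -J besteffort_example
#SBATCH -p batch
#SBATCH --qos=besteffort
#SBATCH --requeue
#SBATCH -n 1

# SLURM_RESTART_COUNT is set when the job has been requeued and restarted; 0 is assumed otherwise.
restart_count="${SLURM_RESTART_COUNT:-0}"
result_dir="${RESULT_DIR:-./results}"   # hypothetical output location
mkdir -p "${result_dir}"

# Process each work item at most once: items finished before a preemption are skipped.
for item in 1 2 3; do
    marker="${result_dir}/item_${item}.done"
    if [ -f "${marker}" ]; then
        echo "item ${item} already done, skipping (restart #${restart_count})"
        continue
    fi
    echo "processing item ${item}"
    # ... the real work for this item would go here ...
    touch "${marker}"
done
```

Because the marker is only created after an item completes, a preemption in the middle of an item causes just that item to be redone after the requeue.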
Examples of launchers can be found here:
Software licenses on iris
The iris cluster has dedicated ARM Forge and Performance Reports licenses.
They have been configured within SLURM and you are required to request license tokens for them within your jobs that use these software tools.
Your jobs asking for more tokens than currently available will wait in the queue. To see the available tokens and license names you can simply use scontrol show lic.
Examples for using licenses with SLURM are in the dedicated examples page and the launchers page.