HPC @ Uni.lu

High Performance Computing in Luxembourg

SLURM overview

The SLURM Workload Manager is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.

SLURM is used on the majority of the Top500 supercomputers as well as in many smaller academic centers. Due to its flexibility, speed and constant improvement, it has been chosen as the default batch scheduler on the new clusters of the UL HPC platform, replacing OAR.

The official documentation and a helpful printable cheatsheet are referenced at the bottom of this page.

We maintain pages dedicated to examples specific to the iris cluster, linked below:

Key concepts

SLURM manages user jobs which have the following key characteristics:

  • set of requested resources:
    • number of computing resources: nodes (including all their CPUs and cores) or CPUs (including all their cores) or cores
    • amount of memory: either per node or per (logical) CPU
    • (wall)time needed for the user’s tasks to complete their work
  • a requested node partition (job queue)
  • a requested quality of service (QoS) level which grants users specific access rights
  • a requested account for accounting purposes

By default users submit jobs to a particular partition (marked as such for all users), and under a particular account (pre-set per user). However, users are always required to explicitly request the desired QOS.

For a thorough description of job specification options check out man sbatch on the cluster or browse its manual online. All SLURM tools are fully documented within their individual manuals (man $command) and you are encouraged to read them for advanced usage options.

The particular configuration of the iris cluster is detailed below.

Basic usage commands

  • squeue: view queued jobs information
  • sinfo: view queue, partition and node / node features information
  • sbatch: submit job for batch (scripted) execution
  • srun: submit interactive job, run (parallel) job step
  • scancel: cancel queued jobs
  • scontrol: detailed control and information on jobs, queues, partitions
  • sstat: view system-level utilization (memory, I/O, energy) for running jobs / job steps
  • sacct: view system-level utilization for completed jobs / job steps (accounting database)

| Action | SLURM command |
|---|---|
| Submit passive/batch job | sbatch $script |
| Start interactive job | srun --pty bash -i |
| Attach to running job | sjoin $jobid [$node] |
| Queue status | squeue |
| User (own) jobs status | squeue -u $USER |
| Specific job status (detailed) | scontrol show job $jobid |
| Job accounting status (detailed) | sacct --job $jobid -l |
| Job efficiency report | seff $jobid |
| Delete (running/waiting) job | scancel $jobid |
| Hold job | scontrol hold $jobid |
| Resume held job | scontrol release $jobid |
| Node list and their properties | scontrol show nodes |
| Partition list, status and limits | sinfo |
| License list and used/free status | scontrol show licenses |

  • Basic options for sbatch and srun commands:

| Action | sbatch/srun option |
|---|---|
| Request $n distributed nodes | -N $n |
| Request $m memory per node | --mem=${m}GB |
| Request $mc memory per core (logical cpu) | --mem-per-cpu=${mc}GB |
| Request job walltime | --time=d-hh:mm:ss |
| Request $tn tasks per node | --ntasks-per-node=$tn |
| Request $ct cores per task (multithreading) | -c $ct |
| Request $nt total # of tasks | -n $nt |
| Request $g # of GPUs per node | --gres=gpu:$g |
| Request to start job at specific $time | --begin $time |
| Request $t tokens from specific $license | -L $license:$t |
| Specify specific node $feature | -C $feature |
| Specify job name as $name | -J $name |
| Specify job partition | -p $partition |
| Specify account | -A $account |
| Specify email address | --mail-user=$email |
| Request email on event | --mail-type=all[,begin,end,fail] |
| Use the above actions in a batch script | #SBATCH $option |

Note: The elements given above as $name are meant to be replaced by you with the appropriate name (e.g. vplugaru instead of $user and #SBATCH -N 10 instead of #SBATCH $option).
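
As a minimal sketch of how the options above combine in a batch script, the snippet below writes a small launcher to a file and prints the command that would submit it. The file location, the job name, the resource values and ./my_app are hypothetical placeholders; the partition and QOS names are the iris ones used elsewhere on this page.

```shell
#!/bin/bash
# Sketch: generate a batch launcher that uses options from the table above.
# The path, job name, sizes and ./my_app are illustrative assumptions.
launcher="${TMPDIR:-/tmp}/demo_launcher.sh"

cat > "$launcher" <<'EOF'
#!/bin/bash -l
#SBATCH -J demo
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 1
#SBATCH --time=0-00:30:00
#SBATCH -p batch
#SBATCH --qos=qos-batch

# srun starts one copy of the application per task of the allocation.
srun ./my_app
EOF

# Nothing is submitted here; the command is only printed for review.
echo "Submit with: sbatch $launcher"
```

Options given on the sbatch command line take precedence over the #SBATCH lines, so the same launcher can be reused with, for example, a different walltime.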

  • SLURM environment variables you can use within your job:

| Description | Environment variable |
|---|---|
| Job ID | $SLURM_JOBID |
| Job name | $SLURM_JOB_NAME |
| Name of account under which job runs | $SLURM_JOB_ACCOUNT |
| Job submission directory | $SLURM_SUBMIT_DIR |
| Number of nodes assigned to the job | $SLURM_NNODES |
| Name of nodes assigned to the job | $SLURM_JOB_NODELIST |
| Number of tasks of the job (equal to the number of cores when each task uses one core) | $SLURM_NPROCS |
| Number of cores per node for the job | $SLURM_JOB_CPUS_PER_NODE |
| Task ID assigned within a job array | $SLURM_ARRAY_TASK_ID |

Note: The complete list of environment variables can be found in the dedicated section of man sbatch or online.
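
The variables listed above can be read like any other shell variable inside a launcher. The following sketch prints a one-line summary of the job context; the function name and the fallback values after ":-" are assumptions, added so that the function also runs outside of a SLURM job.

```shell
#!/bin/bash
# Sketch: summarize the job context from SLURM environment variables.
# The defaults (none, interactive, 1) are assumptions used only when the
# variables are unset, e.g. when testing the script on a login node.
job_summary() {
  local id="${SLURM_JOBID:-none}"
  local name="${SLURM_JOB_NAME:-interactive}"
  local nodes="${SLURM_NNODES:-1}"
  local tasks="${SLURM_NPROCS:-1}"
  echo "job=${id} name=${name} nodes=${nodes} tasks=${tasks}"
}

job_summary
```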

Deciding where and how to run your job

  • The following factors will influence where and how you should run your job:
    1. execution type: if you are developing code, testing out things (interactive job) OR starting a production run (batch job)
    2. parallelism type: if your application is serial (no inherent parallelism, although parametric executions are possible) or parallel (shared-memory or distributed-memory)
    3. the required processing time: use the queue which allows a job walltime long enough for your task to finish (however we always encourage short walltime jobs and use of checkpoint-restart capabilities)

Advanced usage

SLURM has many advanced features and this page only briefly discusses the most common ones. For a complete overview, there’s nothing better than consulting the references.

Job arrays

Job arrays make parametric executions easy by enabling the user to submit a single job which spawns several (to many) individual jobs containing minor variations, altering their processing flow based on a unique index.

In order to take advantage of this facility, the user needs to:

  • submit the job using the --array=$start-$end sbatch option (e.g. --array=1-10)
  • direct the standard output/error to array index-based files using %A (job ID) and %a (array index) in the batch script header (e.g. #SBATCH -o myarrayjob_%A_%a.out)
  • direct the processing flow in the batch script based on the SLURM_ARRAY_TASK_ID environment variable (e.g. myapp --input=file.${SLURM_ARRAY_TASK_ID})
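
The three steps above can be sketched in a single launcher. The job name and the helper function are hypothetical, and the default index of 1 is an assumption that lets the script run outside of a job array; myapp is the placeholder application from the example above and is only echoed, not executed.

```shell
#!/bin/bash -l
#SBATCH -J myarrayjob
#SBATCH --array=1-10
#SBATCH -o myarrayjob_%A_%a.out

# Sketch: each array task derives its own input file from the array index.
# An explicit argument overrides SLURM_ARRAY_TASK_ID; 1 is the fallback.
input_for_task() {
  local index="${1:-${SLURM_ARRAY_TASK_ID:-1}}"
  echo "file.${index}"
}

input="$(input_for_task)"
echo "array task ${SLURM_ARRAY_TASK_ID:-1} would run: myapp --input=${input}"
```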

Job dependencies

Many scientific workflows require setting up a sequence of processing steps (pipeline), and SLURM enables this through job dependencies.

| Start job when… | sbatch/srun option |
|---|---|
| these other jobs have started | -d after:$jobid1:$jobid2 |
| these other jobs have ended | -d afterany:$jobid1:$jobid2 |
| these other jobs have ended with no errors | -d afterok:$jobid1:$jobid2 |
| these other jobs have ended with errors | -d afternok:$jobid1:$jobid2 |
| all other jobs with the same name have ended | -d singleton |

Notes:

  • all -d after* options support one or more job IDs, the table above shows how to use two as an example;
  • -d afterok:$jobid together with -d singleton are arguably the most useful dependency options.
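
A pipeline can be chained by capturing the ID of each submitted job and passing it to the next submission. The sketch below uses a hypothetical helper, dependency_option, to build the -d argument; preprocess.sh and analyze.sh are placeholder launchers, and the submission part only runs where sbatch and both files exist.

```shell
#!/bin/bash
# Sketch: build a dependency option from a type and one or more job IDs.
dependency_option() {
  local type="$1"; shift
  local ids
  ids="$(IFS=:; echo "$*")"   # join the remaining arguments with ":"
  echo "-d ${type}:${ids}"
}

# Hypothetical two-step pipeline, skipped unless sbatch and both launchers exist.
if command -v sbatch >/dev/null 2>&1 && [ -f preprocess.sh ] && [ -f analyze.sh ]; then
  first="$(sbatch --parsable preprocess.sh)"   # --parsable prints only the job ID
  first="${first%%;*}"                         # drop a ";cluster" suffix if present
  # Left unquoted on purpose so that "-d" and its value become two arguments.
  sbatch $(dependency_option afterok "$first") analyze.sh
fi
```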

Best-effort jobs

Best-effort (preemptible) jobs allow efficient usage of the platform by filling available computing nodes until regular jobs are submitted.

As a general rule users should ensure that they track successful completion of best-effort jobs (which may be interrupted by other jobs at any time) and use them in combination with mechanisms such as Checkpoint-Restart (described below) that allow applications to stop and resume safely.

The way job preemption is handled on the iris cluster is described in the dedicated section.

Checkpoint-restart

Checkpoint-restart (C-R) is the technique where the application’s state is stored in the filesystem, allowing the user to restart computation from this saved state in order to minimize loss of computation time, e.g. when the job reaches its allowed walltime, when software/hardware faults occur, etc.
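
To illustrate the idea at application level, here is a minimal sketch in which a loop counter stands in for the application state. The function name, its arguments and the state-file format are assumptions made for this example; a real application would save and reload its own data in a similar way.

```shell
#!/bin/bash
# Sketch of application-level checkpoint-restart: the step counter is saved
# to a state file after every step, so a rerun resumes where it stopped.
# Arguments: total steps, state file, optional step at which to stop early.
run_steps() {
  local total="$1" state_file="$2" stop_after="${3:-$1}"
  local step=0
  # Restart: resume from the last completed step if a checkpoint exists.
  if [ -f "$state_file" ]; then
    step="$(cat "$state_file")"
  fi
  while [ "$step" -lt "$total" ] && [ "$step" -lt "$stop_after" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    echo "$step" > "$state_file"   # checkpoint
  done
  echo "completed ${step}/${total}"
}

state="${TMPDIR:-/tmp}/demo_steps.state"
rm -f "$state"
run_steps 5 "$state" 3   # simulates an interruption after step 3
run_steps 5 "$state"     # resumes from the checkpoint and finishes
```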

DMTCP is a checkpoint-restart solution that works outside the flow of user applications, enabling their state to be saved without application alterations. You will find its reference quick-start documentation here.

DMTCP scripts tailored for SLURM can be found in our SLURM launchers page.

Notes:

  • C-R (application or system-level) imposes a heavy load on the filesystem, and thus may not be suitable for applications with high memory utilization, or for large experiments where the total amount of memory used across all the nodes taking part in the computation is high.
  • Not all applications are supported in DMTCP but the most common are: “MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages”. You should check the issue tracker of DMTCP where known incompatibilities and problems are tracked.

Iris cluster particularities

The iris cluster is the first UL HPC cluster to use SLURM.

It has been configured with a set of partitions and QOS that enable advanced workflows and accounting, detailed in the following sections.

Node partitions

| Partition | # Nodes | Default time | Max time | Max nodes/user |
|---|---|---|---|---|
| batch* | 152 | 0-2:0:0 | 5-0:0:0 | unlimited |
| bigmem | 4 | 0-2:0:0 | 5-0:0:0 | unlimited |
| gpu | 18 | 0-2:0:0 | 5-0:0:0 | unlimited |
| interactive | 8 | 0-1:0:0 | 0-4:0:0 | 2 |
| long | 8 | 0-2:0:0 | 30-0:0:0 | 2 |

The batch partition is the default partition for user jobs not specifying otherwise (-p option to sbatch/srun).
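
Before submitting, it can help to check that the requested walltime fits within the Max time of the target partition. The sketch below converts the d-hh:mm:ss notation used in the table into seconds and compares two values; both function names are hypothetical, and the input is assumed to always contain the day prefix.

```shell
#!/bin/bash
# Sketch: convert a d-hh:mm:ss walltime into seconds.
# Assumes the full "days-hours:minutes:seconds" form, as in the table above.
walltime_to_seconds() {
  local days rest h m s
  days="${1%%-*}"
  rest="${1#*-}"
  IFS=: read -r h m s <<< "$rest"
  # The 10# prefix forces base 10 so that values such as "08" are not read as octal.
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeeds when the requested walltime ($1) does not exceed the limit ($2).
walltime_fits() {
  [ "$(walltime_to_seconds "$1")" -le "$(walltime_to_seconds "$2")" ]
}

if walltime_fits 0-12:00:00 5-0:0:0; then
  echo "0-12:00:00 fits within the batch partition limit"
fi
```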

The iris cluster is heterogeneous and contains nodes with different features, most importantly the processor generation. Use sinfo -l -N to discover the features which enable you to select specific sets of nodes using the -C option of sbatch/srun.
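
As a small sketch of feature discovery, the helper below lists the nodes that advertise a given feature. It assumes input lines of the form "node features", such as those printed by sinfo -N -h -o "%N %f"; the function name is hypothetical and the sample lines are illustrative values taken from the feature tables on this page.

```shell
#!/bin/bash
# Sketch: list the nodes that advertise a given feature.
# Reads "node feature1,feature2,..." lines on stdin. On the cluster:
#   sinfo -N -h -o "%N %f" | nodes_with_feature skylake
nodes_with_feature() {
  local feature="$1"
  awk -v want="$feature" '{
    n = split($2, feats, ",")
    for (i = 1; i <= n; i++) if (feats[i] == want) { print $1; break }
  }' | sort -u   # a node may be listed once per partition it belongs to
}

# Illustrative sample input, not live cluster output.
nodes_with_feature skylake <<'EOF'
iris-001 broadwell
iris-109 skylake
iris-169 skylake,volta
EOF
```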

As of January 2019, the following features are available on the regular computing nodes:

| Compute nodes | Feature | SBATCH option | sbatch/srun command line |
|---|---|---|---|
| iris-001..108 | broadwell | #SBATCH -C broadwell | sbatch -C broadwell [...] |
| iris-109..168 | skylake | #SBATCH -C skylake | sbatch -C skylake [...] |

As of January 2019, the iris cluster also features GPU-accelerated and large-memory nodes. They are all based on Skylake-generation CPUs and are divided into separate partitions, with homogeneous nodes within each partition:

| Compute nodes | Features | SBATCH option | sbatch/srun command line |
|---|---|---|---|
| iris-169..186 | skylake,volta | #SBATCH -p gpu | sbatch -p gpu |
| iris-187..190 | skylake | #SBATCH -p bigmem | sbatch -p bigmem |

Quality of Service (QOS)

| QoS | User group | Max cores | Max jobs/user | Description |
|---|---|---|---|---|
| qos-besteffort | ALL | no limit | | QOS for preemptible jobs, requeued on preemption |
| qos-batch | ALL | 2600 | 100 | QOS for normal usage of the batch partition |
| qos-bigmem | ALL | ALL | 100 | QOS for normal usage of the bigmem partition |
| qos-gpu | ALL | ALL | 100 | QOS for normal usage of the gpu partition |
| qos-interactive | ALL | 168 | 10 | QOS for normal usage of the interactive partition |
| qos-long | ALL | 168 | 10 | QOS for normal usage of the long partition |
| qos-batch-### | reserved | | 100 | QOS for special usage of the batch partition |
| qos-interactive-### | reserved | | 10 | QOS for special usage of the interactive partition |
| qos-long-### | reserved | | 10 | QOS for special usage of the long partition |


Important: The QOS is automatically determined by a job submission plugin based on the specified job partition, but users can request a specific QOS when submitting jobs with the --qos option of sbatch/srun.
Numbered qos-$name-$number QOS rules are dedicated to specific projects and user groups.

Once the limits enforced by a particular QOS are reached (e.g. on #cores), jobs will wait within the queue with the QOS*Limit reason set (e.g. QOSGrpCpuLimit).

For the complete, descriptive list of possible reasons why a job is waiting in the queues see the SLURM job reason codes official documentation.
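
To spot which of your pending jobs are held back by a QOS limit, the reason column of squeue can be filtered. The sketch below assumes "jobid reason" lines as printed by squeue -u $USER -h -t PENDING -o "%i %r"; the function name is hypothetical and the sample lines are made-up job IDs paired with common reason codes.

```shell
#!/bin/bash
# Sketch: print the IDs of pending jobs whose reason matches QOS*Limit.
# Reads "jobid reason" lines on stdin. On the cluster:
#   squeue -u "$USER" -h -t PENDING -o "%i %r" | jobs_waiting_on_qos
jobs_waiting_on_qos() {
  awk '$2 ~ /^QOS.*Limit$/ { print $1 }'
}

# Illustrative sample input, not live queue output.
jobs_waiting_on_qos <<'EOF'
101 Priority
102 QOSGrpCpuLimit
103 Resources
EOF
```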

We highly encourage the use of best-effort jobs (described below) to ensure maximum resource utilization.

Accounts and permissions

Every user job runs under a group account (defined within SLURM’s accounting database), which grants access to specific QOS levels.

Each user is linked to an account named after the user’s line manager (UL Professor or Group head). External researchers and students collaborating with UL groups on common research are also linked to a UL manager account.

At the top level, there is a group account for each of the UL faculties and Inter-disciplinary Centres, inherited by all manager accounts.

The following table describes this setup:

| Account | Parent Account |
|---|---|
| UL | |
| FSTC | UL |
| FDEF | UL |
| FLSHASE | UL |
| LCSB | UL |
| SNT | UL |
| Professor $X | $FACULTY/$IC |
| Group head $G | $FACULTY/$IC |
| Researcher $R | Professor $X |
| Researcher $R | Group head $G |
| Student $S | Professor $X |
| Student $S | Group head $G |
| External collaborator $E | Professor $X |
| External collaborator $E | Group head $G |

All accounts inherit the normal QOS settings defined above.

Special QOS are defined for contributors (qos-$partition-$qosnumber) to the HPC platform, granting additional/exclusive accesses to the nodes. Please contact the HPC Team if you want to become a contributor.

To check the default account your jobs will use:

  • sacctmgr show user $USER format=user%20s,defaultaccount%30s

To see all account associations for your user and the QOS they grant access to:

  • sacctmgr list association where users=$USER format=account%30s,user%20s,qos%120s

Submitting jobs

Regular jobs

All user jobs should specify the desired QOS and the partition to submit to.

If no partition is set, the batch partition is used by default, but the corresponding qos-batch must still be specified.

Also by default, a user’s line manager’s account is used for accounting purposes, if not otherwise set. However, as inter-group collaborations are possible, additional relationships can be put in place allowing users to submit jobs under a different account. This specific configuration should be discussed with the HPC Team.

Users who belong to an account with a special QOS registered may not be able to access the global qos-$partition QOS and should use their specific qos-$partition-$qosnumber instead.
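
The QOS naming scheme described above can be captured in a small helper, which is convenient when the same launcher is submitted to several partitions. This is only a sketch: the function name is hypothetical, 001 stands in for a contributor QOS number, and launcher.sh is a placeholder. The command is printed rather than executed.

```shell
#!/bin/bash
# Sketch: derive the QOS name for a partition following the naming scheme
# qos-$partition, or qos-$partition-$qosnumber for contributor QOS.
qos_for_partition() {
  local partition="$1" qosnumber="${2:-}"
  if [ -n "$qosnumber" ]; then
    echo "qos-${partition}-${qosnumber}"
  else
    echo "qos-${partition}"
  fi
}

# Print the resulting submission commands for review.
echo "sbatch -p batch --qos=$(qos_for_partition batch) launcher.sh"
echo "sbatch -p long --qos=$(qos_for_partition long 001) launcher.sh"
```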

Best-effort jobs on iris

To start jobs as best-effort, users simply need to use the dedicated QOS qos-besteffort in place of the QOS associated with the selected partition or their contributor QOS. Note that only the interactive and batch partitions accept qos-besteffort.

The special qos-besteffort QOS is preemptible by all other QOS, but has the advantage of not having the limitations imposed on the other QOS, such as maximum number of nodes, walltime, etc.

Best-effort jobs can be set to be automatically requeued (use the --requeue parameter to sbatch) if preempted by regular jobs, but then users should ensure that repeated executions of their submitted workflow do not have unintended results (e.g. multiple copies of result files, removal of directories containing good results, etc.).
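
One way to make a requeued workflow safe to repeat is to skip steps whose results already exist and to publish a result file only once it is complete. The sketch below shows this pattern; the function name, the ".partial" suffix and the placeholder result are assumptions made for this example. SLURM_RESTART_COUNT is set by SLURM for jobs that have been requeued or restarted, and defaults to 0 here when unset.

```shell
#!/bin/bash
# Sketch: make a result-producing step safe to repeat after a requeue.
produce_result() {
  local out="$1"
  if [ -f "$out" ]; then
    echo "skip: $out already exists"
    return 0
  fi
  local tmp="${out}.partial"
  # ... the real computation would write its output to $tmp here ...
  echo "result" > "$tmp"
  mv "$tmp" "$out"   # rename only once the file is complete
  echo "done: $out"
}

echo "restart count: ${SLURM_RESTART_COUNT:-0}"
```

Calling produce_result twice with the same path performs the work once and reports a skip the second time, so a preempted and requeued job does not create duplicate or half-written result files.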

Users are encouraged to use the best-effort mechanism in order to take advantage of the HPC platform as much as possible. Many scientific applications natively support internal state saving and restart (check their documentation!), and there is also the option of system-level Checkpoint-Restart using DMTCP as described above.

Examples of launchers can be found here:

Software licenses on iris

The iris cluster has dedicated ARM Forge and Performance Reports licenses.

They have been configured within SLURM and you are required to request license tokens for them within your jobs that use these software tools.

Your jobs asking for more tokens than currently available will wait in the queue. To see the available tokens and license names you can simply use scontrol show lic.

Examples for using licenses with SLURM are in the dedicated examples page and the launchers page.

References