
Partition (queue), node and license status

  • Show queued jobs, show more details (‘long’ view that includes the job time limit):
squeue
squeue -l
  • Show only the queued jobs of your user ($USER is an environment variable in your shell), then for another specific user:
squeue -u $USER
squeue -u vplugaru
  • Show queued jobs in a specific partition:
squeue -p $partition
  • Show queued jobs that are in a specific state (pending / running / failed / preempted, see man squeue for all available states):
squeue -t PD
squeue -t R
squeue -t F
squeue -t PR
  • Show partition status, summarized status (without node state), and node-oriented partition status:
sinfo
sinfo -s
sinfo -N
  • Show node details including available features (to be used with the -C option of sbatch/srun):
sinfo -l -N
  • Show node reservations that have been created by the administrators for specific users or accounts:
sinfo -T
  • Show node details (all nodes, specific node):
scontrol show nodes
scontrol show nodes $nodename
  • Check the default account your jobs will use:
sacctmgr show user $USER format=user%20s,defaultaccount%30s
  • See all account associations for your user and the QOS they grant access to (these can be selected explicitly at submission time, see the last item of this list):
sacctmgr list association where users=$USER format=account%30s,user%20s,qos%120s
  • See configured licenses and their status (#tokens used and free):
scontrol show licenses
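  • The default account and QOS reported above can also be selected explicitly when submitting jobs (see the next section); a minimal sketch, where myproject is a placeholder account name and qos-batch-001 stands for one of the QOS listed for your associations:
sbatch -A myproject --qos qos-batch-001 -p batch job.sh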

Job submission and management

Starting interactive jobs

  • Start an interactive job with the default number of cores and walltime:
srun -p interactive --pty bash -i
  • Start an interactive job for 30 minutes, with 2 nodes and 4 tasks per node:
srun -p interactive --time=0:30:0 -N 2 --ntasks-per-node=4 --pty bash -i
  • Start an interactive job with X11 forwarding such that GUI applications (running in the cluster) will be shown on your workstation:
    • note that your initial connection to the iris cluster needs to have X11 forwarding enabled, e.g. ssh -X iris-cluster (a ~/.ssh/config alternative is sketched at the end of this list)
srun -p interactive --pty --x11 bash -i
  • Start a best-effort interactive job (can be interrupted by regular jobs if other users submit them):
srun -p interactive --qos qos-besteffort --pty bash -i
  • Start an interactive job asking for 8 Allinea Forge (DDT/MAP) licenses:
srun -p interactive -L forge:8 --pty bash -i
  • Start an interactive job asking for 8 Allinea Forge (DDT/MAP) licenses and 16 Allinea Performance Reports licenses:
srun -p interactive -L forge:8,perfreport:16 --pty bash -i
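  • As an alternative to passing -X on every connection for the X11 example above, forwarding can be enabled permanently in ~/.ssh/config; a minimal sketch, assuming you already have a Host entry named iris-cluster for your cluster access:
Host iris-cluster
    ForwardX11 yes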

Note:

  • To make interactive jobs easier to launch, add to your ~/.bashrc:
    • alias si='srun -p interactive --pty bash -i' or e.g.
    • alias si='srun -p interactive --time=0:30:0 --pty bash -i'
  • Users that are part of groups with access to a dedicated QOS should explicitly use their specific QOS (e.g. --qos qos-interactive-001)

Submitting passive jobs

We maintain a page dedicated to examples of SLURM batch (launcher) scripts that you can use for your batch jobs.
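
As a quick reference, a minimal launcher could look like the following sketch (the resource values and the final command are placeholders to adapt to your workload):

#!/bin/bash -l
# -l requests a login shell, so the usual environment (e.g. modules) is typically available
#SBATCH --job-name=myjob
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00

# load your software environment here (e.g. with module load), then run the actual work
srun hostname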

  • Submit to the queue a job script (job launcher) in which you’ve added SLURM directives (#SBATCH $directive) with the job specification (name, number of requested nodes, memory, walltime, etc.):
sbatch job.sh
  • Submit a job script, overriding on the command line the number of requested nodes:
sbatch -N 2 job.sh
  • Submit a job script to the batch partition:
sbatch -p batch job.sh
  • Submit a job script to the long partition that permits a long walltime:
sbatch -p long job.sh
  • Submit a job script to the batch partition, requesting only nodes with Broadwell CPUs:
sbatch -p batch -C broadwell job.sh
  • Submit a job script to the batch partition, requesting only nodes with Skylake (AVX-512 ISA) CPUs:
sbatch -p batch -C skylake job.sh
  • Submit a job script to the gpu partition, requesting 2 cores and 2 GPUs on a single node (the same request expressed as #SBATCH directives is sketched at the end of this list):
sbatch -N 1 -n 2 --gpus=2 -p gpu job.sh
  • Submit a job script to the gpu partition, requesting 2 cores and 2 GPUs on a single node, each GPU with 32GB on-board memory:
sbatch -N 1 -n 2 --gpus=2 -C volta32 -p gpu job.sh
  • Submit a job script to the gpu partition, requesting 4 nodes with 2 cores/node and 4 GPUs/node:
sbatch -N 4 --ntasks-per-node=2 --gpus-per-node=4 -p gpu job.sh
  • Submit a job script to the bigmem partition, requesting 64 tasks (with 1 core/task) and 2TB of RAM on a single node:
sbatch -N 1 -n 64 --mem=2T -p bigmem job.sh
  • Submit a job script to the bigmem partition, requesting the full node (112 cores and all associated RAM, ~3TB):
sbatch -N 1 -n 112 -p bigmem job.sh
  • Submit a job script and request a specific start time:
    1. current day at a precise hour
    2. relative to a moment in time: now, today and tomorrow are recognized keywords, and can be combined with seconds (the default), minutes, hours, days or weeks as time units
    3. relative to a moment in time combining time specifications
    4. specific date and hour
sbatch --begin=16:00 job.sh
sbatch --begin=tomorrow job.sh
sbatch --begin=now+2hours job.sh
sbatch --begin=2017-06-23T07:30:00 job.sh
  • Submit a best-effort job to the batch partition (can be interrupted by regular jobs if other users submit them):
sbatch -p batch --qos qos-besteffort job.sh
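  • The command-line options shown above can equally be written as #SBATCH directives inside the launcher submitted with sbatch job.sh; a sketch equivalent to the single-node 2-core/2-GPU request (values are illustrative):
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gpus=2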

Note:

  • Users that are part of groups with access to a dedicated QOS should explicitly use their specific QOS (e.g. --qos qos-batch-001, --qos qos-long-001)

Collecting job information

  • Show the details of a job:
scontrol show job $jobid
  • Check waiting job priority (detailed view):
sprio -l
  • Check expected job start time:
squeue --start -u $USER
  • Show running job (and steps) system-level utilization (memory, I/O, energy):
    • note that sstat information is limited to your own jobs
sstat -j $jobid
  • Show specific statistics from a running job (and steps) or multiple jobs:
    • use sstat -e to see a list of available output fields
sstat -j $jobid --format=AveCPU,AveRSS,AveVMSize,MaxRSS,MaxVMSize
sstat -j $jobid1,$jobid2 --format=AveCPU,AveRSS,AveVMSize,MaxRSS,MaxVMSize
  • Output the statistics in a parseable format, delimited by | (with, then without trailing |):
sstat -p -j $jobid --format=AveCPU,AveRSS,AveVMSize,MaxRSS,MaxVMSize
sstat -P -j $jobid --format=AveCPU,AveRSS,AveVMSize,MaxRSS,MaxVMSize
  • Show running or completed job (and steps) system-level utilization from the accounting information, and with full details:
sacct -j $jobid
sacct -j $jobid -l
  • Show statistics relevant to the job allocation itself (not taking steps into consideration), then with more details:
sacct -X -j $jobid
sacct -X -j $jobid -l
  • Show a subset of interesting statistics from a completed job and its steps, including:
    1. elapsed time in both human readable and total # of seconds
    2. maximum resident set size of all tasks in the job (you may also want to add maxrssnode and maxrsstask to better understand which process consumed memory)
    3. maximum virtual memory size (idem for maxvmsizenode and maxvmsizetask)
    4. consumed energy (in Joules); be aware there are many caveats:
      • your job needs to be the only one running on the corresponding compute nodes
      • the RAPL mechanism will not take into account all possible hardware elements which consume power (CPUs, GPUs and DRAM are included)
sacct -j $jobid --format=account,user,jobid,jobname,partition,state,elapsed,elapsedraw,start,end,maxrss,maxvmsize,consumedenergy,consumedenergyraw,nnodes,ncpus,nodelist
  • Output the same statistics in the parseable |-delimited format, for a single job and then for multiple jobs (see the end of this list for a pretty-printing tip):
    • use sacct -e to see a list of available output fields
sacct -p -j $jobid --format=account,user,jobid,jobname,partition,state,elapsed,elapsedraw,start,end,maxrss,maxvmsize,consumedenergy,consumedenergyraw,nnodes,ncpus,nodelist
sacct -p -j $jobid1,$jobid2 --format=account,user,jobid,jobname,partition,state,elapsed,elapsedraw,start,end,maxrss,maxvmsize,consumedenergy,consumedenergyraw,nnodes,ncpus,nodelist
  • Show statistics for all personal jobs started since a particular date, then without job steps:
sacct --starttime 2017-05-01 -u vplugaru
sacct -X --starttime 2017-05-01 -u vplugaru
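  • A possible way to render the |-delimited output as an aligned table is to pipe it through the standard column utility (the format fields below are illustrative):
sacct -P -j $jobid --format=jobid,jobname,state,elapsed,maxrss | column -t -s '|'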

Pausing, resuming and cancelling jobs

  • To hold a waiting job so that it is not scheduled, and later release it so it can be scheduled again:
scontrol hold $jobid
scontrol release $jobid
  • To pause a running job and then resume it:
scontrol suspend $jobid
scontrol resume $jobid
  • To remove a job from the queue (stopping it if already started):
scancel $jobid
  • To remove a job by name:
scancel --name=$jobname
scancel -n $jobname
  • To remove all user jobs (see the end of this list for a variant that asks for confirmation):
scancel --user=$USER
scancel -u $USER
  • To remove all waiting jobs (pending state) for a given user:
scancel --user=$USER --state=pending
scancel -u $USER -t pending
  • To remove all waiting jobs (pending state) in a given partition (e.g. batch):
scancel -u $USER --partition=batch --state=pending
scancel -u $USER -p batch -t pending
  • To stop and restart a given job:
scontrol requeue $jobid
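  • When removing several jobs at once, scancel's interactive mode can ask for confirmation before each cancellation:
scancel -i -u $USER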