The OAR Batch Scheduler
OAR is an open-source batch scheduler which provides simple yet flexible facilities to exploit a cluster. It manages cluster resources like a traditional batch scheduler (such as PBS/Torque/LSF/SGE), and is used in particular on the Grid’5000 platform. The official documentation is available here:
We provide a set of launcher scripts that may help your workflow.
Quick start
This small section is here to get you started as quickly as possible on the UL HPC platform.
You will find more in-depth details in the next sections, and you are encouraged to read them.
Using a computing node (server) interactively
- Get one computing core for one hour:
oarsub -I -l nodes=1/core=1,walltime=1:0:0
- Get one core for 5 minutes:
oarsub -I -l nodes=1/core=1,walltime=0:5:0
- Get two cores on the same node for one hour:
oarsub -I -l nodes=1/core=2,walltime=1:0:0
- Get two cores on different nodes for one hour:
oarsub -I -l nodes=2/core=1,walltime=1:0:0
- Get four cores (total) on two different nodes for one hour:
oarsub -I -l nodes=2/core=2,walltime=1:0:0
- Get 10 cores, possibly on different nodes (!), for one hour:
oarsub -I -l core=10,walltime=1
- Get 100 cores for one hour:
oarsub -I -l core=100,walltime=1
- Get all cores and memory (implicitly) of a node for one hour:
oarsub -I -l nodes=1,walltime=1
- Get all cores and memory (implicitly) of four nodes for 25 minutes:
oarsub -I -l nodes=4,walltime=0:25
- Get as many cores as possible on a node which has Xeon Haswell processors for one hour:
oarsub -I -l nodes=1/core=BEST,walltime=1 -p "cputype='xeon-haswell'"
- Get as many cores as possible on a node which has NVIDIA K80 GPUs for one hour:
oarsub -I -l nodes=1/core=BEST,walltime=1 -t gpu -p "gputype='K80'"
- Get a large memory machine (at least 1TB RAM) for 10 minutes:
oarsub -l nodes=1,walltime=0:10 -t bigmem
- Get a large memory machine with Xeon Haswell processors for 10 minutes:
oarsub -l nodes=1,walltime=0:10 -t bigmem -p "cputype='xeon-haswell'"
Notes:
- interactively means that as soon as your OAR job starts, your terminal will be connected to the first computing node associated with your OAR job.
- you close the interactive session with the `exit` command
- requesting e.g. 100 computing cores will only reserve them for you; your application still needs to be able to use them by implementing a parallelism model (!).
- you can connect (ssh) between computing nodes in your reservation with:
oarsh nodename
- job details are available from within the job itself in the `OAR_JOB_ID` and `OAR_NODEFILE` environment variables
Using a computing node in batch (unattended) mode
- Run `mycommand` from the current directory on one computing core for one hour:
oarsub -l nodes=1/core=1,walltime=1 ./mycommand
- Run `mycommand` from a given directory on 128 computing cores for 10 minutes:
oarsub -l core=128,walltime=0:10 /path/to/mycommand
- Run `mycommand parameter1 parameter2`, which is in your PATH (environment), on a computing node for one hour, giving it a name such that it can be easily identified later:
oarsub -n jobname -l nodes=1,walltime=1 "mycommand parameter1 parameter2"
Notes:
- your terminal will not be connected to the job when it starts (after job submission you are still on the cluster access node)
- you can connect to your running batch job with its known OAR job id with:
oarsub -C jobid
Minimal batch script examples:
- Start a job through a script from the current directory which contains all your requirements and processing commands:
oarsub -S ./myscript
Example `myscript`:
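A minimal sketch of such a script; the job name, resource request and commands are illustrative assumptions:

```shell
#!/bin/bash
#OAR -n basic-job
#OAR -l nodes=1/core=1,walltime=0:30:0
# Lines starting with #OAR are pragmas: plain comments for the shell, but
# read by oarsub when the script is submitted with: oarsub -S ./myscript

# $OAR_NODEFILE lists one line per reserved core; default to /dev/null so
# that the script can also be tested outside of an OAR job.
NODEFILE=${OAR_NODEFILE:-/dev/null}
NB_CORES=$(wc -l < "$NODEFILE")
echo "Job ${OAR_JOB_ID:-unknown}: ${NB_CORES} core(s) reserved"
# ... your processing commands go here ...
```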
- Example script which loads a specific software module and runs that software in a given directory, saving its standard output and standard error streams to different files:
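A sketch of such a script; the module name (`lang/R`) and the analysis command are assumptions, and the `-O`/`-E` pragmas send the standard output and error streams to different files:

```shell
#!/bin/bash
#OAR -n r-analysis
#OAR -l nodes=1/core=1,walltime=2:0:0
#OAR -O r-analysis-%jobid%.out
#OAR -E r-analysis-%jobid%.err

# Load the software environment. The 'module' command only exists on the
# cluster nodes, so skip it silently anywhere else.
if command -v module >/dev/null 2>&1; then
    module load lang/R          # hypothetical module name
fi

cd "${HOME}"                    # the chosen working directory
echo "starting analysis in $(pwd)"
# ... e.g.: Rscript myanalysis.R ...
```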
- Example script which runs MPI parallel software using Intel MPI (toolchain/ictce module containing also compilers and libraries) on 128 cores on Xeon Haswell CPUs:
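A sketch under the stated assumptions (the program name is hypothetical); Intel MPI's `mpirun` can take the machine list straight from the OAR node file:

```shell
#!/bin/bash
#OAR -n intel-mpi-job
#OAR -l core=128,walltime=4:0:0
#OAR -p cputype='xeon-haswell'

# toolchain/ictce provides the Intel compilers, MPI and libraries;
# the 'module' command only exists on the cluster nodes.
if command -v module >/dev/null 2>&1; then
    module load toolchain/ictce
fi

NODEFILE=${OAR_NODEFILE:-/dev/null}
NP=$(wc -l < "$NODEFILE")
# On the cluster, run the (hypothetical) program on all reserved cores:
MPI_CMD="mpirun -hostfile $NODEFILE -np $NP ./my_mpi_program"
echo "would run: $MPI_CMD"      # replace the echo with $MPI_CMD on the cluster
```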
- Example script which runs MPI parallel software using Open MPI (mpi/OpenMPI module) on 128 cores on Xeon Haswell CPUs:
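A sketch under the stated assumptions (the program name is hypothetical); note that Open MPI must be told to use `oarsh` instead of `ssh` to reach the other reserved nodes (here via the `plm_rsh_agent` MCA parameter, whose name may differ between Open MPI versions):

```shell
#!/bin/bash
#OAR -n open-mpi-job
#OAR -l core=128,walltime=4:0:0
#OAR -p cputype='xeon-haswell'

if command -v module >/dev/null 2>&1; then
    module load mpi/OpenMPI     # only available on the cluster nodes
fi

NODEFILE=${OAR_NODEFILE:-/dev/null}
# Open MPI reads the reserved hosts from the machinefile and must use oarsh
# as its remote-shell agent, since plain ssh between nodes is prohibited:
MPI_CMD="mpirun -machinefile $NODEFILE --mca plm_rsh_agent oarsh ./my_mpi_program"
echo "would run: $MPI_CMD"      # replace the echo with $MPI_CMD on the cluster
```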
Notes:
- Your script needs to be executable (`chmod +x myscript`) to be run in this way.
- The same options you use on the command line for `oarsub` can be provided inside the script (launched with `oarsub -S ./myscript`), prefixed by the `#OAR` pragma.
- There are many more useful options available in oarsub; check `man oarsub` to see the full listing.
Checking the status of your job and stopping a job
- Check the status of all jobs on the cluster:
oarstat
- Check the status of your jobs on the cluster:
oarstat -u
- Get the status of a specific job by its OAR job id:
oarstat -j jobid
- Get just the status of a specific job:
oarstat -s -j jobid
- Get the full details of a specific job:
oarstat -j jobid -f
- Get the full details of all your jobs:
oarstat -u -f
- Stopping or removing a job before it started:
oardel jobid
- Stopping or removing two jobs:
oardel jobid1 jobid2
Notes:
- The status is given in the S column of the `oarstat` output: W - Waiting, R - Running.
- A ‘Terminated’ state in the `oarstat -s -j jobid` output indicates that the last command in the job exited with a 0 return code.
- An ‘Error’ state in the `oarstat -s -j jobid` output indicates that the return code was different from 0 (which may not indicate a problem with the job itself).
Concepts
Reservation is handled on the front-end server by the `oarsub` command. For those who may not be familiar with batch scheduler vocabulary, the following definitions are provided to better understand the different OAR mechanisms:
- Submission: The system decides when your job begins, in order to optimize the global scheduling. If there is no available node, you may have to wait! (corresponds to the `oarsub -I` or `oarsub scriptName` syntaxes)
- Reservation: You decide when your job should begin, provided the node(s) will be available at that date. If you did not specify which node(s), the system chooses them for you. If the requested resources are not available at the date specified when the reservation is made, the reservation fails, and you must either change your resource request or change the job start date. At the start date, the reservation may only provide part of the resources you requested if some became unavailable in the meantime (because they broke down). (corresponds to the `oarsub -r` or `oarsub -r scriptName` syntaxes)
- Interactive: You just request some nodes, either by submission or reservation, and you then log in manually and work interactively. (corresponds to `oarsub -I` for submission, or `oarsub -r` followed by `oarsub -C jobId` for reservation)
- Passive: You point to a script that is automatically batched by the system; you don't need to log in to the nodes at all. (corresponds to `oarsub scriptName` for submission or `oarsub -r scriptName` for reservation)
- Types of job: there are basically two operating modes:
  - `default`: you just use the nodes' default environment, whatever the scheduling (reservation or submission, interactive or passive);
  - `best effort`: a special operating queue with lower priority, as explained below.
Job Type | Submission | Reservation |
---|---|---|
interactive | oarsub -I | oarsub -r ; oarsub -C jobId |
passive | oarsub scriptName | oarsub -r scriptName |
OAR provides the following features:
- A better resource management: using the Linux kernel feature called cpusets, OAR 2 allows a more reliable management of the resources. In particular:
  - no unattended processes should remain after a job completes - ever;
  - access to the resources is restricted to the de facto owner of the resources. Features like job dependency and checkpointing are now available, allowing better use of resources. A cpuset is attached to every process and allows:
    - specifying which processor/memory resources can be used by a process, i.e. the resources allocated to the job in the OAR context;
    - grouping and identifying processes that share the same cpuset, e.g. the processes of a job in the OAR context, so that actions like clean-up can be performed efficiently. Here, cpusets replace the process group/session concept, which is not efficient in Linux.
- Resources hierarchies: OAR can manage complex hierarchies of resources. Here, we use the following hierarchy: (1) nodes, (2) cpu, (3) core. You'll probably be interested only in requesting a given number of nodes or cores.
- A modern cluster management system: by providing a mechanism to isolate jobs at the core level, OAR is one of the most modern cluster management systems. Users developing cluster or grid algorithms and programs will thus work in an up-to-date environment, similar to the ones they will meet with other recent cluster management systems on production platforms.
- Optimization of the resources usage: nowadays, machines with more than 4 cores are common, so it is very important to handle cores efficiently. By providing resource selection and process isolation at the core level, OAR allows users running experiments that do not require exclusive access to a node (at least during a preparation phase) to use a single core on many nodes, leaving the remaining cores free for other users. This optimizes the number of available resources. Besides, OAR also provides a time-sharing feature which allows sharing the same set of resources among users; this is especially useful during demonstrations or events such as plugtests.
- Easier access to the resources: using the OAR `oarsh` connector to access the job resources, basic usage no longer requires the user to configure their SSH environment, as everything is handled internally (known host keys management, etc.). Besides, users who would prefer not to use `oarsh` can still use `ssh` at the cost of setting some options (one of the features of the `oarsh` wrapper is precisely to hide these options).
Job notion in OAR
In OAR, a job is defined by a number of required resources and, optionally, a script/program to run. The user must specify how many resources, and of what kind, their application needs; OAR then grants the request (or not) and controls the execution. When a job is launched, OAR executes the user program only on the first node of the reservation. The following environment variables are defined once a job is created, to characterize the reservation:
Variable | Description |
---|---|
$OAR_NODEFILE | contains the name of a file which lists all reserved nodes for this job |
$OAR_JOB_ID | contains the OAR job identifier |
$OAR_RESOURCE_PROPERTIES_FILE | contains the name of a file which lists all resources and their properties |
$OAR_JOB_NAME | name of the job given by the “-n” option of oarsub |
$OAR_PROJECT_NAME | job project name |
Submitting a job is done with the `oarsub` command. Mainly, you'll use this command in two ways:
- `oarsub [options] -I`: for an interactive job (see the glossary above);
- `oarsub [options] scriptName`: for a passive job executing the script `scriptName` (note that this script is only executed on the first reserved node).
The most useful options are the following (see `oarsub(1)` for more details):
- `-I, --interactive`: request an interactive job: open a login shell on the first node of the reservation instead of running a script.
- `-l, --resource=<list>`: set the requested resources for the job. You may specify the number of nodes, cpus and cores (separated by a slash ‘/’) and the walltime of the job, i.e. its duration. The walltime format is hour:mn:sec or hour:mn. Ex: `-l nodes=2/cpu=1/core=2,walltime=2:00:00` reserves 2 cores on 1 cpu of 2 nodes, for 2 hours.
- `-r, --reservation=<date>`: request a job start time reservation, instead of a direct submission.
- `-n, --name=<txt>`: specify an arbitrary name for the job.
- `--project=<txt>`: specify the name of the project the job belongs to.
- `-d, --directory=<dir>`: specify the directory in which to launch the command (default is the current directory).
- `--notify=<txt>`: specify a notification method (mail or command to execute). Ex: `--notify "mail:name@domain.com"` or `--notify "exec:/path/to/script args"`.
- `-O, --stdout=<file>`: specify the file that will store the standard output stream of the job (the `%jobid%` pattern is automatically replaced).
- `-E, --stderr=<file>`: specify the file that will store the standard error stream of the job (the `%jobid%` pattern is automatically replaced).
Once a job is launched, you can access the reserved resources through the `oarsh` command. Connections through `ssh` are prohibited.
Request (hierarchical) resources with oarsub
By default, if you execute oarsub without any parameters, you will request 1 computing core for 2 hours.
In order to request a specific amount of resources, you should use the -l
option of oarsub
and use a hierarchical reservation (characterized with the /
separator).
For instance, to reserve 1 core on 8 nodes for 4h, you can use:
oarsub -l nodes=8/core=1,walltime=4:00:00
…
Other examples follow and are probably self-explanatory:
# reserve 4 cores belonging to the same CPU (total: 4 cores)
$> oarsub -l cpu=1/core=4 ...
# 2 cores on 3 nodes (same enclosure) for 3h15: (total: 6 cores)
$> oarsub -I -l /enclosure=1/nodes=3/core=2,walltime=3:15
# 4 cores on a GPU node for 8 hours (Total: 4 cores)
$> oarsub -l /core=4,walltime=8 -t gpu
# 2 nodes among the h-cluster1-* nodes (Chaos only) (total: 24 cores)
$> oarsub -l nodes=2 -p "nodeclass='h'" ...
# 4 cores on 2 GPU nodes + 20 cores on other nodes (total: 28 cores)
$> oarsub -I -l "{gpu='YES'}/nodes=2/core=4+{gpu='NO'}/core=20"
Reservation of resources at a given time
You can use the -r
option of oarsub
to specify the date you wish the reservation to be issued.
The date format to pass to the -r option is: YYYY-MM-DD HH:MM:SS
For instance, the following command reserves 2 cores on 4 nodes (i.e. 8 cores) to launch the script myscript.sh at 23:30:
[16:55:06] hcartiaux@access(chaos-cluster) ~$> oarsub -l nodes=4/core=2 -r "2012-09-24 23:30:00" ./myscript.sh
[ADMISSION RULE] hcartiaux is granted the privilege to do unlimited reservations
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=147550
Reservation mode : waiting validation...
Reservation valid --> OK
Select nodes precisely with properties
In this case you should use what are called OAR properties, with the `-p` option.
The general syntax for this option is as follows: `oarsub -p "<property>='<value>'"`
You can combine different properties logically (with AND/OR etc). Ex:
oarsub -p "nodeclass='h' OR nodeclass='d'"
If you want to use a GPU node, use this command: oarsub -I -t gpu
If you want to use nodes from the `bigmem` class, try the following: `oarsub -I -t bigmem`
Likewise, for the `bigsmp` class, try: `oarsub -I -t bigsmp`
Please see below for details.
Global properties
Global Property | Description | Example |
---|---|---|
host | Full hostname of the resource | -p “host=’h-cluster1-14.chaos-cluster.uni.lux’” |
network_address | Short hostname of the resource | -p “network_address=’h-cluster1-14’” |
disktype | Type of disk (sas/sata/raid/ssd) | -p “disktype=’sas’” |
memnode / mem | RAM size available per node | -p “memnode=’24’” |
memcpu | RAM size available per CPU | -p “memcpu=’12’” |
memcore | RAM size available per core | -p “memcore=’2’” |
cpucore | Number of cores per CPU | -p “cpucore=’6’” |
cpufreq | Frequency of the processor | -p “cpufreq=’2.26’” |
enclosure | enclosure ID (same IB+Ethernet switch) | -p “enclosure=’1’” |
nodemodel | Node model name | -p “nodemodel=’Bull_B500’” |
gpu | GPU availability | -p “gpu=’YES’” |
gputype | GPU card model (M2070, M2090) | -p “gputype=’M2090’” |
gpuecc | GPU ECC feature (YES, NO) | -p “gpuecc=’YES’” |
os | Operating System of the host | -p “os=’debian8’” |
Chaos and Gaia properties
Chaos is heterogeneous; we therefore provide properties that permit the reservation of a homogeneous subset of nodes. Gaia is homogeneous, at least for the default job submissions.
Here is a summary of the most useful properties (you can see them on Monika for Chaos and Gaia):
property | Description | Example |
---|---|---|
nodeclass | Class of node i.e. sub-cluster considered | -p “nodeclass=’h’” |
room | Location of the node (server room), AS28 or CS43 | -p “room=’AS28’” |
Connecting to the reserved nodes
Assuming you have a job running, and therefore a set of resources reserved for you on the cluster, you can connect to the first reserved node using
oarsub -C <JOB_ID>
Then you can connect to the other reserved nodes using `oarsh`. Example:
[10:53][user@access:~]: oarsub -C 2802044
Connect to OAR job 2802044 via the node gaia-48
[10:53][user@gaia-48:~]: cat $OAR_NODEFILE
gaia-48
gaia-48
gaia-48
gaia-48
gaia-48
gaia-48
gaia-48
gaia-48
gaia-67
gaia-67
[10:53][user@gaia-48:~]: oarsh gaia-67
Warning: Permanently added '[gaia-67]:6667,[10.226.1.67]:6667' (RSA) to the list of known hosts.
Last login: Mon Feb 24 13:41:44 2014 from access.gaia-cluster.uni.lux
[10:53][user@gaia-67:~]: logout
Connection to gaia-67 closed.
Select bigsmp and bigmem nodes
Some nodes are very specific (the nodes with >= 1TB of memory, and the BCS computing node of Gaia with 160 cores in ccNUMA architecture), and can only be reserved with an explicit oarsub parameter: `-t bigmem` or `-t bigsmp`:
Cluster | Type | Node | # cores | Memory | Oarsub example |
---|---|---|---|---|---|
chaos | bigmem | r-cluster1-1 | 32 | 1024GB | oarsub -I -t bigmem |
gaia | bigsmp+bigmem | gaia-73 | 160 | 1024GB | oarsub -I -t bigsmp -p “network_address=’gaia-73’” |
gaia | bigmem | gaia-74 | 32 | 1024GB | oarsub -I -t bigmem --project project_biocore -p “network_address=’gaia-74’” |
gaia | bigsmp+bigmem | gaia-80 | 120 | 3072GB | oarsub -I -t bigsmp --project project_rues |
gaia | bigsmp+bigmem | gaia-81 | 160 | 4096GB | oarsub -I -t bigsmp --project project_sgi |
gaia | bigmem | gaia-183 | 64 | 2048GB | oarsub -I -t bigmem --project project_biocore -p “network_address=’gaia-183’” |
gaia | bigmem | gaia-184 | 64 | 2048GB | oarsub -I -t bigmem -p “network_address=’gaia-184’” |
Please use these facilities only if your jobs strictly require them; otherwise queueing time is increased for everyone.
Additionally, it is preferable to reserve the complete node with the parameter `-l nodes=1`, and adapt your workflow accordingly in order to exploit its full potential (exception: the bigmem/bigsmp class).
The `--project` parameter is required to access some of these special computing systems, as they are dedicated to a specific group that you must be a part of.
Select moonshot nodes (on Gaia)
Since 2015, the Gaia cluster includes HP Moonshot nodes that feature energy-efficient, low-power Xeon CPUs.
As these nodes have a specific configuration (4 cores/node, 10GbE networking and no Infiniband), they can only be reserved with an explicit oarsub parameter: -t moonshot
:
Cluster | Type | Node | # cores | Memory | Oarsub example |
---|---|---|---|---|---|
gaia | moonshot | moonshot1-[1-45] | 180 | 1440GB | oarsub -I -t moonshot |
gaia | moonshot | moonshot2-[1-45] | 180 | 1440GB | oarsub -I -t moonshot |
Container
With OAR, it is possible to execute jobs within another job. This functionality is called “container jobs”.
First, a job of type container must be submitted, for example:
hcartiaux@access(gaia-cluster) ~$> oarsub -I -t container -l nodes=3,walltime=2:10:00
OAR_JOB_ID=723303
Interactive mode : waiting...
Starting...
Connect to OAR job 723303 via the node gaia-12
Then it is possible to use the inner type to schedule the new jobs within the previously created container job:
hcartiaux@access(gaia-cluster) ~$> oarsub -I -t inner=723303 -l core=16
OAR_JOB_ID=723557
Interactive mode : waiting...
Starting...
Connect to OAR job 723557 via the node gaia-11
Note that an inner job cannot be a reservation (i.e. it cannot overlap the container reservation).
‘besteffort’ versus ‘default’
By default, your jobs go to the `default` queue, meaning they all have equivalent priority.
You can also decide to create so-called best-effort jobs, which are scheduled in the besteffort queue. Their particularity is that they are deleted if another, non-besteffort job wants the resources where they are running.
Here is an example of a simple oarsub command which submits a besteffort job: `oarsub -t besteffort /path/to/prog`
For example, you can use this feature to maximize the use of the cluster with multiparametric jobs. When you submit a job, use the `-t besteffort` option of `oarsub` to specify that this is a besteffort job. Best-effort jobs are interesting because their associated constraints (walltime and maximum number of active jobs per user) are more relaxed than those of regular jobs.
They are summarized below.
Job Type | Max Walltime (hour) | Max #active_jobs | Max #active_jobs_per_user |
---|---|---|---|
default | 120:00:00 | 30000 | 50 |
besteffort | 9000:00:00 | 10000 | 1000 |
Important: a besteffort job cannot be a reservation.
If your job is of type besteffort and idempotent (oarsub `-t` option) and is killed by the OAR scheduler, then another job is automatically created and scheduled with the same configuration. Additionally, your job is also resubmitted if the exit code of your program is 99. This is an extremely useful facility for jobs that can be restarted, and provides clear advantages for some workflows.
Consequently, besteffort jobs allow you to cut your computation into small slots and exceed the policy restrictions for default jobs, without disturbing the workflows of other users. Idempotent jobs will be resubmitted indefinitely until their completion.
This workflow assumes that you implement the needed changes in your program, or launcher scripts, and that you tolerate a loss of cpu time in some cases.
Here is an example of an oarsub command which submits a besteffort / idempotent job: `oarsub -t besteffort -t idempotent /path/to/prog`
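The resubmit-on-exit-99 mechanism can be sketched as follows; the state file and step counts are illustrative assumptions, and the work is wrapped in a function whose return code the real launcher would pass to `exit`:

```shell
#!/bin/bash
# Sketch of a besteffort + idempotent job: do one slice of work per run,
# then exit 99 so that OAR resubmits the job until everything is done.
STATE_FILE=$(mktemp)            # in a real job: a file in your work directory
TOTAL_STEPS=3

run_slot() {
    local done_steps=0
    [ -s "$STATE_FILE" ] && done_steps=$(cat "$STATE_FILE")
    if [ "$done_steps" -ge "$TOTAL_STEPS" ]; then
        echo "computation complete"
        return 0                # normal end: OAR stops resubmitting
    fi
    # ... one slice of real work would go here ...
    echo $((done_steps + 1)) > "$STATE_FILE"
    return 99                   # in the real script: exit 99 => resubmission
}
# A real launcher would end with:  run_slot; exit $?
```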
Note: if you are a member of the `besteffortusers` group on the cluster, then ALL your jobs will by default be of type besteffort, and you will be notified by OAR as follows:
yourlogin@access ~> oarsub [...]
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
[ADMISSION RULE] !!!! WARNING !!!
[ADMISSION RULE] !!!! AS AN EXTERNAL USER, YOU HAVE BEEN AUTOMATICALLY !!!
[ADMISSION RULE] !!!! REDIRECTED TO THE BEST-EFFORT QUEUE !!!
[ADMISSION RULE] !!!! YOUR JOB MAYBE KILLED WITHOUT NOTICE !!!
Scheduling priority (karma)
The OAR batch scheduler internally uses a karma value to determine the priority of user jobs.
Assuming that:
- user_requested: number of (cores x hours) requested by the user over the last 30 days
- all_requested: number of (cores x hours) requested by all the users over the last 30 days
- user_used: number of (cores x hours) used by the user over the last 30 days
- all_used: number of (cores x hours) used by all the users over the last 30 days
Karma = 2 x user_used / all_used + user_requested / all_requested
The requested values correspond to the walltimes the user specifies on the oarsub
command line.
The used values correspond to the actual timespan of the job, i.e. end_time - start_time.
If a job uses the full walltime, then used is the same as asked, otherwise used < asked.
Important: when scheduling the jobs, OAR will favour jobs with a lower karma.
Practically, this means that:
- low usage in the last 30 days => low karma => more priority
- high usage in the last 30 days => high karma => less priority
- if a user asks for a walltime much longer than he/she actually uses, his/her karma will be higher than that of a different user with the same usage but who has correctly set the job specification
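As a worked example with made-up usage figures:

```shell
# Made-up figures over the last 30 days: the user requested 1000 core-hours
# out of 50000 requested in total, and used 600 core-hours out of 40000
# used in total.
karma=$(awk 'BEGIN {
    user_requested = 1000; all_requested = 50000
    user_used      = 600;  all_used      = 40000
    printf "%.3f", 2 * user_used / all_used + user_requested / all_requested
}')
echo "karma = $karma"    # 2*0.015 + 0.02 = 0.050
```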
Checkpointing
Definition from Wikipedia: “Checkpointing is a technique for inserting fault tolerance into computing systems. It basically consists of storing a snapshot of the current application state, and later on, using it to restart the execution in case of failure.”
Checkpointing your job enables the following features:
- The job can be stopped/restarted at will
- The job can survive scheduled or unscheduled downtimes
- The job can overcome queue time limits (e.g. 1, 2 or 10 days become irrelevant: 500-hour jobs are no problem!)
- The job minimizes its waiting time in the queue, since it asks for fewer resources (in multiple batches). Finally, if you have jobs that get killed because they reach the walltime limit, which you cannot always forecast in advance, you can overcome that problem too, in the most elegant way.
In fact, if your jobs run for more than 1 day, the “social” way to do HPC involves checkpointing; we understand that users often run code developed by third parties, so they can't do much about it. But did you ask the software developers about this feature? Kindly do so at the first opportunity, to increase the quality of your work.
OAR integration
The workflow described above can be implemented by combining several OAR features:
- besteffort jobs, described in the previous section;
- idempotent: if your process returns an exit code equal to 99, your job will be resubmitted with the same parameters;
- the checkpoint parameter: enables the checkpointing mechanism, and specifies the time in seconds before the end of the walltime at which a signal is sent to the first process of the job;
- the signal parameter: specifies which signal to use when checkpointing (default is SIGUSR2).
Example
Example: oarsub --checkpoint 600 --signal 12 -t besteffort -t idempotent /path/to/prog
This job will be sent the signal SIGUSR2 (12) 600 seconds before its walltime ends. If the program then returns the exit code 99, it will be resubmitted. Note that if OAR kills a best-effort job in order to schedule a default job, no signal is sent.
Your program, which will probably be a launcher, can trap the checkpointing signal, and implement a “checkpoint - restart” feature in a few lines of code. You can read these examples of launchers written in bash (you will probably have to adapt them to your case):
- launcher_besteffort.sh, a simple example which just forwards the OAR checkpointing signal to the process with the `trap` and `kill` commands, and resubmits itself;
- launcher_checkpoint_restart.sh, which uses BLCR (Berkeley Lab Checkpoint/Restart) to store the context of the processes in a file and restart them from their saved state.
In these two examples, the OAR parameters are given in the header of the script, so you can submit them directly with the `-S` parameter.
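The trap-and-forward idea of the first launcher can be sketched like this; the payload program is a hypothetical stand-in, and the real launcher_besteffort.sh differs in its details:

```shell
#!/bin/bash
#OAR -t besteffort
#OAR -t idempotent
#OAR --checkpoint 600
# Sketch: catch the OAR checkpoint signal (SIGUSR2 by default), forward it
# to the payload so it can save its state, then exit 99 for resubmission.

CHECKPOINT_REQUESTED=0
on_checkpoint() {
    CHECKPOINT_REQUESTED=1
    if [ -n "$PROG_PID" ]; then
        kill -s USR2 "$PROG_PID" 2>/dev/null || true
    fi
}
trap on_checkpoint USR2

sleep 1 &                       # stand-in for: ./my_restartable_program &
PROG_PID=$!
wait "$PROG_PID" || true        # wait may be interrupted by the signal

# A real launcher would end with:
#   [ "$CHECKPOINT_REQUESTED" -eq 1 ] && exit 99
echo "checkpoint requested: $CHECKPOINT_REQUESTED"
```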
If you are unfamiliar with the signal mechanisms of Unix, this could be an easy start: wikipedia
Statistics with oarstat
- You can visualize all the submitted jobs with the `oarstat` command:

[17:06:10] hcartiaux@access(gaia-cluster) ~$> oarstat
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
600321     node maintenan fgeorgatos     2012-09-19 13:36:53 R default
715116     P50_cont_0     sdorosz        2012-09-24 08:22:22 R default
715117     P50_cont_1     sdorosz        2012-09-24 08:22:23 R default
715118     P50_cont_2     sdorosz        2012-09-24 08:22:23 R default
715119     P50_cont_3     sdorosz        2012-09-24 08:22:23 R default
...
- View the details with the `-f` parameter: `oarstat -f`
- Select a specific job with the `-j` parameter, followed by its job ID:

[17:11:22] hcartiaux@access(gaia-cluster) ~$> oarstat -f -j 600321
Job_Id: 600321
    job_array_id = 600321
    job_array_index = 1
    name = node maintenance
    project = default
    owner = fgeorgatos
    state = Running
    wanted_resources = -l "{type = 'default'}/host=1/core=12,walltime=168:0:0"
    assigned_resources = 397+398+399+400+401+402+403+404+405+406+407+408
    assigned_hostnames = gaia-34
    queue = default
    command = /bin/sleep 600000
...
- View the status of a job with the `-s` parameter:

oarstat -s -j 600321
600321: Running
- View all jobs submitted by a user with the `-u` parameter:

[17:13:35] hcartiaux@access(gaia-cluster) ~$> oarstat -u fgeorgatos
Job id     Name           User           Submission Date     S Queue
---------- -------------- -------------- ------------------- - ----------
600321     node maintenan fgeorgatos     2012-09-19 13:36:53 R default
Visualization tools for cluster activity
OAR comes with two monitoring tools, each of them installed on the cluster front-end:
- Monika is a web interface which monitors batch scheduler reservations. It displays a concise view of the current cluster state, with all active and waiting jobs.
- Draw OAR gantt creates a Gantt chart showing the distribution of jobs on nodes over time. It is very useful for seeing past cluster occupation, and for knowing when a job will be launched in the future.
Typical example of job submission
Default Interactive job: oarsub -I
By default, 1 core is reserved and the default walltime is 2h (the maximum walltime is set to 12 hours for interactive jobs). Each job receives an id, stored in `$OAR_JOB_ID` on the first reserved node.
[14:47:26] svarrette@access ~> oarsub -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
Generate a job key...
OAR_JOB_ID=76715
Interactive mode : waiting...
Starting...
Connect to OAR job 76715 via the node d-cluster1-9
Use of d-cluster1-9 :
14:49:24 up 10 days, 22:27, 1 user, load average: 3.00, 3.00, 3.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
oar pts/1 cluster1a.chaos. 14:49 0.00s 0.00s 0.00s sshd: oar [priv]
TTY = /dev/pts/1 , TERM = xterm-color, no DISPLAY
[14:49:24] svarrette@d-cluster1-9 ~> echo $OAR_NODEFILE
/var/lib/oar/76715
[14:51:03] svarrette@d-cluster1-9 ~> cat $OAR_NODEFILE
d-cluster1-9.chaos.lu
[14:51:10] svarrette@d-cluster1-9 ~> echo $OAR_JOB_ID
76715
Of course, you can customize the walltime of your reservation (ex: 8h), the number of cores/nodes, etc. Ex:
oarsub -I -l core=2,walltime=8
Use this type of submission when you want to compile and/or check a given aspect of your program/script.
From the front-end, you can check the current jobs associated to your login by issuing:
oarstat [-f] -u yourlogin
You can also connect to one of the reserved nodes using the `oarsh` utility as follows:
OAR_JOB_ID=<jobid> oarsh <nodename>
Ex: OAR_JOB_ID=76715 oarsh d-cluster1-9
Any other attempt to connect will fail (using `ssh`, or `oarsh` without a job ID):
[15:20:00] svarrette@access ~> ssh d-cluster1-9
********************************************************
/!\\ WARNING: Direct login by ssh is forbidden.
Use oarsub(1) to reserve nodes, and oarsh(1) to connect to your reserved nodes,
typically by:
OAR_JOB_ID=<jobid> oarsh <nodename>
User doc: https://hpc.uni.lu/tiki-index.php?page=User+Documentation
********************************************************
Connection closed by 192.168.200.59
[15:20:05] svarrette@access ~> oarsh d-cluster1-9
oarsh: Cannot connect. Please set either a job id or a job key in your
oarsh: environment using the OAR_JOB_ID or the OAR_JOB_KEY_FILE variable.
[15:24:00] svarrette@access ~>
Once you’ve finished, just execute ‘CTRL+D’ or ‘logout’ to leave the reservation.
Default Passive job: oarsub scriptname
Once you have ensured your program works correctly in interactive mode, it is time to list the commands you want to run in a script whose name is given to `oarsub`.
This script is executed on the first reserved node once the resources are attributed. This also means the script has access to the OAR environment variables (`$OAR_NODEFILE`, etc.).
You will probably end up in one of the following cases:
- you want to execute an instance of the same sequential program `myprog` on the allocated resources; typically, each execution receives a different parameter, and you benefit from the many available cores.
- you want to run a truly parallel program written with a parallel library such as OpenMP/MPI/CUDA/OpenCL.
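For the first case, a minimal launcher sketch could look as follows; `myprog` is a hypothetical stand-in for your real sequential program, and the ULHPC launcher templates cover more robust variants:

```shell
#!/bin/bash
# Hypothetical launcher sketch: run one instance of a sequential program
# per reserved core, each with a different parameter.

myprog() {                      # stand-in for your real sequential program
    echo "processing $1"
}

NODEFILE=${OAR_NODEFILE:-/dev/null}
NB_CORES=$(wc -l < "$NODEFILE")
[ "$NB_CORES" -gt 0 ] || NB_CORES=2    # fallback outside of an OAR job

PARAMS="alpha beta gamma delta"        # one run per parameter
i=0
for p in $PARAMS; do
    myprog "$p" &                      # launch in the background...
    i=$((i + 1))
    [ $((i % NB_CORES)) -eq 0 ] && wait   # ...at most NB_CORES at once
done
wait                                   # wait for the last batch
```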
We have set up a GitHub repository (ULHPC) with launcher script templates you can draw inspiration from; they are meant to cover the main workflows met so far on the cluster. You can also contribute to the repository by proposing your own launcher scripts.
For debugging reasons, you are requested to ALWAYS try your scripts in interactive mode prior to their invocation in passive mode. At the end of the run, two files are created in the current directory:
- OAR.%jobid%.stdout for the standard output produced during script execution;
- OAR.%jobid%.stderr for the error output produced during script execution.
You can change the names of those files through the -O (stdout) and -E (stderr) options. Another interesting option is --notify, which helps to notify you of the end of the script (typically by mail).
Note: the job will end (or be killed) whenever one of the following event happens first:
- the script execution ends (successfully or otherwise);
- the walltime expires.
IMPORTANT: to prevent filling the storage space with unnecessary files, always remember to clean up, i.e. remove, the OAR log files as soon as possible.
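One way to perform such a cleanup is sketched below; the temporary directory and dummy log files only demonstrate the filename pattern being matched, and the age threshold mentioned in the comment is an arbitrary assumption:

```shell
# Cleanup sketch: remove OAR.<jobid>.stdout / OAR.<jobid>.stderr log files.
# A temporary directory with dummy logs stands in for your working directory.
WORKDIR=$(mktemp -d)
touch "$WORKDIR/OAR.1234.stdout" "$WORKDIR/OAR.1234.stderr"
# delete the matching logs (add e.g. '-mtime +7' to keep week-old files)
find "$WORKDIR" -name 'OAR.*.std*' -delete
ls "$WORKDIR" | wc -l   # the directory is empty again
```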
Submission constraints / limitations
Depending on the type of job, you will face the following constraints:
Job Type | Max Walltime | Max #active_jobs | Max #active_jobs_per_user |
---|---|---|---|
interactive | 12:00:00 | 10000 | 5 |
default | 120:00:00 (5 days) | 30000 | 50 |
besteffort | 9000:00:00 (375 days) | 10000 | 1000 |
If you (really) need to run jobs that require more than 3 days of computation:
- ask yourself if you really exploit all the parallel resources offered to you (e.g. see if GNU Parallel can help speed up your computation);
- try to use besteffort jobs;
- retry to use besteffort jobs;
- really try to use besteffort jobs ;)
- consider buying dedicated hardware.
- in a very few (and well-justified) cases, we can define dedicated projects that have individual (and independent) constraints. More precisely, for each project name, a new OAR property for_name: YES/NO is created, together with an LDAP group project_name. This property is set to YES on dedicated resources, so that users who are members of the project_name group are granted the use of the "oarsub --project project_name" syntax to create jobs limited to the constraints of the project.
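The GNU Parallel suggestion above can be sketched as follows. `./myprog` and the parameter range are hypothetical; the runnable line uses `xargs -P` with a stub command, which behaves similarly to `parallel` for simple parameter sweeps:

```shell
# Parameter sweep: run one instance of the same sequential program per input.
# With GNU parallel (if installed on the node), one would typically write:
#   seq 1 100 | parallel -j "$(nproc)" ./myprog {}
# A portable equivalent with xargs, runnable as-is with an echo stub:
seq 1 4 | xargs -P 2 -I{} sh -c 'echo "processed {}"' | sort
```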
OAR API
The OAR REST API allows you to interact with OAR over HTTP using a REST library. Most of the operations usually done with the OAR Unix commands can be performed through this API from your favourite language.
The OAR REST API is installed on the cluster front-ends (access-chaos.uni.lu and access-gaia.uni.lu) and is available at this URL: https://localhost/oarapi/
For more information, refer to the official documentation.
- Get the information corresponding to a job id:

14:37:41 hcartiaux@access(chaos-cluster) ~ $ curl -k "https://localhost/oarapi/jobs/1174702.yaml"
---
api_timestamp: 1421148323
array_id: 1174702
array_index: 1
command: ''
cpuset_name: hcartiaux_1174702
...
start_time: 1421148317
state: Terminated
stderr_file: OAR.1174702.stderr
stdout_file: OAR.1174702.stdout
stop_time: 1421148321
submission_time: 1421148315
type: INTERACTIVE
types:
- interactive
walltime: 7200
wanted_resources: "-l \"{type = 'default'}/core=1,walltime=2:0:0\" "
- Get the list of nodes used by a job:

14:37:41 hcartiaux@access(chaos-cluster) ~ $ curl -k "https://localhost/oarapi/jobs/1174702/nodes.yaml"
---
api_timestamp: 1421156272
items:
- api_timestamp: 1421156272
  links:
  - href: /oarapi/resources/nodes/e-cluster1-13
    rel: self
  network_address: e-cluster1-13
  status: assigned
links:
- href: /oarapi/jobs/1174702/nodes.yaml
  rel: self
offset: 0
total: 1
- List the existing resources:

14:38:37 hcartiaux@access(chaos-cluster) ~ $ curl -k 'https://localhost/oarapi/resources.yaml?structure=simple' | head
---
api_timestamp: 1421156321
items:
- api_timestamp: 1421156321
  available_upto: 0
  id: 1
  links:
  - href: /oarapi/resources/nodes/k-cluster1-1
    rel: member
    title: node
...
- Submit a job:

14:38:42 hcartiaux@access(chaos-cluster) ~ $ curl -k -X POST https://localhost/oarapi/jobs.yaml -d 'resources=core=1&command=sleep 60&name=Test'
---
api_timestamp: 1421156412
cmd_output: |
  [ADMISSION RULE] Set default walltime to 7200.
  [ADMISSION RULE] Modify resource description with type constraints
  OAR_JOB_ID=1174742
id: 1174742
links:
- href: /oarapi/jobs/1174742
  rel: self
- Send the checkpoint signal to a running job:

15:12:51 hcartiaux@access(chaos-cluster) ~ $ curl -k -X POST https://localhost/oarapi/jobs/1174756/checkpoints/new.yaml
---
api_timestamp: 1421158380
cmd_output: |
  Checkpointing the job 1174756 ...DONE.
  The job 1174756 was notified to checkpoint itself on e-cluster1-13.
id: 1174756
links:
- href: /oarapi/jobs/1174756
  rel: self
status: Checkpoint request registered
- Delete a job:

curl -k -X POST https://localhost/oarapi/jobs/1174754/deletions/new.yaml
---
api_timestamp: 1421157682
cmd_output: |
  Deleting the job = 1174754 ...REGISTERED.
  The job(s) [ 1174754 ] will be deleted in a near future.
id: 1174754
links:
- href: /oarapi/jobs/1174754
  rel: self
status: Delete request registered
- Example using Ruby and rest-client (we submit a simple job and load its detailed information into a Hash):

16:42:55 hcartiaux@access(chaos-cluster) ~ $ restclient https://localhost/oarapi
irb(main):001:0> require 'pp'
=> true
irb(main):002:0> result = YAML.load(post('jobs.yaml', {:command => "sleep 60"}))
=> {"cmd_output"=>"[ADMISSION RULE] Set default walltime to 7200. ...
irb(main):003:0> pp(result)
{"cmd_output"=>
  "[ADMISSION RULE] Set default walltime to 7200.\n[ADMISSION RULE] Modify resource description with type constraints\nOAR_JOB_ID=1174761\n",
 "id"=>1174761,
 "api_timestamp"=>1421163831,
 "links"=>[{"href"=>"/oarapi/jobs/1174761", "rel"=>"self"}]}
=> nil
irb(main):004:0> result = YAML.load(get('jobs/1174761.yaml'))
=> {"types"=>[], "start_time"=>1421163832, ...
irb(main):005:0> pp(result)
{"types"=>[],
 "start_time"=>1421163832,
 "properties"=>"(bigmem='NO' AND bigsmp='NO') AND dedicated='NO'",
 "scheduled_start"=>nil,
 "dependencies"=>[],
 "resubmit_job_id"=>0,
 "reservation"=>"None",
 "exit_code"=>0,
 "command"=>"sleep 60",
 "stop_time"=>1421163893,
 "owner"=>"hcartiaux",
 .........
 "type"=>"PASSIVE",
 "stdout_file"=>"OAR.1174761.stdout",
 "array_id"=>1174761}
=> nil
irb(main):006:0>
Troubleshooting
- Have a look at the FAQ or Report a problem