HPC @ Uni.lu

High Performance Computing in Luxembourg

This website is deprecated; the old pages are kept online, but please refer primarily to the new website hpc.uni.lu and the new technical documentation site hpc-docs.uni.lu.

The following sections showcase different batch scripts you can use as launchers for your applications.

While they show the most common utilization patterns, many more possibilities exist, so you are encouraged to check out the reference documentation links.

You will normally submit them to the queue using sbatch $scriptname, with any options given on the sbatch command line overriding the equivalent options set within the script.
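For example, assuming your launcher is saved as launcher.sh (a placeholder name):

# submit the launcher as written
sbatch launcher.sh

# submit the same launcher, overriding the walltime requested inside the script
sbatch --time=0-00:10:00 launcher.sh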

Basic launchers

  • Request one core for 5 minutes in the batch queue and print a message:
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0-00:05:00
#SBATCH -p batch
#SBATCH --qos=normal

echo "Hello from the batch queue on node ${SLURM_NODELIST}"
# Your more useful application can be started below!
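By default, the job's output (such as the echo above) is written to a file named slurm-<jobid>.out in the directory you submitted from. If you prefer a custom name, you can add an output directive to any of these launchers, for example:

#SBATCH -o %x-%j.out    # write output to <jobname>-<jobid>.out instead of the default slurm-<jobid>.out

(the application-specific launchers further below use this pattern with a .log suffix)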
  • Request two cores on each of two nodes with Skylake CPUs, for 3 hours and print some messages:
    • note the -C option, which adds a constraint so the job runs only on nodes with the skylake feature
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH --time=0-03:00:00
#SBATCH -p batch
#SBATCH --qos=normal
#SBATCH -C skylake

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!
  • Give the job a name and request emails to be sent when it finishes (whether or not it ends with a success exit status):
#!/bin/bash -l
#SBATCH -J MyTestJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH --time=0-03:00:00
#SBATCH -p batch
#SBATCH --qos=normal

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!
  • Submit an array job that will create 10 jobs, for parametric executions:
#!/bin/bash -l
#SBATCH -J MyTestJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0-01:00:00
#SBATCH --array=0-9
#SBATCH -p batch
#SBATCH --qos=normal

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}, Task ID: ${SLURM_ARRAY_TASK_ID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Run your application as a job step,  passing its unique array id
# (based on which varying processing can be done)
srun /path/to/your/application $SLURM_ARRAY_TASK_ID
  • Submit an array job, passing a custom value to each application execution and limiting the number of tasks running simultaneously to 3:
#!/bin/bash -l
#SBATCH -J MyTestJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0-01:00:00
#SBATCH --array=0-9%3
#SBATCH -p batch
#SBATCH --qos=normal

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}, Task ID: ${SLURM_ARRAY_TASK_ID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Run your application, passing it a custom value.
# Careful: the number of values has to match the number of array tasks!
VALUES=(2 3 5 7 11 13 17 19 23 29)
srun /path/to/your/application ${VALUES[$SLURM_ARRAY_TASK_ID]}
  • Request one core and half the memory available on an iris cluster regular node for one day (e.g. for sequential code requiring a lot of memory):
#!/bin/bash -l
#SBATCH -J MyLargeMemorySequentialJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=64GB
#SBATCH --time=1-00:00:00
#SBATCH -p batch
#SBATCH --qos=normal

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!
  • Run a single-core, 3-day-long job in the dedicated long partition:
#!/bin/bash -l
#SBATCH -J MyLongJob
#SBATCH --mail-type=all
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=3-00:00:00
#SBATCH -p long
#SBATCH --qos=long

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!
  • Run a best-effort job on a single node:
#!/bin/bash -l
#SBATCH -J MyRerunnableJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH --ntasks-per-node=28
#SBATCH --time=0-12:00:00
#SBATCH -p batch
#SBATCH --qos=besteffort
#SBATCH --requeue

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!

GPU accelerated nodes

  • Run a job on a single node, requesting 2 GPUs:
    • the job also books 2 tasks, each with 7 cores, giving access to effectively half of the compute node in terms of computing power and memory
#!/bin/bash -l
#SBATCH -J MyGPUJob
#SBATCH --mail-type=all
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -c 7
#SBATCH --gpus=2
#SBATCH --time=12:00:00
#SBATCH -p gpu

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
nvidia-smi
  • Run a job on two nodes, requesting 8 GPUs (4 per node), each with 32GB of on-board HBM2 memory:
    • the job books the two nodes completely by allocating one task per GPU and 7 cores per task (28 cores on each node)
#!/bin/bash -l
#SBATCH -J MyGPUJob
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 7
#SBATCH --gpus-per-node=4
#SBATCH -C volta32
#SBATCH --time=1-0:0:0
#SBATCH -p gpu

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!

Large memory nodes

  • Request one task with 64 cores and 2TB of RAM for a multithreaded application on a large memory node:
    • note that additional cores may be reserved for the job in order to preserve the Memory-per-Core ratio (~26GB RAM/core on large memory and GPU nodes); for example, a 2TB request at that ratio corresponds to roughly 79 cores, so more cores than the 64 requested here may end up allocated
#!/bin/bash -l
#SBATCH -J MyVeryLargeMemoryJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 64
#SBATCH --mem=2TB
#SBATCH --time=1-00:00:00
#SBATCH -p bigmem

echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
# Your more useful application can be started below!

Advanced launchers (for parallel code)

  • Single node, threaded (pthreads/OpenMP) application launcher, using all 28 cores of an iris cluster node:
    • --ntasks-per-node=1 and -c 28 are taken into account only if using srun to launch your application
#!/bin/bash -l
#SBATCH -J ThreadedJob
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 28
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun /path/to/your/threaded.app
  • Single node, multi-core parallel application (MATLAB, Python, R, etc.) launcher, using all 28 cores of an iris cluster node:
    • --ntasks-per-node=28 and -c 1 are taken into account only if using srun to launch your application
    • here we configure srun to start a single instance of MATLAB and disable process pinning (task affinity), otherwise any parallel workers started from MATLAB would be pinned to the first core (thus oversubscribing it)
#!/bin/bash -l
#SBATCH -J SingleNodeParallelJob
#SBATCH -N 1
#SBATCH --ntasks-per-node=28
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal

module load base/MATLAB
srun -n 1 --cpu-bind=no matlab -nodisplay -nosplash < /path/to/your/inputfile > /path/to/your/outputfile
  • Multi-node parallel application IntelMPI launcher, using 128 distributed cores:
#!/bin/bash -l
#SBATCH -J ParallelJob
#SBATCH -n 128
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal

module load toolchain/intel
srun -n $SLURM_NTASKS /path/to/your/intel-toolchain-compiled-application
  • Multi-node parallel application OpenMPI launcher, using 128 distributed cores:
#!/bin/bash -l
#SBATCH -J ParallelJob
#SBATCH -n 128
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal

module load toolchain/foss
srun -n $SLURM_NTASKS /path/to/your/foss-toolchain-compiled-application
  • Multi-node hybrid application IntelMPI+OpenMP launcher, using 28 threads per node on 10 nodes (280 cores):
#!/bin/bash -l
#SBATCH -J HybridParallelJob
#SBATCH -N 10
#SBATCH --ntasks-per-node=1
#SBATCH -c 28
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal

module load toolchain/intel
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun -n $SLURM_NTASKS /path/to/your/parallel-hybrid-app

Parallel launchers with license requests

  • Multi-node parallel application IntelMPI launcher, using 56 distributed cores and requesting 56 Allinea Performance Reports licenses:
    • you can check how many licenses are available in total for forge (Allinea Forge: DDT/MAP) and perfreport (Allinea Performance Reports) with scontrol show lic
    • the launcher below runs your application under the performance report tool with its default options (it will generate text/HTML reports)
#!/bin/bash -l
#SBATCH -J ProfilingJob
#SBATCH -n 56
#SBATCH -c 1
#SBATCH --time=0-01:00:00
#SBATCH -p batch
#SBATCH --qos=normal
#SBATCH -L perfreport:56

module load toolchain/intel
module load tools/AllineaReports
perf-report srun -n $SLURM_NTASKS /path/to/your/intel-toolchain-compiled-application

Checkpoint-restart using DMTCP

The following batch scripts are adapted from the official DMTCP launchers.

  • Launcher running a user application under DMTCP, with periodic or manual checkpointing:
    • to be used when there is no previous checkpoint of your job
    • by default there is no automatic checkpointing; to enable it, uncomment the -i option on the start_coordinator line and set the required interval in seconds (e.g. -i 3600 for a checkpoint every hour)
    • for manual checkpointing you would use dmtcp_command -c (see the full options with dmtcp_command --help); a short example is given after the launcher below
#!/bin/bash -l
#SBATCH -J MyCheckpointedJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 4
#SBATCH --ntasks-per-node=28
#SBATCH --time=0-06:00:00
#SBATCH -p batch
#SBATCH --qos=normal

#----------------------------- Set up DMTCP environment for a job ------------#

###############################################################################
# Start DMTCP coordinator on the launching node. Free TCP port is automatically
# allocated.  This function creates a dmtcp_command.$JOBID script, which serves
# as a wrapper around dmtcp_command.  The script tunes dmtcp_command for the
# exact dmtcp_coordinator (its hostname and port).  Instead of typing
# "dmtcp_command -h <coordinator hostname> -p <coordinator port> <command>",
# you just type "dmtcp_command.$JOBID <command>" and talk to the coordinator
# for JOBID job.
###############################################################################

start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=`hostname`

    check_coordinator=`which dmtcp_coordinator`
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 1
    fi

    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1

    while true; do
        if [ -f "$fname" ]; then
            p=`cat $fname`
            if [ -n "$p" ]; then
                # coordinator port obtained; optionally verify it responds with: dmtcp_command -p $p --list
                break
            fi
        fi
    done

    # Create dmtcp_command wrapper for easy communication with the coordinator.
    p=`cat $fname`
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p

}

#----------------------- Some routine steps and information output -------------------------#

###################################################################################
# Print out the SLURM job information.  Remove this if you don't need it.
###################################################################################

echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
echo "working directory = "$SLURM_SUBMIT_DIR

# change to the submit directory
cd $SLURM_SUBMIT_DIR

#----------------------------------- Set up job environment ------------------#

###############################################################################
# Load all necessary modules or export PATH/LD_LIBRARY_PATH/etc here.
###############################################################################

module load tools/DMTCP

## If you use the FOSS toolchain (GCC, OpenMPI, etc.) uncomment the line below
# module load toolchain/foss

## If you use the Intel toolchain (compilers, MKL, IntelMPI) uncomment below
# module load toolchain/intel

## Add other modules below

#------------------------------------- Launch application ---------------------#

################################################################################
# 1. Start DMTCP coordinator - for periodic checkpointing uncomment `-i` below
################################################################################

start_coordinator # -i 3600 # ... <other dmtcp coordinator options here>

################################################################################
# 2. Launch application
################################################################################

## If your application uses IntelMPI or OpenMPI, uncomment below:
#srun -n $SLURM_NTASKS dmtcp_launch --rm /path/to/your/mpi-application

## If your application uses another MPI implementation, try:
#dmtcp_launch --rm mpirun -n $SLURM_NTASKS /path/to/your/mpi-application

## Non-MPI application checkpointing, if you use one of the MPI implementations
## above, then comment the line below:
dmtcp_launch /path/to/your/non-mpi-application
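While such a job is running, you can talk to its coordinator from the submit directory through the dmtcp_command.$JOBID wrapper generated above. A minimal sketch, using 123456 as a placeholder for the actual job ID:

# query the coordinator status and list the connected processes
./dmtcp_command.123456 --status
./dmtcp_command.123456 --list

# trigger a manual checkpoint of the whole job
./dmtcp_command.123456 --checkpoint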
  • Launcher that restarts an application from a DMTCP checkpoint:
    • to be started in the same directory as the initial launch; it relies on the dmtcp_restart_script.sh generated at that step
#!/bin/bash -l
#SBATCH -J MyCheckpointedJob
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=Your.Email@Address.lu
#SBATCH -N 4
#SBATCH --ntasks-per-node=28
#SBATCH --time=0-06:00:00
#SBATCH -p batch
#SBATCH --qos=normal

#----------------------------- Set up DMTCP environment for a job ------------#

###############################################################################
# Start DMTCP coordinator on the launching node. Free TCP port is automatically
# allocated.  This function creates a dmtcp_command.$JOBID script, which serves
# as a wrapper around dmtcp_command.  The script tunes dmtcp_command for the
# exact dmtcp_coordinator (its hostname and port).  Instead of typing
# "dmtcp_command -h <coordinator hostname> -p <coordinator port> <command>",
# you just type "dmtcp_command.$JOBID <command>" and talk to the coordinator
# for JOBID job.
###############################################################################

start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=`hostname`

    check_coordinator=`which dmtcp_coordinator`
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 1
    fi

    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1

    while true; do
        if [ -f "$fname" ]; then
            p=`cat $fname`
            if [ -n "$p" ]; then
                # coordinator port obtained; optionally verify it responds with: dmtcp_command -p $p --list
                break
            fi
        fi
    done

    # Create a dmtcp_command wrapper for easy communication with the coordinator.
    p=`cat $fname`
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p

}

#----------------------- Some routine steps and information output -------------------------#

###################################################################################
# Print out the SLURM job information.  Remove this if you don't need it.
###################################################################################

echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
echo "working directory = "$SLURM_SUBMIT_DIR

# change to the submit directory
cd $SLURM_SUBMIT_DIR

#----------------------------------- Set up job environment ------------------#

###############################################################################
# Load all necessary modules or export PATH/LD_LIBRARY_PATH/etc here.
###############################################################################

module load tools/DMTCP

## If you use the FOSS toolchain (GCC, OpenMPI, etc.) uncomment the line below
# module load toolchain/foss

## If you use the Intel toolchain (compilers, MKL, IntelMPI) uncomment below
# module load toolchain/intel

## Add other modules below

#------------------------------------- Launch application ---------------------#

################################################################################
# 1. Start DMTCP coordinator - for periodic checkpointing uncomment `-i` below
################################################################################

start_coordinator # -i 3600 # ... <other dmtcp coordinator options here>

################################################################################
# 2. Restart checkpointed application from DMTCP created restart script
################################################################################

/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT

Application-specific launchers

ABAQUS

  • Launcher starting ABAQUS in distributed mode on 2 complete iris nodes
#!/bin/bash -l
#SBATCH -J AbaqusTest
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH -p batch
#SBATCH --qos normal
#SBATCH -o %x-%j.log

### Load latest available ABAQUS
module load cae/ABAQUS

### Configure environment variables, need to unset SLURM's Global Task ID for ABAQUS's PlatformMPI to work
unset SLURM_GTIDS

### Create ABAQUS environment file for current job, you can set/add your own options (Python syntax)
env_file=abaqus_v6.env

cat << EOF > ${env_file}
#verbose = 3
#ask_delete = OFF
mp_file_system = (SHARED, LOCAL)
EOF

node_list=$(scontrol show hostname ${SLURM_NODELIST} | sort -u)

mp_host_list="["
for host in ${node_list}; do
    mp_host_list="${mp_host_list}['$host', ${SLURM_CPUS_ON_NODE}],"
done

mp_host_list=$(echo ${mp_host_list} | sed -e "s/,$/]/")

echo "mp_host_list=${mp_host_list}"  >> ${env_file}

### Set input file and job (file prefix) name here
job_name=${SLURM_JOB_NAME}
input_file=your_input_file.inp

### ABAQUS parallel execution
abaqus job=${job_name} input=${input_file} cpus=${SLURM_NTASKS} standard_parallel=all mp_mode=mpi interactive
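For reference, with the two 28-core nodes requested above and iris-001/iris-002 as hypothetical hostnames, the loop above would append the following line to abaqus_v6.env:

mp_host_list=[['iris-001', 28],['iris-002', 28]]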

Apache Spark

  • Launcher starting Apache Spark with distributed workers
    • 3 nodes are used, providing 83 cores across 3 workers (one core less on the node hosting both the master and a worker)
    • all of the node memory is used; you may need to tune the internal Spark settings for your use case
    • user code is started in the USER CODE EXECUTION section (this launcher starts the Pi example)
    • two output files are created:
      • one log file with the Spark daemons' output (SparkJob-12345.log)
      • one output file with the result of the user code execution (SparkJob-12345.out)
#!/bin/bash -l
#SBATCH -J SparkJob
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28
#SBATCH --time=00:15:00
#SBATCH -p batch
#SBATCH --qos normal
#SBATCH -o %x-%j.log

### Load latest available Spark
module load devel/Spark

### If you do not wish tmp dirs to be cleaned
### at the job end, set below to 0
export SPARK_CLEAN_TEMP=1

### START INTERNAL CONFIGURATION

## CPU and Memory settings
export SPARK_WORKER_CORES=${SLURM_CPUS_PER_TASK}
export DAEMON_MEM=4096
export NODE_MEM=$((4096*${SLURM_CPUS_PER_TASK}-${DAEMON_MEM}))
export SPARK_DAEMON_MEMORY=${DAEMON_MEM}m
export SPARK_NODE_MEM=${NODE_MEM}m

## Set up job directories and environment variables
export SPARK_JOB="$HOME/spark-jobs/${SLURM_JOBID}"
mkdir -p "${SPARK_JOB}"

export SPARK_HOME=$EBROOTSPARK
export SPARK_WORKER_DIR=${SPARK_JOB}
export SPARK_LOCAL_DIRS=${SPARK_JOB}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=9080
export SPARK_SLAVE_WEBUI_PORT=9081
export SPARK_INNER_LAUNCHER=${SPARK_JOB}/spark-start-all.sh
export SPARK_MASTER_FILE=${SPARK_JOB}/spark_master

export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"

export SPARK_SUBMIT_OPTIONS="--conf spark.executor.memory=${SPARK_NODE_MEM} --conf spark.python.worker.memory=${SPARK_NODE_MEM}"

## Generate spark starter-script
cat << 'EOF' > ${SPARK_INNER_LAUNCHER}
#!/bin/bash
## Load configuration and environment
source "$SPARK_HOME/sbin/spark-config.sh"
source "$SPARK_HOME/bin/load-spark-env.sh"
if [[ ${SLURM_PROCID} -eq 0 ]]; then
    ## Start master in background
    export SPARK_MASTER_HOST=$(hostname)
    MASTER_NODE=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)

    echo "spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}" > "${SPARK_MASTER_FILE}"

    "${SPARK_HOME}/bin/spark-class" org.apache.spark.deploy.master.Master \
        --ip $SPARK_MASTER_HOST                                           \
        --port $SPARK_MASTER_PORT                                         \
        --webui-port $SPARK_MASTER_WEBUI_PORT &

    ## Start one slave with one less core than the others on this node
    export SPARK_WORKER_CORES=$((${SLURM_CPUS_PER_TASK}-1))
    "${SPARK_HOME}/bin/spark-class" org.apache.spark.deploy.worker.Worker \
       --webui-port ${SPARK_SLAVE_WEBUI_PORT}                             \
       spark://${MASTER_NODE}:${SPARK_MASTER_PORT} &

    ## Wait for background tasks to complete
    wait
else
    ## Start (pure) slave
    MASTER_NODE=spark://$(scontrol show hostname $SLURM_NODELIST | head -n 1):${SPARK_MASTER_PORT}
    "${SPARK_HOME}/bin/spark-class" org.apache.spark.deploy.worker.Worker \
       --webui-port ${SPARK_SLAVE_WEBUI_PORT}                             \
       ${MASTER_NODE}
fi
EOF
chmod +x ${SPARK_INNER_LAUNCHER}

## Launch SPARK and wait for it to start
srun ${SPARK_INNER_LAUNCHER} &
while [ -z "$MASTER" ]; do
	sleep 5
	MASTER=$(cat "${SPARK_MASTER_FILE}")
done
### END OF INTERNAL CONFIGURATION

### USER CODE EXECUTION
OUTPUTFILE=${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out
spark-submit ${SPARK_SUBMIT_OPTIONS} --master $MASTER $SPARK_HOME/examples/src/main/python/pi.py 1000 > ${OUTPUTFILE}

### FINAL CLEANUP
if [[ -n "${SPARK_CLEAN_TEMP}" && ${SPARK_CLEAN_TEMP} -eq 1 ]]; then
    echo "====== Cleaning up: SPARK_CLEAN_TEMP=${SPARK_CLEAN_TEMP}"
    rm -rf ${SPARK_JOB}
else
    echo "====== Not cleaning up: SPARK_CLEAN_TEMP=${SPARK_CLEAN_TEMP}"
fi