Slurm scheduler

Roar is a shared system used by many researchers, and it relies on Slurm for job scheduling. The Slurm scheduler is the cluster's resource manager, responsible for fairly and efficiently distributing compute resources (such as CPUs, memory, and GPUs) to all users. It acts as the system's workload manager, preventing conflicts and ensuring orderly access to the hardware.

When you submit a job, you specify the resources you need. Slurm then performs several key functions, such as job queueing, resource allocation, policy enforcement, execution, and monitoring.
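For example, a batch script is typically submitted with sbatch and its progress checked with squeue (the script name here is just a placeholder):

    sbatch myjob.sh     # submit the batch script to the scheduler
    squeue -u $USER     # list your queued and running jobs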

Resource directives

Resource directives specify how a job behaves and the resources to be allocated. Directives can define resources such as cores, memory, and time, and they can also set job options such as email alerts, job dependencies, and more.

They are required for all jobs, including both interactive and batch jobs.

Interactive jobs via the Portal also use resource directives, though they are often specified through the request forms.

The most common directives are:

Short option   Long option          Description
-J             --job-name           name the job
-A             --account            charge to an account
-p             --partition          request a partition
-N             --nodes              number of nodes
-n             --ntasks             number of tasks (cores)
NA             --ntasks-per-node    number of tasks per node
NA             --mem                memory per node
NA             --mem-per-cpu        memory per core
-t             --time               maximum run time
NA             --gres               GPU request
-C             --constraint         required node features
-e             --error              direct standard error to a file
-o             --output             direct standard output to a file
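As a sketch, a batch script preamble using several of these directives might look like the following (the account, partition, and resource values are placeholders, not recommendations):

    #!/bin/bash
    #SBATCH --job-name=myjob        # name the job
    #SBATCH --account=myaccount     # charge to an account (placeholder)
    #SBATCH --partition=open        # request a partition (placeholder)
    #SBATCH --nodes=1               # number of nodes
    #SBATCH --ntasks=4              # number of tasks (cores)
    #SBATCH --mem=8G                # memory per node
    #SBATCH --time=01:00:00         # maximum run time (HH:MM:SS)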

Custom output file names

By default, batch job standard output and standard error are both directed to slurm-%j.out (where %j is the job ID). Output and error filenames can be customized by using --output=filename to redirect output to a specified file. If only --output is specified, both standard output and standard error are directed to that file. Specifying --error=filename directs standard error to its own file.
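For example, the following directives (the filenames are illustrative) send standard output and standard error to separate files:

    #SBATCH --output=myjob.out     # standard output
    #SBATCH --error=myjob.err      # standard error, in its own file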

Specifying resource directives

You provide these directives at the top of a batch script using #SBATCH, or as options on the command line (e.g., with salloc or srun).
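As an illustration, the same request can be expressed either way (the values are arbitrary):

    # In a batch script:
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00

    # On the command line for an interactive job:
    salloc --nodes=1 --ntasks=4 --time=01:00:00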

On the portal, you can use resource directives to further customize your job requests.

Note on Tasks vs. Cores

For most jobs, you can think of one task as one CPU core. So, --ntasks=8 requests 8 cores.
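For example, the two requests below both ask for 8 cores; the second also controls how the tasks are spread across nodes (the numbers are illustrative):

    #SBATCH --ntasks=8                # 8 tasks, placed by Slurm

    # or, to control the layout explicitly:
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4       # 2 nodes x 4 tasks = 8 cores total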

Note on Memory

Be careful when requesting memory!

--mem=16G requests 16 GB of memory for the entire node.

--mem-per-cpu=4G requests 4 GB of memory for each core you've requested. If you requested 4 cores, this would total 16 GB.
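As a sketch, either of the following requests 16 GB in total for a 4-core job; use one style or the other, not both:

    # Per-node request:
    #SBATCH --ntasks=4
    #SBATCH --mem=16G

    # Equivalent per-core request:
    #SBATCH --ntasks=4
    #SBATCH --mem-per-cpu=4G      # 4 GB x 4 cores = 16 GB total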

Environment variables

Slurm defines environment variables within the scope of a job:

Environment Variable     Description
SLURM_JOB_ID             ID of the job
SLURM_JOB_NAME           Name of the job
SLURM_NNODES             Number of nodes
SLURM_NODELIST           List of nodes
SLURM_NTASKS             Total number of tasks
SLURM_NTASKS_PER_NODE    Number of tasks per node
SLURM_QUEUE              Queue (partition)
SLURM_SUBMIT_DIR         Directory of job submission

These can be used in your submit script to make it adapt dynamically to the resources allocated.
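For instance, a submit script might use them to report where and how the job is running:

    cd $SLURM_SUBMIT_DIR          # move to the directory the job was submitted from
    echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) is using $SLURM_NTASKS task(s) on node(s): $SLURM_NODELIST"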

Use $SLURM_NTASKS for parallel jobs

Using $SLURM_NTASKS to specify the number of parallel tasks in your submit script allows your job to adapt dynamically to the job's resource request without having to modify your script in multiple locations.
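For example, in a script that requests --ntasks=8, the launch line can refer to the same value (the program name is a placeholder):

    #SBATCH --ntasks=8

    srun -n $SLURM_NTASKS ./my_parallel_program    # launches 8 tasks; follows the request automatically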

Replacement symbols

Replacement symbols can be used in Slurm directives to build job names and filenames with information specific to the job being run:

Symbol   Description
%j       Job ID
%x       Job name
%u       Username
%N       Hostname where the job is running

For more information on Slurm directives, environment variables, and replacement symbols, see Slurm sbatch documentation for batch jobs and Slurm salloc documentation for interactive jobs.

Replacement symbols to create unique output files

Replacement symbols in your resource requests can be used to create unique output file names for each run of a job. For example, --output=myjob.%j.out creates a different output file for each job, with the Slurm job ID substituted for %j.
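A batch script preamble might combine several symbols to label output files by job name and ID (the job name is illustrative):

    #SBATCH --job-name=analysis
    #SBATCH --output=%x.%j.out     # e.g., analysis.12345.out
    #SBATCH --error=%x.%j.err      # e.g., analysis.12345.err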