Jobs with Slurm

The Roar computing clusters are shared computational resources. To perform computationally intensive tasks, users must request compute resources and be granted access to them. This request/provision process allows the work of many users to be scheduled and carried out efficiently while avoiding resource contention. Roar uses Slurm as its job scheduler and resource manager. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters, and it is used by many other HPC systems as well. Its primary functions are to

  • Allocate access to compute resources to users for some duration of time
  • Provide a framework for starting, executing, and monitoring work on the set of allocated compute resources
  • Arbitrate contention for resources by managing a queue of pending work

Be kind to the submit nodes

Submit nodes provide vital access to the cluster. Do not perform computationally intensive tasks on submit nodes. For interactive work, request an Interactive Job first.
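
As a sketch, an interactive session can be requested from a submit node with salloc; the resource values below are placeholders and should be adjusted to your workload and allocation:

$ salloc --nodes=1 --ntasks=4 --mem=8G --time=1:00:00

When the resources are granted, an interactive shell opens inside the allocation; exiting the shell releases the resources.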

Slurm Resource Directives

Resource directives are used to specify compute resources when submitting a job request to the scheduler. They allow you to indicate the run time, amount of memory, and the number of cores your job will use. They are required when launching Interactive Jobs and Batch Jobs through the command line interface.

Interactive Jobs launched through the Roar Portal also use resource directives, which are set with the drop-down menus in the portal interface.

The following table lists some of the most commonly used resource directives.

Short option   Long option          Description
-J             --job-name           Specify a name for the job
-A             --account            Charge resources used by a job to a specified account
-p             --partition          Request a partition for the resource allocation
-N             --nodes              Request a number of nodes
-n             --ntasks             Request a number of tasks
NA             --ntasks-per-node    Request a number of tasks per allocated node
NA             --mem                Specify the amount of memory required per node
NA             --mem-per-cpu        Specify the amount of memory required per CPU
-t             --time               Set a limit on the total run time
-C             --constraint         Specify any required node features (only available to paid account holders)
-e             --error              Connect script's standard error to a non-default file
-o             --output             Connect script's standard output to a non-default file
NA             --requeue            Specify that the batch job should be eligible for requeuing
NA             --exclusive          Require exclusive use of nodes reserved for job
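
As an illustration, a minimal batch script might combine several of these directives; the account, partition, and resource values below are placeholders and should be replaced with ones valid for your allocation:

#!/bin/bash
#SBATCH --job-name=my-analysis    # job name (-J)
#SBATCH --account=open            # account to charge (-A); placeholder
#SBATCH --partition=open          # partition to use (-p); placeholder
#SBATCH --nodes=1                 # number of nodes (-N)
#SBATCH --ntasks=4                # number of tasks (-n)
#SBATCH --mem=8G                  # memory per node
#SBATCH --time=2:00:00            # wall-time limit (-t)

echo "Running on $(hostname)"     # replace with the actual work of the job

Such a script is submitted with the sbatch command.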

Details on additional directives and job options can be found in the Slurm sbatch documentation for batch jobs and the Slurm salloc documentation for interactive jobs.

Replacement Symbols

Replacement symbols can be used inside Slurm directives to populate values with dynamic, per-job content, such as the job ID or the hostname of the node the job is running on.

Symbol   Description
%j       Job ID
%x       Job name
%u       Username
%N       Hostname where the job is running
%A       Job array's master job allocation number
%a       Job array ID (index) number

Example: By default, both standard output and standard error are directed to the same file, named slurm-%j.out, where %j is replaced by the job ID. The output and error filenames can be customized using the symbols in the table above.
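
For instance, the directives below (a sketch; the file names are arbitrary) use the job name and job ID to build per-job output and error files:

#SBATCH --output=%x-%j.out    # e.g. my-analysis-123456.out
#SBATCH --error=%x-%j.err     # e.g. my-analysis-123456.err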

Additional details on Replacement Symbols can be found in the Slurm sbatch documentation for batch jobs and the Slurm salloc documentation for interactive jobs.

Environment Variables

Slurm sets a number of environment variables within the scope of a job, and referencing these variables in job scripts can be beneficial in many cases.

Environment Variable    Description
SLURM_JOB_ID            ID of the job
SLURM_JOB_NAME          Name of the job
SLURM_NNODES            Number of nodes
SLURM_NODELIST          List of allocated nodes
SLURM_NTASKS            Total number of tasks
SLURM_NTASKS_PER_NODE   Number of tasks per node
SLURM_JOB_PARTITION     Partition (queue) of the job
SLURM_SUBMIT_DIR        Directory from which the job was submitted
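
As a sketch, a batch script might reference some of these variables so that its behavior follows the allocation; the program name below is a placeholder:

# Inside a batch script, after the #SBATCH directives
cd "$SLURM_SUBMIT_DIR"          # start from the directory the job was submitted from
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on: $SLURM_NODELIST"
srun --ntasks="$SLURM_NTASKS" ./my_program    # my_program is a placeholder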

Additional details on environment variables used by Slurm can be found in the Slurm sbatch documentation for batch jobs and the Slurm salloc documentation for interactive jobs.

Hardware details

sinfo

The Slurm command sinfo displays information about all Roar Collab nodes. Its output is easier to read with some formatting options:

sinfo --Format=features:30,nodelist:20,cpus:5,memory:10,gres:30

An example output of the sinfo command would look like:

$ sinfo --Format=features:30,nodelist:20,cpus:5,memory:10,gres:30
AVAIL_FEATURES                NODELIST            CPUS MEMORY    GRES
standard,a100,cascadelake     p-gc-[3001-3035,303848   380000    gpu:a100:2(S:0-11,36-47)
standard,a100_3g,mig,cascadelap-gc-3036           48   380000    gpu:a100_3g:4(S:0-11,36-47)
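
Note that in the sample output, values wider than the specified column width are truncated, which is why some fields run together. The AVAIL_FEATURES column lists node features that can be requested with the -C/--constraint directive described above. As an example, using a feature name from the sample output (and assuming your allocation permits constraints):

#SBATCH --constraint=a100    # request nodes tagged with the a100 feature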

Details on additional sinfo options can be found in the Slurm sinfo documentation.

Node attributes

Roar Collab contains a wide variety of different hardware configurations. To find out specifics about the hardware on different nodes, there are several helpful tools.

  • lscpu: displays information about the CPU and its capabilities
  • nvidia-smi: displays information about the GPU, if one is present

Using these commands within your jobs can provide details on the hardware they are running on.
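
For example, a job script could include lines like the following (a sketch) to record the hardware of the assigned node in the job's output:

# Record details about the node this job landed on
lscpu | head -n 20                          # CPU model, core counts, cache sizes
nvidia-smi || echo "No GPU on this node"    # falls back to a message if no GPU or driver is available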