// Slurm on ADAPT

In order to fairly distribute user jobs across shared resources, some of our VM clusters on ADAPT are equipped with Slurm. With Slurm, users can run both interactive and non-interactive jobs on specified resources without having to worry about interference from other user workloads. When resources aren’t readily available, Slurm will also queue your jobs so that, as resources become available, your jobs can gain their allocation without requiring you to monitor the state of the cluster and queue. This guide will help you get started working with Slurm on ADAPT, and further information can be found in the official Slurm documentation. If you have further questions or are experiencing issues with Slurm on the cluster, please contact NCCS User Support by opening a ticket.

Commands

There are three ways you can gain access to resources with Slurm:

  • sbatch: sbatch is used to submit job scripts. These jobs will run non-interactively and by default will log output to a file.
  • srun: srun is used to launch parallel tasks on allocated resources and is typically run from within a job script.
  • salloc: salloc can be used to allocate resources to execute a command and is typically used for interactive Slurm jobs.
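
For example, each of these might be invoked as follows (the script and program names below are placeholders for your own):

    sbatch my_job.sh         # submit a job script to run non-interactively
    srun -n4 ./my_program    # launch 4 parallel tasks, typically from within a job script
    salloc -c2 --mem=10G     # request an interactive allocation with 2 cores and 10 GB of memory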

There are also commands you can use to monitor the status of the cluster, the queue, and your jobs:

  • squeue: Displays the queue and the status of submitted jobs.
  • sinfo: Shows the nodes in the cluster and their availability.
  • scancel: Allows you to cancel your jobs before they complete.
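
As an illustration, a typical monitoring workflow might look like the following (the job ID shown is only an example):

    squeue -u $USER    # list only your own jobs and their states
    sinfo              # show nodes and their availability
    scancel 12345      # cancel the job with ID 12345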

For more information regarding use, all of these commands can be run with the "--help" flag or with "man CommandName".
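
For example:

    sbatch --help
    man srun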

Common Job Flags

Although the behavior of these flags may differ slightly between the job commands, these are a few of the most common (a combined example follows the list).

  • -t <Time>: Time limit for your allocation. Acceptable formats include ‘-t <minutes>’, ‘-t HH:MM:SS’, and ‘-t D-HH:MM:SS’.
  • -c <CPUs>: Number of CPU cores to allocate per task.
  • --mem <Memory>: Minimum amount of memory to allocate per node.
  • -G <GPUs>: Total number of GPUs to allocate. Note that most of our systems in ADAPT don’t have GPUs available. For more information on our GPU resources on ADAPT, check out the PRISM cluster.
  • -N <Nodes>: Number of nodes to allocate resources on.
  • -n <Tasks>: The number of concurrent tasks to run.
  • -J <Job Name>: Set the name of your job.
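
As a sketch, several of these flags can be combined in a single request; the job name and resource amounts below are arbitrary and should be adjusted to what your work actually needs:

    salloc -J MyInteractiveJob -t 02:00:00 -N1 -n1 -c4 --mem=16G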

Examples

There are two types of jobs you will run through Slurm: interactive and non-interactive. Here are some examples that you can use to get started submitting work to the scheduler.

  • Interactive Jobs: A job that runs a shell as opposed to a script or other command. By default, ‘salloc’ will start a shell on the allocated resources:

    salloc -c2 --mem=10G

  • Non-Interactive Jobs: A job for running scripts or other commands. An example script, submit.sh, may look like the following:

    #!/bin/bash

    # Using ‘#SBATCH’ will allow you to set sbatch flags within your submission script.
    # The following will allocate two nodes for your job with 10 CPU cores and 50 GB of memory on each node.
    #SBATCH -c10 --mem=50G -n2 -N2
    #SBATCH -J MyNonInteractiveJob

    # We can load modules in the script.
    module load anaconda
    conda activate

    # Tasks can be run in parallel with ‘srun’, ‘&’, and ‘wait’.
    srun -n1 -N1 task_1.sh &
    srun -n1 -N1 python task_2.py &
    wait

    # Tasks can also be run in serial.
    srun -n1 -N1 task_1.sh
    srun -n1 -N1 task_2.sh

    Once you have written your script, it can be submitted with the following:

    sbatch submit.sh
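
After submission, you can check on your job and its output. By default, sbatch writes output to a file named slurm-<jobid>.out in the directory you submitted from; replace <jobid> below with the ID reported by sbatch:

    squeue -u $USER          # confirm the job is pending (PD) or running (R)
    cat slurm-<jobid>.out    # view the job's output once it has started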

To learn more about using GPUs through Slurm on ADAPT, see:

GPUs on ADAPT