// USING SLURM 

NCCS provides SchedMD's Slurm resource manager for users to control their scientific computing jobs and workflows on the Discover supercomputer. This page explains how users can submit jobs to be scheduled, how to specify resource requests such as CPU time and memory, and other options for optimizing productivity.

Use Slurm commands to request both interactive and batch access to Discover computational resources.

Reference NCCS online man pages first for full documentation of the currently installed version of all Slurm commands.

Additional documentation can be found on the Slurm website. Note: it may reflect a different version of Slurm than the one installed at NCCS.


// Key Commands

Command          Application
sbatch           submit a batch job script for queueing and execution
salloc / xalloc  submit an interactive job request
srun             run a command within an existing job, on a subset of allocated resources
scancel          cancel a queued or running job
squeue           query the status of your job(s) or the job queue

NCCS Quality of Service

Slurm's Quality of Service (QoS) feature controls resource limits for every job in the Discover job queue.

QOS


// USING SLURM ON DISCOVER

// Running Jobs on Discover using Slurm 

Submit a job

In general, you will create a batch job script. Either a shell script or a Python script is allowed, but throughout this user guide we use only shell scripts for demonstration. You then submit the batch script using sbatch; the following example requests 2 nodes with at least 2 GB of memory per CPU:

$ sbatch --nodes=2 --mem-per-cpu=2048 [my_script_file]

"my_script_file" should include all the necessary requirements for a batch script. Look at the example job scripts here for more information.

Below is a subset of sbatch environment variables which may be most useful to NCCS users.

See How to Determine Memory Usage for various methods of calculating your job memory requirements.

Note: The "sbatch" command scans the lines of the script file for SBATCH directives. A line in the script file will be processed as a directive to "sbatch" if and only if the string of characters starts with #SBATCH (with no preceding blanks). The remainder of the directive line consists of the options to "sbatch" in the same syntax as they appear on the command line.

The sbatch command reads down the shell script until it finds the first line that is not a valid SBATCH directive, then stops. The rest of the script is the list of commands or tasks that the user wishes to run.
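As a minimal sketch (the job name, account string, and executable are placeholders following the conventions used elsewhere in this guide), a batch script is therefore a block of #SBATCH directives followed by ordinary shell commands:

#!/bin/bash
#SBATCH --job-name=sketch_job
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=2048
#SBATCH --time=01:00:00
#SBATCH --account=xxxx

# Everything from the first command onward is executed as an ordinary shell script
echo "Running on $SLURM_JOB_NUM_NODES nodes"
./myexec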

There are many options to the "sbatch" command. The table lists a few commonly used options. Please refer to the man pages on Discover for additional details.

SBATCH OPTIONS

Submit an interactive job

Use the salloc command to request interactive Discover resources through Slurm. The following command gives you a 3-node job allocation, and places you in a shell session on its head node. Your terminal bell will ring to notify you when you receive your job allocation:

$ salloc --nodes=3 --bell
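Once the allocation is granted and you are placed in a shell on the head node, you can use srun to launch work on the allocated nodes and exit to release them. A brief sketch (my_mpi_program is a hypothetical executable):

$ srun --ntasks-per-node=1 hostname    # show the hostnames of the allocated nodes
$ srun ./my_mpi_program                # launch parallel work within the allocation
$ exit                                 # end the interactive session and release the nodes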

The options described in the sbatch link below also apply to salloc.

Here is a subset of salloc environment variables which may be most useful to NCCS users.

SBATCH OPTIONS

Submit an interactive job with X11 forwarding

The following xalloc command (an NCCS wrapper for salloc) sets up X11 forwarding and starts a shell on the job's head node, while the --ntasks argument lets Slurm allocate any number of nodes to the job that together can provide 56 cores:

$ xalloc --ntasks=56

The xalloc wrapper forwards all options to salloc. Please note: both salloc and xalloc will place you on the head node of a multi-node allocation.

Submit a set of jobs

Using: --depend

Quite often, users may want to execute multiple long runs which must be processed in sequence. SBATCH job dependencies allow you to prevent a job from running until another job has completed one of several actions (started running, completed, etc...).

SBATCH allows users to move the logic for job chaining from the script into the scheduler. The format of an sbatch dependency directive is -d, --dependency=dependency_list, where dependency_list is of the form type:job_id[:job_id][,type:job_id[:job_id]]. For example,

$ sbatch --dependency=afterok:523568 secondjob.sh

schedules the second job only after job 523568 has completed successfully. Useful "types" in the dependency expression are:

  • afterok: the job is scheduled only if the specified job completed successfully (exited without errors).
  • afternotok: the job is scheduled only if the specified job exited with errors.
  • afterany: the job is scheduled once the specified job exits, with or without errors.
  • after: the job is scheduled once the specified job has started.

A simple example below shows how to submit three batch jobs with dependencies, using the following driver script:

#!/bin/bash
# Submit the first job and capture its job ID (field 4 of "Submitted batch job <id>")
FIRST=$(sbatch testrun.sh | cut -f 4 -d' ')
echo $FIRST
# Hold the second job until the first exits, with or without errors
SECOND=$(sbatch -d afterany:$FIRST testrun.sh | cut -f 4 -d' ')
echo $SECOND
# Hold the third job until the second exits
THIRD=$(sbatch -d afterany:$SECOND testrun.sh | cut -f 4 -d' ')
echo $THIRD
exit 0

The SECOND job will be on hold until the FIRST job exits with or without error. The THIRD job will be on hold until the SECOND job completes.

The second example below shows a script to submit a job "job2.slurm" but the job will not be queued until the current running or queued jobs are all completed with no errors.

#!/usr/bin/csh -fx
# Query all my queued and running jobs (squeue -u) and reformat the job
# IDs into a single string of the form Job-ID1:Job-ID2:Job-ID3...
set arglist = `squeue -u myuserid --noheader -o "%i" | \
sed -n '1h;2,$H;${g;s/\n/:/g;p}'`
sbatch -d afterok:$arglist job2.slurm
exit 0

Using: --nice

The sbatch "nice" option can be assigned a value of 1 to 10000, where 10000 is the lowest available priority. (This value specifies a scheduling preference among a set of jobs, but it is still possible for Slurm's backfill algorithm to run a lower-priority job before a higher-priority job. For strict job ordering, use --depend as described above.) To specify a low-priority job relative to a more important, higher-priority job:

$ sbatch ./important.sh
$ sbatch --nice=10000 ./low_priority.sh

Submit replicated jobs

A job array represents a collection of subjobs (also referred to as job array "tasks") which only differ by a single index parameter. Sometimes users may want to submit many similar jobs based on the same job script. Rather than using a script or multiple similar scripts and repeatedly calling sbatch, a job array allows the creation of multiple such subjobs within one job script. Therefore, it offers users a mechanism for grouping related work, making it possible to not only submit, but also query, modify and display the set as a single unit. (See additional job-array monitoring and control methods in man pages for squeue, scontrol, and scancel.)

Job arrays are submitted through the -a or --array option to sbatch. This option specifies multiple values using a comma separated list and/or a range of values with a "-" separator. An example job array script "testArrayjobs.sh" is shown below.

#!/bin/bash
#SBATCH -J Test_Arrayjobs
#SBATCH --ntasks-per-node=12 --ntasks=24
#SBATCH --constraint=hasw
#SBATCH --time=12:00:00
#SBATCH -o output.%A_%a
#SBATCH --account=xxxx
#SBATCH --array=0-4:2
. /usr/share/modules/init/bash
module purge
module load comp/intel-12.1.0.233 mpi/impi-4.1.0.024
cd /discover/nobackup/myuserid
mkdir -p work.${SLURM_ARRAY_TASK_ID}
cd work.${SLURM_ARRAY_TASK_ID}
cp /discover/nobackup/myuserid/myexec .
echo ${SLURM_ARRAY_TASK_ID} > input.${SLURM_ARRAY_TASK_ID}
mpirun -np 24 ./myexec
exit 0

In this case, "sbatch testArrayjobs.sh" submits a job array containing three subjobs with the unique sequence numbers 0, 2, and 4, as defined by "-a" or "--array=0-4:2". Each of the three subjobs runs in a different work directory: work.0, work.2, or work.4. Each subjob runs on 2 nodes with 24 MPI tasks, reads a different input file, and writes standard output to output.jobid_array_task_index. Note that "%A" is replaced by the job ID and "%a" by the array task ID.

$ squeue | grep myuserid
524230_0 compute Test_Arra myuserid R 0:03 2 borgr[019-020]
524230_2 compute Test_Arra myuserid R 0:03 2 borgr[001-002]
524230_4 compute Test_Arra myuserid R 0:03 2 borgr[003-004]

Environment variables for sbatch, salloc, and srun

Slurm provides extensive automation and customization capabilities for each of its commands through a set of environment variables. We link to the essential subsets of environment variables for sbatch, salloc, and srun below, and recommend these links as starting points. If these are inadequate for your needs, see the man page for each command for a complete, detailed list of environment variables, or contact NCCS support.

SBATCH

SALLOC

SRUN

Learn how to use bash and c-shell scripts to automate jobs, using example scripts as a reference.

// Monitoring Jobs on Discover using Slurm

Query jobs using squeue

To see the status of your job, "squeue" queries the current job queue and lists its contents. Useful options, which can be combined as in the example below, include:

  • -a, which lists all jobs
  • -t R, which lists all running jobs
  • -t PD, which lists all pending (non-running) jobs
  • -p datamove, which lists all jobs in the datamove partition
  • -j [jobid], which lists only your job
  • --user=[userid], which lists the jobs submitted by [userid]
  • --start --user=[userid], which lists the jobs submitted by [userid], with their current start-time estimates (as available)
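For example, to list your own pending jobs along with their current start-time estimates:

$ squeue --user=[userid] -t PD --start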

Although squeue is a convenient command for querying the status of jobs and queues, please be careful not to issue it excessively, for example, by using a script to query a job's status every few seconds after it has been submitted.

Using squeue to emulate qstat output

The squeue output format is completely customizable using a printf-style formatting string. If you prefer a PBS qstat-like format, you can put the following in your .profile or .login to set the SQUEUE_FORMAT variable every time you log in. Then you can just type squeue, and it will format the output based on the contents of the SQUEUE_FORMAT environment variable.

export SQUEUE_FORMAT="%.15i %.25j %.8u %.10M %.2t %.9P"

man squeue will give you more information about the squeue formatting options. If there's something specific you'd like to see, let us know and we can help determine the correct format string to display it for you.

Monitoring job output and error files

While your batch job is running, you can monitor its standard output and standard error files. By default, Slurm writes stdout and stderr into a single file. To separate stderr from stdout, specify:

#SBATCH --output=jobname-%j.out
#SBATCH --error=jobname-%j.err

If you do not give an output filename, the default file is slurm-$SLURM_JOB_ID.out. Be careful not to rename or move the stdout/stderr files while the job is still executing.

If you are using srun to perform multiple job steps simultaneously, which direct their outputs to different files, see the man page for sattach. Feel free to contact NCCS support for assistance.
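As a sketch, sattach attaches to the standard I/O of a running job step given a jobid.stepid pair; for example, to attach to step 0 of a job:

$ sattach [jobid].0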

// Killing Jobs on Discover using Slurm

Cancel a pending or running job

To delete a job, use "scancel" followed by the job ID. For example:

$ scancel 1033320

Cancel all of your pending and running jobs

To delete all of your jobs across all partitions simultaneously, for example if they were submitted by mistake, use:

$ scancel --user=myuserid

The --user option terminates all of your jobs, both pending and running.
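scancel can also filter by job state; for example, to cancel only your pending (not yet running) jobs:

$ scancel --user=myuserid --state=PENDING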

// Slurm Example Scripts

Serial Job Script

#!/bin/bash
#SBATCH --job-name=Serial_Test_Job
#SBATCH --ntasks=1 --constraint=hasw
#SBATCH --time=1:00:00
#SBATCH -o output.%j
#SBATCH -e error.%j
#SBATCH --qos=debug
#SBATCH --account=xxxx
#SBATCH --workdir=/discover/nobackup/myuserid
./myexec
exit 0

By default, Slurm executes your job from the directory where you submitted it. You can change the working directory by cd-ing to it in the script, or by specifying the --workdir option to sbatch.

OpenMP Job Script

#!/usr/bin/bash
#SBATCH -J Test_Slurm_Job
#SBATCH --ntasks=1 --cpus-per-task=6 --constraint=hasw
#SBATCH --time=1:00:00
#SBATCH -o output.%j
#SBATCH --account=xxxx
#SBATCH --workdir=/discover/nobackup/myuserid
export OMP_NUM_THREADS=6
# the line above is optional if "--cpus-per-task=" is set
export OMP_STACKSIZE=1G
export KMP_AFFINITY=scatter
./Test_omp_executable
exit 0

Note: The option "--cpus-per-task=n" advises the Slurm controller that ensuing job steps will require "n" processors per task. Without this option, the controller will just try to allocate one processor per task. Even when "--cpus-per-task" is set, you can still set OMP_NUM_THREADS explicitly to a different number, as long as it does not exceed the requested resources.
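As a sketch, one way to keep the thread count consistent with the request is to derive OMP_NUM_THREADS from the variable Slurm sets when "--cpus-per-task" is given (the executable name follows the example above):

#SBATCH --cpus-per-task=6
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is specified;
# fall back to a single thread if it is not set
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
./Test_omp_executable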

MPI/OpenMP Hybrid Job Script

#!/usr/bin/csh
#SBATCH -J Test_Job
#SBATCH --nodes=4 --ntasks=24 --cpus-per-task=2 --ntasks-per-node=6
#SBATCH --constraint=hasw
#SBATCH --time=12:00:00
#SBATCH -o output.%j
#SBATCH --account=xxxx
source /usr/share/modules/init/csh
module purge
module load comp/intel-13.1.3.192 mpi/impi-4.1.0.024
cd $SLURM_SUBMIT_DIR
setenv OMP_NUM_THREADS 2
setenv OMP_STACKSIZE 1G
setenv KMP_AFFINITY compact
setenv I_MPI_PIN_DOMAIN auto
mpirun -perhost 6 -np 24 ./Test_executable


// APPLICATIONS OF SLURM 

// Slurm Best Practices on Discover

The following approaches allow Slurm's advanced scheduling algorithm the greatest flexibility to schedule your job to run as soon as possible.

Learn how to use Haswell nodes to submit a Slurm job.

Feel free to experiment with these, or contact support@nccs.nasa.gov, and we'll be happy to customize our recommendations for your specific use case.

DON'T

The guiding principle here is to specify only what's necessary, to give yourself the best and earliest chance of being scheduled.

  • Don't specify any partition, unless you are trying to access specialized hardware, such as datamove or co-processor nodes. Since the default partition may need to change over time, eliminating such specifications will minimize required changes in your job scripts in the future.
  • Don't specify any processor architecture (e.g. "sand" or "hasw"), if your job can run on either Sandy Bridge or Haswell nodes. NCCS's Slurm configuration ensures that each job will only run on one type of processor architecture.

DO

The guiding principle here is to specify complete, accurate, and flexible resource requirements:

Time Limit

Specify both a preferred maximum time limit and a minimum time limit, if your workflow performs self-checkpointing. In this example, if you know that your job will save its intermediate results within the first 4 hours, these specifications will cause Slurm to schedule your job in the earliest available time window of 4 hours or longer, up to 12 hours:

#SBATCH --time=12:00:00
#SBATCH --time-min=04:00:00

Alternatively, specify as low a time limit as will realistically allow your job to complete; this will enhance your job's opportunity to be backfilled:

#SBATCH --time=

Memory Limits

Specify memory requirements explicitly, either as memory per node or as memory per CPU:

#SBATCH --mem=12G

or

#SBATCH --mem-per-cpu=3G

The following combination of options will let Slurm run your job on any combination of nodes (all of the same type - Sandy Bridge or Haswell) that has an aggregate core count of at least 256 and an aggregate total memory of at least 512G:

#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=256

Node Requirements

Specify a range of acceptable node counts. This example tells the scheduler that the job can use anywhere from 128 to 256 nodes. (NOTE: Your job script must then launch the appropriate number of tasks, based on how many nodes you are actually allocated.)

#SBATCH --nodes=128-256

Specify the minimum number of CPUs per node that your application requires. With this example, your application will run on any available node with 16 or more cores available:

#SBATCH --mincpus=16

To flexibly request large-memory nodes, you could specify a node range, the maximum number of tasks (if you receive the maximum node count you request), and the total memory needed per node. For example, for an application that can run on anywhere from 20-24 nodes, needs 8 cores per node, and uses 2G per core, you could specify the following:

#SBATCH --nodes=20-24
#SBATCH --ntasks-per-node=8
#SBATCH --mem=16G

In the above, the tasks-per-node count is fixed at 8, so the total task count depends on how many nodes Slurm actually allocates. Your application will therefore need to be able to run on 160, 168, 176, 184, or 192 cores, and will need to launch the appropriate number of tasks based on how many nodes you are actually allocated.
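As a minimal sketch of that run-time arithmetic (Test_executable is a placeholder, as in the hybrid example above), the job script can compute the launch size from the node count Slurm actually granted:

# SLURM_JOB_NUM_NODES holds the number of nodes actually allocated (20 to 24 here)
NP=$(( SLURM_JOB_NUM_NODES * 8 ))   # 8 tasks per node, matching --ntasks-per-node
mpirun -np $NP ./Test_executable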



Using Haswell Nodes to Submit a Slurm Job

Use the proper directives. From the command line:

$ sbatch --constraint=hasw jobscript

Or as an inline directive in the job script:

#SBATCH --constraint=hasw

It is always good practice to ask for resources in terms of cores or tasks rather than number of nodes. For example, 10 Haswell nodes (28 cores each) could run 280 tasks on 280 cores.

The wrong way to ask for the resources:

#SBATCH --nodes=10

The right way to ask for resources:

#SBATCH --ntasks=280

If you need more memory per task and, therefore, use fewer cores per node, use the following (note: memory below is in megabytes):

#SBATCH --ntasks-per-node=N
#SBATCH --mem-per-cpu=M
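For example, a hypothetical request that halves the tasks per Haswell node while giving each task 4 GB of memory (the values are illustrative only) might look like:

#SBATCH --ntasks-per-node=14
#SBATCH --mem-per-cpu=4096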