Monitoring Jobs on Discover using Slurm

Query jobs using squeue

To see the status of your jobs, use squeue, which queries the current job queue and lists its contents. Useful options include:

-a which lists all jobs
-t R which lists all running jobs
-t PD which lists all pending (non-running) jobs
-p datamove which lists all jobs in the datamove partition
-j [jobid] which lists only the job with the given job ID
--user=[userid] which lists the jobs submitted by [userid]
--start --user=[userid] which lists the jobs submitted by [userid], with their current start-time estimates (as available)
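As an illustration of how the options above combine, here is a minimal sketch of two wrapper functions. The function names my_running and my_start_estimates are hypothetical and not part of Slurm or the Discover environment; only the squeue options themselves come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only the squeue options are taken from the list above.

# List the running jobs of one user (defaults to the current user).
my_running() {
    squeue --user="${1:-$(id -un)}" -t R
}

# List one user's jobs with their current start-time estimates.
my_start_estimates() {
    squeue --start --user="${1:-$(id -un)}"
}
```

For example, my_running with no argument lists your own running jobs, and my_start_estimates [userid] shows the estimates for another user's jobs.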

Although squeue is a convenient way to check the status of jobs and queues, please do not issue the command excessively. For example, do not run a script that queries the status of a job every five seconds after submitting it.
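If a script does need to wait for a job, a long pause between queries keeps the load low. The following is only a sketch: the function name wait_for_job is hypothetical, and the 60-second default interval is an assumption, not an NCCS-recommended value.

```shell
#!/usr/bin/env bash
# wait_for_job JOBID [INTERVAL_SECONDS]
# Hypothetical helper: blocks until JOBID no longer appears in squeue output.
# The 60-second default is an assumption; choose the longest interval you can tolerate.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-60}"
    local state
    while true; do
        # -h suppresses the header line; -o "%T" prints only the job state.
        state="$(squeue -h -j "$jobid" -o "%T" 2>/dev/null)"
        # Empty output (or an error) is treated as "the job has left the queue".
        [ -z "$state" ] && break
        echo "job $jobid: $state"
        sleep "$interval"
    done
    echo "job $jobid is no longer in the queue"
}
```

A call such as wait_for_job [jobid] 120 would then query the queue once every two minutes.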

Using squeue to emulate qstat output

The squeue output format is fully customizable through a printf-style format string. If you prefer a PBS qstat-like layout, add the following line to your .profile (for sh-type shells such as bash) so that the SQUEUE_FORMAT environment variable is set every time you log in. If you use a csh-type shell, which reads .login, use the equivalent setenv command there instead. After that, typing squeue alone formats the output according to the contents of SQUEUE_FORMAT.

export SQUEUE_FORMAT="%.15i %.25j %.8u %.10M %.2t %.9P"

man squeue will give you more information about the squeue formatting options. If there's something specific you'd like to see, let us know and we can help determine the correct format string to display it for you.
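To experiment with a format string before committing it to your login files, the same string can be passed with the -o option for a single invocation. In the sketch below, the function name squeue_wide is hypothetical, and the extra %l (time limit) and %D (node count) fields are an illustrative addition to the format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one-off squeue call with an explicit format string.
# Fields: %i job ID, %j job name, %u user, %M time used, %l time limit,
#         %t compact state, %P partition, %D number of nodes.
# The number after the dot is the field width; the dot right-justifies the field.
squeue_wide() {
    squeue -o "%.15i %.25j %.8u %.10M %.10l %.2t %.9P %.6D" "$@"
}
```

Any further squeue options can be appended, for example squeue_wide --user=[userid].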

Monitoring job output and error files

While your batch job is running, you can monitor its standard output and standard error files. By default, Slurm writes both stdout and stderr to a single file. To send stderr to a separate file, specify:

#SBATCH --output=jobname-%j.out
#SBATCH --error=jobname-%j.err

If you do not give an output filename, the default file is slurm-$SLURM_JOB_ID.out. Be careful not to rename or move the stdout/stderr files while the job is still executing.
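Reading the file while the job runs is safe, since it does not rename or move it. A minimal sketch for following the default output file is shown below; the function names are hypothetical, and the sketch assumes that no --output option was given and that the file is in the directory you pass (the current directory by default).

```shell
#!/usr/bin/env bash
# default_outfile JOBID [DIR]
# Hypothetical helper: prints the default Slurm output file name for JOBID.
# Assumes the job was submitted without --output and writes its file in DIR.
default_outfile() {
    printf '%s/slurm-%s.out\n' "${2:-.}" "$1"
}

# follow_job_output JOBID [DIR]
# Follows the file as it grows; press Ctrl-C to stop.
follow_job_output() {
    # tail -F follows the file by name and keeps retrying if it does not exist yet.
    tail -F "$(default_outfile "$@")"
}
```

For example, follow_job_output [jobid] run from the submission directory shows new output lines as the job writes them.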

If you are using srun to run multiple job steps simultaneously, each directing its output to a different file, see the man page for sattach, which attaches to the standard I/O of a running job step. Feel free to contact NCCS support for assistance.

Attach to a running job

srun --jobid=<SLURM_JOBID> --pty bash #or any interactive shell

This command places your shell on the head node of the running job (a job in the "R" state in squeue). From there you can run top, htop, ps, or a debugger to examine the running work. If the job uses more than one node, you can ssh from the head node to the other nodes in the job (see the SLURM_JOB_NODELIST environment variable or the squeue output for the list of nodes assigned to the job). Exiting the shell ends the srun command and returns you to your original login node session.
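The node list reported by squeue or SLURM_JOB_NODELIST is usually in a compressed form. The sketch below shows one way to expand it into individual hostnames to ssh to; the function name job_nodes is hypothetical, while the %N format code and scontrol show hostnames are standard Slurm features.

```shell
#!/usr/bin/env bash
# job_nodes JOBID
# Hypothetical helper: prints one hostname per line for the nodes of a running job.
job_nodes() {
    local nodelist
    # %N is the squeue format code for the allocated node list (compressed form).
    nodelist="$(squeue -h -j "$1" -o "%N")"
    # scontrol show hostnames expands the compressed list, one hostname per line.
    scontrol show hostnames "$nodelist"
}
```

Inside the attached shell, scontrol show hostnames "$SLURM_JOB_NODELIST" gives the same expansion without querying the queue.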