Memory Monitoring

The following methods allow users to monitor memory usage, from a quick snapshot on a single node to tracking usage over the life of a job:


Monitor Interactive Jobs

Slurm provides a special kind of batch job called interactive-batch. An interactive-batch job is treated just like a regular batch job (in that it is queued up and has to wait for resources to become available before it can run). Once it starts, however, the user's terminal input and output are connected to the job in a manner similar to a login session. To the user, it appears as though they are logged into one of the available execution machines (i.e., compute nodes), and the resources requested by the job are reserved for that job. Many users find this useful for debugging their applications or for computational steering. Submit an interactive-batch job using the following command:

$ salloc

To specify the total number of CPUs and the wallclock time, include those options on the command line. For example, to request 16 CPUs for a total of 4 hours of interactive work:

$ salloc -N 2 -n 16 --mem-per-cpu=2048 -t 4:00:00

or

$ xalloc -N 2 -n 16 --mem-per-cpu=2048 -t 4:00:00

The above commands request 2 nodes from the default partition, with a minimum of 2048 MB of memory per CPU on each node. This allows you to run your MPI application via mpirun with anywhere from 1 to 16 processes across 1 to 2 nodes.

Note: xalloc passes the DISPLAY environment variable through to salloc so that you can open an xterm on the compute nodes.
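
For example, once the allocation starts (and assuming you logged in with X11 forwarding enabled, e.g., ssh -X or -Y), you can verify that the display is set and open an xterm on the compute node:

$ echo $DISPLAY
$ xterm &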

Whether you use salloc or xalloc, your job may not start immediately; it will start when sufficient resources become available. Once the 2 requested nodes become available, you will be placed onto one of the hosts. From there you can run your commands, e.g.,

$ cd $SLURM_SUBMIT_DIR
$ mpirun -np 16 <executable>
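
Before (or while) launching your application, you can also confirm what Slurm actually allocated. As an illustrative check (the exact fields reported by scontrol vary with the Slurm version):

$ echo $SLURM_JOB_ID $SLURM_NNODES $SLURM_NTASKS
$ scontrol show job $SLURM_JOB_ID | grep -i -e numnodes -e tres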

Once the program starts and goes through its initialization phase (reading data from disk, etc.), you can run the "top" command or look at the file /proc/meminfo periodically. You can either type "top" and run it interactively (type 'q' to quit), or run it non-interactively with 'top -bn1'. The output may look like:

Cpu(s): 1.9%us, 0.5%sy, 0.0%ni, 97.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st
Mem: 128964M total, 124099M used, 4864M free, 103M buffers
Swap: 8193140k total, 0k used, 8193140k free, 380888k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29018 myuserid 20 0 485m 50m 21m D 101 0.2 0:59.30 wrf.exe
29013 myuserid 20 0 488m 53m 21m D 99 0.2 0:58.97 wrf.exe
29016 myuserid 20 0 488m 52m 21m D 99 0.2 0:58.97 wrf.exe
29019 myuserid 20 0 524m 162m 93m D 99 0.7 0:31.00 wrf.exe
29021 myuserid 20 0 486m 51m 21m D 99 0.2 0:59.24 wrf.exe
29022 myuserid 20 0 486m 52m 22m R 99 0.2 0:58.57 wrf.exe
29012 myuserid 20 0 487m 51m 20m D 97 0.2 0:59.29 wrf.exe
29014 myuserid 20 0 485m 49m 20m R 97 0.2 0:59.38 wrf.exe
29015 myuserid 20 0 486m 51m 21m R 97 0.2 0:59.13 wrf.exe
29017 myuserid 20 0 486m 51m 21m D 97 0.2 0:59.11 wrf.exe
29023 myuserid 20 0 486m 52m 21m D 97 0.2 0:59.27 wrf.exe
29020 myuserid 20 0 486m 50m 20m D 95 0.2 0:58.96 wrf.exe
1 root 20 0 1068 392 324 S 0 0.0 0:07.06 init
2 root 15 -5 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root RT -5 0 0 0 S 0 0.0 0:00.04 migration/0
4 root 15 -5 0 0 0 S 0 0.0 2:29.41 ksoftirqd/0
5 root RT -5 0 0 0 S 0 0.0 0:00.06 migration/1

The first block of the "top" output shows general information about the state of the node, including CPU usage, memory usage, and swap activity.
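
If all you need is this node-level summary, the same information is also available from the free command (here in megabytes):

$ free -m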

In the second block of the "top" output, the fifth column, 'VIRT', is the total amount of virtual memory used by the process, which includes the code, data, and shared libraries, plus pages that have been swapped out. Note that VIRT = SWAP + RES, where 'SWAP' is the swapped-out portion of a task's total virtual memory image. The sixth column, 'RES', is the resident size, i.e., the non-swapped physical memory the process is using. The seventh column, 'SHR', is the amount of shared memory used by a task; it reflects memory that could potentially be shared with other processes.

Note: Slurm monitors and enforces the specified memory limit against RES only. More generally, the 'VIRT' (or VmSize) and 'RES' (or RSS) numbers are the most important ones to watch. In many cases some of this memory can be swapped out to disk, so your program can actually be "bigger" than the physical memory of the machine/node and still run without problems. But when a program uses much more memory than the physical memory of the node, it often causes file system performance issues.
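
You can also read these numbers for an individual process directly from /proc. For example, using the PID of one of the wrf.exe processes in the listing above (your PIDs will differ), the VmSize and VmRSS fields correspond to VIRT and RES:

$ grep -e VmSize -e VmRSS /proc/29018/status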

Determine the Memory of the Node Your Job Is Running On

A few different generations of processors (AMD Milan and Intel Xeon, including Skylake) are currently available on the Discover system, and the memory available varies between them.

The /proc/meminfo file provides information about memory resources on the node. This file is read-only and can be viewed with any text editor. The important fields to look for in the meminfo file include MemTotal, MemFree, and SwapTotal.

$ grep -i -e memfree -e memtotal -e swaptotal /proc/meminfo
MemTotal: 32858024 kB
MemFree: 30179228 kB
SwapTotal: 8193140 kB

To determine the amount of memory per CPU, you also need the number of CPUs on the node, which you can obtain by counting the "processor" entries in /proc/cpuinfo:

$ grep -c processor /proc/cpuinfo
16
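
Combining the two values gives the memory available per CPU. As a rough illustrative one-liner (integer arithmetic, result in MB):

$ echo "$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / $(grep -c processor /proc/cpuinfo) / 1024 )) MB per CPU"

With the numbers above, this reports roughly 2005 MB per CPU.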

Monitor How Your Program Affects Memory Usage Over Time

Note that the first method, using the top command, gives a snapshot of memory usage at a single point in time. If your program does a lot of allocating and freeing of memory (common in object-oriented programs), a single snapshot may not give you a full picture of what is going on.
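
If you just want a quick, low-tech record over time, you can sample memory usage in a loop from another terminal on the compute node and save it to a file. This is only a sketch; the 60-second interval and the log file name mem.log are arbitrary choices:

$ while true; do date >> mem.log; grep -e MemTotal -e MemFree /proc/meminfo >> mem.log; sleep 60; done

For a more systematic record, the Policeme utility described below does this for you.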

Policeme provides a utility to monitor and record the memory used by the processes running on each compute node of a Slurm job. The program writes its data to XML files, which can then be post-processed with a supplied Python script to generate PNG plots.

Policeme Documentation