Discover GPU Partition

GPU AVAILABILITY WITHIN THE DISCOVER CLUSTER

Scalable Unit 16 (SCU16) makes GPU resources available within the NCCS Discover cluster’s gpu_a100 partition, which comprises 10 AMD nodes that each include:

  • 48 CPU cores of the AMD "Rome" processor architecture, along with
  • 4 NVIDIA A100 GPUs (each with 6,912 CUDA cores).

Note: These nodes will be fully shared, with individual nodes running jobs belonging to multiple users. Each user is limited to a maximum of one node or 4 GPUs within this partition. If you have a use case that requires more than these limits, please submit a request describing your use case to support@nccs.nasa.gov.
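
If you would like to confirm these resources from a login node, the Slurm sinfo command can summarize the partition; the format string below is just one reasonable choice of fields (node list, CPUs, memory, features, and generic resources):

  sinfo -p gpu_a100 -o "%N %c %m %f %G"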

Slurm Options

To access these GPU nodes, use the following sbatch inline directives (or their equivalents on the salloc command line):

  #SBATCH --partition=gpu_a100
  #SBATCH --constraint=rome

Then specify the number of GPUs and CPUs your job requires. For example, these directives request one node with 2 GPUs and 10 CPUs:

  #SBATCH --ntasks=10
  #SBATCH --gres=gpu:2
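
Taken together, a minimal batch script using these directives might look like the following sketch. The job name, wall-time limit, module name, and the program being launched (my_gpu_app) are illustrative assumptions, not site requirements:

  #!/bin/bash
  #SBATCH --job-name=gpu_test          # illustrative job name
  #SBATCH --time=01:00:00              # illustrative wall-time limit
  #SBATCH --partition=gpu_a100         # the GPU partition described above
  #SBATCH --constraint=rome            # request the "Rome" nodes
  #SBATCH --ntasks=10                  # 10 CPUs
  #SBATCH --gres=gpu:2                 # 2 A100 GPUs

  module load cuda                     # assumption: adjust to the CUDA module available on Discover
  srun ./my_gpu_app                    # placeholder: launches one copy of the application per task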

If you don't specify a number of CPUs with the --ntasks (or -n) option, your job will be allocated 12 CPUs per requested GPU. In this partition, memory is scheduled in proportion to the number of GPUs your job requests: by default, your job will be allocated 122 gigabytes of memory per GPU. You may specify an alternative value, for example 100 gigabytes per GPU, with:

  #SBATCH --mem-per-gpu=100G
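
As a concrete illustration of those defaults, a job that requests only GPUs, as in the sketch below, would receive 12 CPUs and 122 gigabytes of memory for each of its 4 GPUs (48 CPUs and 488 gigabytes in total):

  #SBATCH --partition=gpu_a100
  #SBATCH --constraint=rome
  #SBATCH --gres=gpu:4                 # no --ntasks: defaults to 12 CPUs per GPU
                                       # no --mem-per-gpu: defaults to 122 GB per GPU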

Finally, you may also access these nodes interactively using salloc; for example:

  salloc --partition=gpu_a100 --constraint=rome --mem-per-gpu=100G --ntasks=10 --gres=gpu:2
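
Once an interactive allocation is granted, you can verify what the job received before starting real work. The steps below report GPU status from a compute node and print the device indices Slurm assigned; exact behavior may vary with the site's Slurm configuration:

  srun --ntasks=1 nvidia-smi                              # report GPU status from a compute node
  srun --ntasks=1 bash -c 'echo $CUDA_VISIBLE_DEVICES'    # device indices assigned by Slurm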

NVIDIA CUDA Environment Variables

Within your job runtime environment, you will find that the following CUDA-related variables are set (a short example of inspecting them follows this list):

  • CUDA_VISIBLE_DEVICES
  • CUDA_DEVICE_ORDER (may be modified for your application's needs)
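
A quick way to see how these variables are set is to print them from a job step. Changing CUDA_DEVICE_ORDER to PCI_BUS_ID (one of the values CUDA recognizes, alongside the default FASTEST_FIRST) is shown only as an illustration of the kind of adjustment an application might make:

  srun --ntasks=1 bash -c 'echo $CUDA_VISIBLE_DEVICES; echo $CUDA_DEVICE_ORDER'

  # Illustration only: order devices by PCI bus ID rather than the CUDA default
  export CUDA_DEVICE_ORDER=PCI_BUS_ID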

Please see additional details regarding these values, and examples of using only a subset of your job's allocated resources (with the srun command), in SchedMD's Slurm GPU usage documentation and in the NVIDIA CUDA documentation.
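
As a brief sketch of the "subset of resources" pattern those documents describe, a job step can be restricted to part of the job's allocation. Within a job allocated 2 GPUs, for example, a step like the following runs with only one of them; whether several such steps can run concurrently depends on the Slurm version and configuration:

  srun --ntasks=1 --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'   # only one device index should appear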