Using Prism
Prism is equipped with several powerful servers built specifically for accelerating AI/ML/DL workloads with GPUs. This cutting-edge platform is easy to access, and its preinstalled software and libraries provide foundational tools that help scientists get the most out of their workflows.
- Environment
- Access
- Tools
- SLURM
- High Speed Data Storage
- Converting workflows to ARM architecture
- Using containers on Prism
Environment
| System | Sockets | CPU Cores per Socket | Total CPU Cores | CPU Memory (GB) | NVMe Storage (TB) | GPUs and GPU Memory | Slurm Partition |
|---|---|---|---|---|---|---|---|
| gpu[001-022] | 2 | 20 | 40 | 768 | 3.8 | 4x NVIDIA V100 32GB | compute (default) |
| gh[001-062] | 1 | 72 | 72 | 480 | 1 | 1x NVIDIA H100 96GB, 480GB with coherent memory | grace |
| gg[001-002] | 2 | 72 | 144 | 480 | 1 | N/A | grace-cpuonly |
| gpu100 | 2 | 64 | 128 | 1024 | 14 | 8x NVIDIA A100 40GB | dgx |
The systems come with the following software and libraries pre-installed. However, users will need to use Conda environments to access Python machine-learning packages.
- Operating System: RHEL
- CUDA 12
- Conda/Miniforge 24.9
- Python
- Tensorflow
- Pytorch
- Facebook AI Similarity Search (FAISS)
- XGBoost
- RapidsAI
- dask-cudf, dask-xgboost, dask-cuda
- EarthML/PyViz
- Scikit-Learn
Access
Access to the Prism GPU Cluster is provided with your NCCS user account. The Prism JupyterHub is available via NAMS request. You may connect by logging in to adapt.nccs.nasa.gov or adaptlogin.nccs.nasa.gov and then ssh-ing to gpulogin1. Once you are connected to the login node, you will need to use SLURM to access the Prism GPU resources.
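For example, a typical connection from a local terminal looks like the following (replace <userid> with your NCCS username; the hostnames are those listed above):
ssh <userid>@adaptlogin.nccs.nasa.gov
ssh gpulogin1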
For more information on SLURM, see the SLURM section below.
For more information on access and login, see NCCS Account Setup.
Tools
Ganglia
To see GPU utilization on every node, go to: Prism Ganglia
The V100 nodes each have four GPUs; select "gpuX_util" for GPUs 0, 1, 2, and 3. The DGX provides GPUs 4-7 as well.
JupyterHub
See here for information on accessing and using the current JupyterHubs on ADAPT.
Conda
Click here to read our nugget about configuring Conda.
Conda environments have been used to install the Python machine learning frameworks. These environments can be accessed by loading the Conda module with:
- Prism GPU Cluster: 'module load miniforge'
Once the module is loaded, activate the environment of your choice by running 'conda activate <ENV>'.
Users can inspect the complete list of packages and versions installed within an environment by running:
$ conda list
Users can also inspect other available environments by running:
$ conda env list
Users may also create Conda environments in their home directory. This will allow users to maintain the environment on their own.
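As a minimal sketch of creating a personal environment (the environment name, Python version, and package list here are placeholders):
module load miniforge
conda create -n my-ml-env python=3.11 numpy
conda activate my-ml-env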
For more information regarding Conda usage in ADAPT, see our Instructional Video and Tech Talk slides.
If you are experiencing issues with Conda, and/or if additional package installation support is needed, please contact NCCS support.
Modules
We recommend loading both the Conda and NVIDIA modules on Prism using 'module load nvidia' and 'module load Conda'; use 'module spider' to see more options. For more information on how to load modules, see the Tips & Info for New NCCS ADAPT Users Tech Talk slides under "Modules".
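For example (the module names are those referenced above; 'module avail' and 'module spider <NAME>' are standard module commands for listing and searching modules):
module load nvidia
module load Conda
module avail
module spider miniforge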
SLURM
SLURM allows for more efficient resource allocation, fairer sharing, and easier management of the resources on the GPU nodes. There are three main ways to interface with SLURM on the GPU nodes:
- 'sbatch': Submit a batch script to SLURM. Create a job script that can be submitted to the queue and call multiple tasks from within.
- 'srun': Specify resources for running a single command or execute a job step.
- 'salloc': Run interactively on allocated resources, or run a step by step job.
All three of these mechanisms share most, but not all, of the standard SLURM configuration flags. Some of the most useful are as follows:
| Flag | Description |
|---|---|
| -G <NUM> | Specifies the total number of GPUs to be allocated to the job. |
| -t <TIME> | Allows you to set a time limit for your jobs and allocations. Acceptable formats include '-t <MINUTES>', '-t HH:MM:SS', '-t D-HH:MM:SS'. |
| --nodelist=<NODES> | Allows you to specify the nodes that you would like your jobs to run on. By default the pool includes all available nodes, but you could specify one (or more nodes separated by comma) to restrict the systems on which your work will run. This is not recommended though as you may end up waiting in a queue for a certain system when other resources may already be available. |
| -n <NUM_TASKS> | Specifies the number of tasks to run. In sbatch this is the maximum number of tasks to be run at any given time. This allows adequate resources to be allocated upon job submission. |
| -c <CPUS> | Specifies the number of processors to be allocated to each task. |
| -N <NUM_NODES> | Specifies the number of nodes to run on. |
| -J <JOB_NAME> | Allows you to name your job. |
| --mem=<SIZE> | Specifies the minimum required amount of memory to allocate per node. |
| -p <PARTITION> | Selects the SLURM partition to use. Note the partitions in the environment table above. If not specified, the default partition will be used. |
Examples & Links to Documentation:
Here are some examples to get you started. If you desire more advanced configuration to optimize your jobs, reference the documentation (linked below) for each command to see the complete list of available flags and usage options.
SBATCH: 'sbatch job.sh'
#!/bin/bash
#SBATCH -G5 -t 60 -n5 -N1 -J myBatchScriptSLURMJob --export=ALL
module load Conda
conda activate <ENV>
#Run tasks in parallel with '&' and 'wait'
srun -G3 -n1 python my_program1.py &
srun -G2 -n1 python my_program2.py &
wait
#Run tasks sequentially without '&'
srun -G5 -n1 python my_program3.py
srun -G5 -n1 python my_program4.py
SBATCH: 'sbatch job_with_partition.sh'
#!/bin/bash
#SBATCH -G5 -t 60 -n5 -N1 -J myBatchScriptSLURMJob --export=ALL -p grace
module load Conda
conda activate <ENV>
#Run tasks in parallel with '&' and 'wait'
srun -G3 -n1 python my_program1.py &
srun -G2 -n1 python my_program2.py &
wait
#Run tasks sequentially without '&'
srun -G5 -n1 python my_program3.py
srun -G5 -n1 python my_program4.py
SRUN: srun -G2 -t 60 -n1 --mem-per-cpu=100 -J myOneLineSLURMJob python myScript.py
SALLOC: salloc -G1 -t 60 -n1 -c6 --mem-per-cpu=1028 --nodelist=gpu001 -J myInteractiveSLURMJob
SALLOC for the DGX (NVIDIA A100s): salloc -G2 -p dgx
SALLOC for the Grace Hopper (NVIDIA H100s): salloc -G1 -p grace
SALLOC for the Grace Grace (CPU only): salloc -p grace-cpuonly
Running salloc will give you an interactive shell with access to the specified resources. This is similar to ssh-ing into one of the nodes; however, the resources that you can use will be limited to those requested through your allocation.
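A hedged sketch of a typical interactive session (the environment and script names are placeholders):
salloc -G1 -t 60 -n1 -c6 -J myInteractiveSLURMJob
module load Conda
conda activate <ENV>
nvidia-smi
python my_program.py
exit
Exiting the shell releases the allocation.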
Other Useful SLURM Commands for Managing Your Jobs:
- squeue: Shows the list of jobs in the current queue.
- scancel <JOB_ID>: Cancel one of your active jobs.
- sinfo: Lists available resources in the SLURM cluster.
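For example, to list only your own jobs or to check the state of a specific partition (standard SLURM options; 'grace' is one of the partitions listed in the environment table above):
squeue -u $(whoami)
sinfo -p grace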
Learn more about SLURM on ADAPT
High Speed Data Storage
If your workload is limited by file I/O, each system has several TB of high-speed NVMe storage available for use. This storage is located at "/lscratch". Run the command "mkdir /lscratch/$(whoami)" to create a directory owned by you. This storage is local to the node and is not shared with other nodes in the cluster. Note that this space is temporary, and data stored there is subject to deletion at the completion of your job.
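A hedged sketch of staging data through /lscratch inside a batch job (the paths, file names, and program flags are placeholders):
#!/bin/bash
#SBATCH -G1 -t 60 -n1 -J lscratchExample
# Create a node-local scratch directory; /lscratch is not shared between nodes
SCRATCH=/lscratch/$(whoami)/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
# Stage input data onto the fast local NVMe storage
cp "$NOBACKUP/input.h5" "$SCRATCH/"
# Run against the local copy
srun -G1 -n1 python my_program.py --input "$SCRATCH/input.h5" --output "$SCRATCH/results.h5"
# Copy results back to shared storage before the job ends; /lscratch data may be deleted afterwards
cp "$SCRATCH/results.h5" "$NOBACKUP/"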
Converting workflows to ARM architecture
- Non AI/ML Python Environment
- Python environments built for the x86_64 architecture will need to be recreated for the aarch64 (ARM) architecture.
- Load the NCCS miniforge/mamba module:
module load miniforge
- First, export the configuration of the original x86_64 Conda environment:
conda env export -n (environment name) --no-builds > environment.yaml
Ex: conda env export --no-builds -n torch_roth > environment.yaml
- Attempt to create a new environment on an ARM-based machine. We use mamba for this step as it has better version translation capabilities; using the conda command will result in a failed conversion.
mamba env create -n (new environment name) -f (dump file just created)
Ex: mamba env create -n arm_torch2 -f environment.yaml
This command may fail if packages installed in the x86_64 environment are not available for ARM (aarch64). In that case, the packages that could not be installed will need to be removed from the dump file created earlier.
If you have Pytorch installed and intend to use it, remove torch, torchaudio, and torchvision from the list of packages to install (see the sketch after this list). This will allow a fresh install of Pytorch that works properly with the Grace Hopper nodes.
- After the install successfully completes, you can load your environment.
- If you need Pytorch, please continue on to the next section for instructions on installing Pytorch.
- AI/ML Python Environment
- Pytorch
- A Pytorch package that is compatible with Grace Hopper can be installed in your conda environment from the Pytorch site:
[bsroth@gh004 ~]$ conda create --name torch
[bsroth@gh004 ~]$ conda activate torch
(torch) [bsroth@gh004 ~]$ conda install pip
An example version:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu129
You can alter the versions of torch, torchvision, and torchaudio as required. The 'cu129' directory refers to CUDA version 12.9 and can likewise be changed to the appropriate CUDA version.
- Note: if a list of available versions is needed, visit the URL for the CUDA version you are attempting to use. In the example above, if you go to https://download.pytorch.org/whl/cu129 you can find directories for each package, with the available versions listed within.
- Tensorflow
- At this point in time, Tensorflow for Grace Hopper is only available via container from the Nvidia container repository (Link Here). You can download this container from Nvidia and customize it as needed. Different versions are available via the Tags tab.
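The following is a minimal, hedged sketch of removing the Pytorch packages from the exported environment file before recreating it on ARM, as referenced above (the filtered file name is a placeholder):
grep -vE '^\s*- (torch|torchaudio|torchvision)([=<>]|$)' environment.yaml > environment_arm.yaml
mamba env create -n arm_torch2 -f environment_arm.yaml
After the environment is recreated and Pytorch is reinstalled as described above, a quick check such as python -c "import torch; print(torch.cuda.is_available())" confirms that the GPU is visible from the new environment.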
Using containers on Prism
Podman
- Podman is available on all nodes
- Please note that Podman/Docker container images do NOT work properly if stored in your $HOME or $NOBACKUP directories.
- The user profile script will create a /lscratch/$UID directory and set up the Podman storage settings file to automatically store Podman containers on /lscratch/$UID. The storage settings file will only be created if it does not already exist.
- This file can be found at $HOME/.config/containers/storage.conf
- To pass a GPU to the container, use the following arguments:
podman run -it --rm -v $HOME:$HOME -v /app:/app -v $NOBACKUP:$NOBACKUP --device nvidia.com/gpu=all --security-opt=label=disable (container image path, ID, or URL to pull)
Ex:
podman run -it --rm -v $HOME:$HOME -v /app:/app -v $NOBACKUP:$NOBACKUP --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/tensorflow:24.05-tf2-py3-igpu
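As a quick, hedged way to confirm the GPU is visible inside the container (using the same example image as above), you can pass nvidia-smi as the container command:
podman run -it --rm --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/tensorflow:24.05-tf2-py3-igpu nvidia-smi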
Singularity/Apptainer
- Singularity is available via the module system on all Prism nodes.
- Please note that if you are pulling a Docker/Podman container to convert to a .sif file, you will need to set the following environment variables. They can point to any local disk; however, we suggest /lscratch as it has fairly large capacity.
export SINGULARITY_TMPDIR=/lscratch/$USER/
export SINGULARITY_CACHEDIR=/lscratch/$USER/
export TMPDIR=/lscratch/$USER/
- By default, Singularity and Podman do not share the GPU with the container, so extra arguments are needed. To start a container with the GPU available inside the container, run the following:
singularity exec --nv -B $HOME -B /app -B $NOBACKUP (singularity image)
Ex:
singularity exec --nv -B $HOME -B /app -B $NOBACKUP $HOME/pytorch-test.sif nvidia-smi
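A hedged example of pulling an NGC image and converting it to a .sif file after exporting the variables above (the image and tag are illustrative; check the NGC catalog for current tags):
singularity pull $NOBACKUP/pytorch-test.sif docker://nvcr.io/nvidia/pytorch:24.05-py3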
Troubleshooting
- Podman
- Container does not download properly, shows "diff: operation not supported" error at end of download.
- Our clustered storage system does not support the layered filesystem used by Podman/Docker. Typically the user profile script will fix this on login to any of the Prism GPU nodes.
- The error typically looks as below:
Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:b91d8878f844c327b4ff924d4973661a399f10256ed50ac7c640b30c5894166b"/""/"sha256:238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0": creating read-only layer with ID "238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0": lsetxattr /home/bsroth/.local/share/containers/storage/overlay/238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0/diff: operation not supported
- To solve this issue, check the following:
- Make sure /lscratch/$USER exists
- Remove the /home/$USER/.config/containers directory if it exists
- Log out of the node and back into the node to run the profile script again.
- Ensure that /home/$USER/.config/containers/storage.conf exists with contents:
[root@gh004 containers]# cat storage.conf
[storage]
driver = "overlay"
runroot = "/run/user/578207330"
graphroot = "/lscratch/bsroth"
Note that the UID in runroot will be different; it should match your own user ID.
- Remove or rename the /home/$USER/.local/share/containers directory, then repeat the previous steps.
- Singularity/Podman
- Docker container conversion to .sif file freezes or has errors.
- Typically when errors appear similar to the following, the environment will need to be set to use a local file system as opposed to a shared file system (/home or /explore/nobackup)
2025/10/09 10:18:11 warn xattr{usr/share/doc/cuda-runtime-9-0/NVIDIA-CUDA.jpg} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/10/09 10:18:11 warn xattr{/explore/nobackup/people/bsroth/.nccstmp/build-temp-3844063746/rootfs/usr/share/doc/cuda-runtime-9-0/NVIDIA-CUDA.jpg} destination filesystem does not support xattrs, further warnings will be suppressed
- To solve this, the following environment variables will need to be set, making sure that the directory you are indicating exists:
export SINGULARITY_TMPDIR=/lscratch/$USER/
export SINGULARITY_CACHEDIR=/lscratch/$USER/
export TMPDIR=/lscratch/$USER/
AI/ML workflows on Jupyter Hub for Grace Hopper
- NCCS provides customized containers for Tensorflow and Pytorch based on containers from Nvidia NGC.
- These containers are available via the launcher.
- Any changes made inside the container will be temporary as the container is reloaded each time the kernel is reset.
Custom Kernels in Grace Hopper Jupyter Hub
If you would like additional Python or other packages installed inside the Tensorflow or Pytorch containers, you can use the NCCS containers as a base to build a custom container to load into Jupyter Hub.
- Existing containers available to users can be found at /app/jupyter/kernels.
- Create an Apptainer/Singularity build file:
Nvidia containers are Debian based and use apt for package management.
Bootstrap: localimage
From: (path to base image)
# Optional for metadata and documentation
%labels
Image prism-nvidia-pytorch
Tag-name tag-value
%help
Help message
%post
(additional bash commands to install needed packages)
Example:
Bootstrap: localimage
From: /app/jupyter/kernels/pytorch/pytorch_aarch64.sif
# Optional for metadata and documentation
%labels
Image prism-nvidia-pytorch
Tag-name tag-value
%help
Help message
%post
pip install mypackage
- Save this file.
- Build the new container image:
singularity build (container file to create).sif (path to build file)
example:
singularity build mycontainer.sif mybuildfile.txt
- Once the container is built, create a directory for the custom kernel configuration file in ~/.local/share/jupyter/kernels:
mkdir ~/.local/share/jupyter/kernels/<kernelname>
Example:
mkdir ~/.local/share/jupyter/kernels/mykernel
- Create the custom kernel "kernel.json" file in the path created in the previous step.
- Start a Jupyter Hub session and verify the custom kernel appears in the kernel pull down menu and works as expected.
Template:
{
"argv": [
"apptainer",
"exec",
"--nv",
"-B",
"/app,/home,/explore/nobackup/people,/explore/nobackup/projects,{connection_file}",
"(path to custom container)",
"python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "<kernel display name>",
"language": "python",
"metadata": {
"debugger": true
}
}
Example:
{
"argv": [
"apptainer",
"exec",
"--nv",
"-B",
"/app,/home,/explore/nobackup/people,/explore/nobackup/projects,{connection_file}",
"/home/bsroth/test.sif",
"python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "bsroth pytorch",
"language": "python",
"metadata": {
"debugger": true
}
}


