Using Prism
Prism is equipped with several powerful servers built specifically for accelerating AI/ML/DL workloads with GPUs. This cutting-edge platform is easy to access, and its preinstalled software and libraries provide foundational tools that help scientists get the most out of their workflows.
- Environment
- Access
- Tools
- SLURM
- High Speed Data Storage
- Converting workflows to ARM architecture
- Using containers on Prism
Environment
| System | Sockets | CPU Cores per Socket | Total CPU Cores | CPU Memory (GB) | NVMe Storage (TB) | GPUs and GPU Memory | Slurm Partition |
|---|---|---|---|---|---|---|---|
| gpu[001-022] | 2 | 20 | 40 | 768 | 3.8 | 4x NVIDIA V100 32GB | compute (default) |
| gh[001-062] | 1 | 72 | 72 | 480 | 1 | 1x NVIDIA H100 96GB, 480GB with coherent memory | grace |
| gg[001-002] | 2 | 72 | 144 | 480 | 1 | N/A | grace-cpuonly |
| gpu100 | 2 | 64 | 128 | 1024 | 14 | 8x NVIDIA A100 40GB | dgx |
The systems come with the following software and libraries pre-installed. However, users will need to use Conda environments to access Python machine-learning packages.
- Operating System: RHEL
- CUDA 12
- Conda/Miniforge 24.9
- Python
- Tensorflow
- Pytorch
- Facebook AI Similarity Search (FAISS)
- XGBoost
- RapidsAI
- dask-cudf, dask-xgboost, dask-cuda
- EarthML/PyViz
- Scikit-Learn
Access
Access to the Prism GPU Cluster is provided with your NCCS user account. The Prism JupyterHub is available via NAMS request. You may connect by logging in to adapt.nccs.nasa.gov or adaptlogin.nccs.nasa.gov and then ssh-ing to gpulogin1. Once you are connected to the login node, you will need to use SLURM to access the Prism GPU resources.
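For example, a typical connection from a local terminal looks like the following (replace <userid> with your NCCS username; the hostnames are those listed above):
ssh <userid>@adaptlogin.nccs.nasa.gov
ssh gpulogin1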
For more information on SLURM, see the SLURM section below.
For more information on access and login, see NCCS Account Setup.
Tools
Ganglia
To see GPU utilization on every node, go to: Prism Ganglia
The V100 nodes each have four GPUs; select "gpuX_util" for GPUs 0, 1, 2, and 3. The DGX provides GPUs 4-7 as well.
JupyterHub
See here for information on accessing and using the current JupyterHubs on ADAPT.
Conda
Click here to read our nugget about configuring Conda.
Conda environments have been used to install the Python machine learning frameworks. These environments can be accessed by loading the Conda module with:
- Prism GPU Cluster: 'module load miniforge'
Once the module is loaded, activate the environment of your choice by running 'conda activate <ENV>'.
Users can inspect the complete list of packages and versions installed within an environment by running:
$ conda list
Users can also inspect other available environments by running:
$ conda env list
Users may also create Conda environments in their home directory. This will allow users to maintain the environment on their own.
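As a minimal sketch of creating a personal environment (the environment name, Python version, and package list here are placeholders):
module load miniforge
conda create -n my-ml-env python=3.11 numpy
conda activate my-ml-env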
For more information regarding Conda usage in ADAPT, see our Instructional Video and Tech Talk slides.
If you are experiencing issues with Conda, and/or if additional package installation support is needed, please contact NCCS support.
Modules
We recommend loading both the Conda and NVIDIA modules on Prism using 'module load nvidia' and 'module load Conda'; use 'module spider' to see more options. For more information on how to load modules, see the Tips & Info for New NCCS ADAPT Users Tech Talk slides under "Modules".
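For example (the module names are those referenced above; 'module avail' and 'module spider <NAME>' are standard module commands for listing and searching modules):
module load nvidia
module load Conda
module avail
module spider miniforge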
SLURM
SLURM allows for more efficient resource allocation, fairer sharing, and easier management of the resources on the GPU nodes. There are three main ways to interface with SLURM on the GPU nodes:
- 'sbatch': Submit a batch script to SLURM. Create a job script that can be submitted to the queue and call multiple tasks from within.
- 'srun': Specify resources for running a single command or execute a job step.
- 'salloc': Run interactively on allocated resources, or run a step by step job.
All three of these mechanisms share most, but not all, of the standard SLURM configuration flags. Some of the most useful are as follows:
| Flag | Description |
|---|---|
| -G <NUM> | Specifies the total number of GPUs to be allocated to the job. |
| -t <TIME> | Allows you to set a time limit for your jobs and allocations. Acceptable formats include '-t <MINUTES>', '-t HH:MM:SS', '-t D-HH:MM:SS'. |
| --nodelist=<NODES> | Allows you to specify the nodes that you would like your jobs to run on. By default the pool includes all available nodes, but you could specify one (or more nodes separated by comma) to restrict the systems on which your work will run. This is not recommended though as you may end up waiting in a queue for a certain system when other resources may already be available. |
| -n <NUM_TASKS> | Specifies the number of tasks to run. In sbatch this is the maximum number of tasks to be run at any given time. This allows adequate resources to be allocated upon job submission. |
| -c <CPUS> | Specifies the number of processors to be allocated to each task. |
| -N <NUM_NODES> | Specifies the number of nodes to run on. |
| -J <JOB_NAME> | Allows you to name your job. |
| --mem=<SIZE> | Specifies the minimum required amount of memory to allocate per node. |
| -p <PARTITION> | Selects the SLURM partition to use. Note the partitions in the environment table above. If not specified, the default partition will be used. |
Examples & Links to Documentation:
Here are some examples to get you started. If you desire more advanced configuration to optimize your jobs, reference the documentation (linked below) for each command to see the complete list of available flags and usage options.
SBATCH: 'sbatch job.sh'
#!/bin/bash
#SBATCH -G5 -t 60 -n5 -N1 -J myBatchScriptSLURMJob --export=ALL
module load Conda
conda activate <ENV>
#Run tasks in parallel with '&' and 'wait'
srun -G3 -n1 python my_program1.py &
srun -G2 -n1 python my_program2.py &
wait
#Run tasks sequentially without '&'
srun -G5 -n1 python my_program3.py
srun -G5 -n1 python my_program4.py
SBATCH: 'sbatch job_with_partition.sh'
#!/bin/bash
#SBATCH -G5 -t 60 -n5 -N1 -J myBatchScriptSLURMJob --export=ALL -p grace
module load Conda
conda activate <ENV>
#Run tasks in parallel with '&' and 'wait'
srun -G3 -n1 python my_program1.py &
srun -G2 -n1 python my_program2.py &
wait
#Run tasks sequentially without '&'
srun -G5 -n1 python my_program3.py
srun -G5 -n1 python my_program4.py
SRUN: srun -G2 -t 60 -n1 --mem-per-cpu=100 -J myOneLineSLURMJob python myScript.py
SALLOC: salloc -G1 -t 60 -n1 -c6 --mem-per-cpu=1028 --nodelist=gpu001 -J myInteractiveSLURMJob
SALLOC for the DGX (NVIDIA A100s): salloc -G2 -p dgx
SALLOC for the Grace Hopper (NVIDIA H100s): salloc -G1 -p grace
SALLOC for the Grace Grace (CPU only): salloc -p grace-cpuonly
Running salloc will give you an interactive shell with access to the specified resources. This is similar to ssh-ing into one of the nodes; however, the resources that you can use will be limited to those requested through your allocation.
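A hedged sketch of a typical interactive session (the environment and script names are placeholders):
salloc -G1 -t 60 -n1 -c6 -J myInteractiveSLURMJob
module load Conda
conda activate <ENV>
nvidia-smi
python my_program.py
exit
Exiting the shell releases the allocation.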
Other Useful SLURM Commands for Managing Your Jobs:
- squeue: Shows the list of jobs in the current queue.
- scancel <JOB_ID>: Cancel one of your active jobs.
- sinfo: Lists available resources in the SLURM cluster.
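For example, to list only your own jobs or to check the state of a specific partition (standard SLURM options; 'grace' is one of the partitions listed in the environment table above):
squeue -u $(whoami)
sinfo -p grace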
Learn more about SLURM on ADAPT
High Speed Data Storage
If your workload is limited by file I/O, each system has several TB of high-speed NVMe storage available for use. This storage is located at "/lscratch". Run the command "mkdir /lscratch/$(whoami)" to create a directory owned by you. This storage is local to the node and is not shared with other nodes in the cluster. Note that this space is temporary, and data stored there is subject to deletion at the completion of your job.
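A hedged sketch of staging data through /lscratch inside a batch job (the paths, file names, and program flags are placeholders):
#!/bin/bash
#SBATCH -G1 -t 60 -n1 -J lscratchExample
# Create a node-local scratch directory; /lscratch is not shared between nodes
SCRATCH=/lscratch/$(whoami)/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
# Stage input data onto the fast local NVMe storage
cp "$NOBACKUP/input.h5" "$SCRATCH/"
# Run against the local copy
srun -G1 -n1 python my_program.py --input "$SCRATCH/input.h5" --output "$SCRATCH/results.h5"
# Copy results back to shared storage before the job ends; /lscratch data may be deleted afterwards
cp "$SCRATCH/results.h5" "$NOBACKUP/"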
Converting workflows to ARM architecture
- Non AI/ML Python Environment
- Python environments built for the x86_64 architecture will need to be recreated for the aarch64 (ARM) architecture.
- Load the NCCS miniforge/mamba module:
module load miniforge
- First, export the configuration of the original x86_64 Conda environment:
conda env export -n (environment name) --no-builds > environment.yaml
Ex: conda env export --no-builds -n torch_roth > environment.yaml
- Attempt to create a new environment on an ARM-based machine. We use mamba for this step as it has better version translation capabilities; using the conda command will result in a failed conversion.
mamba env create -n (new environment name) -f (dump file just created)
Ex: mamba env create -n arm_torch2 -f environment.yaml
This command may fail if packages installed in the x86_64 environment are not available for ARM (aarch64). In that case, the packages that could not be installed will need to be removed from the dump file created earlier.
If you have Pytorch installed and intend to use it, remove torch, torchaudio, and torchvision from the list of packages to install (see the sketch after this list). This will allow a fresh install of Pytorch that works properly with the Grace Hopper nodes.
- After the install successfully completes, you can load your environment.
- If you need Pytorch, please continue on to the next section for instructions on installing Pytorch.
- AI/ML Python Environment
- Pytorch
- A Pytorch package that is compatible with Grace Hopper can be installed in your conda environment from the Pytorch site:
[bsroth@gh004 ~]$ conda create --name torch
[bsroth@gh004 ~]$ conda activate torch
(torch) [bsroth@gh004 ~]$ conda install pip
An example version:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu129
You can alter the versions of torch, torchvision, and torchaudio as required. The 'cu129' directory refers to CUDA version 12.9 and can likewise be changed to the appropriate CUDA version.
- Note: if a list of available versions is needed, visit the URL for the CUDA version you are attempting to use. In the example above, if you go to https://download.pytorch.org/whl/cu129 you can find directories for each package, with the available versions listed within.
- Tensorflow
- At this point in time, Tensorflow for Grace Hopper is only available via container from the Nvidia container repository (Link Here). You can download this container from Nvidia and customize it as needed. Different versions are available via the Tags tab.
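The following is a minimal, hedged sketch of removing the Pytorch packages from the exported environment file before recreating it on ARM, as referenced above (the filtered file name is a placeholder):
grep -vE '^\s*- (torch|torchaudio|torchvision)([=<>]|$)' environment.yaml > environment_arm.yaml
mamba env create -n arm_torch2 -f environment_arm.yaml
After the environment is recreated and Pytorch is reinstalled as described above, a quick check such as python -c "import torch; print(torch.cuda.is_available())" confirms that the GPU is visible from the new environment.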
Using containers on Prism
Podman
- Podman is available on all nodes
- Please note that Podman/Docker container images do NOT work properly if stored in your $HOME or $NOBACKUP directories.
- The user profile script will create a /lscratch/$UID directory and set up the Podman storage settings file to automatically store Podman containers on /lscratch/$UID. The storage settings file will only be created if it does not already exist.
- This file can be found at $HOME/.config/containers/storage.conf
- To pass a GPU to the container, use the following arguments:
podman run -it --rm -v $HOME:$HOME -v /app:/app -v $NOBACKUP:$NOBACKUP --device nvidia.com/gpu=all --security-opt=label=disable (container image path, ID, or URL to pull)
Ex:
podman run -it --rm -v $HOME:$HOME -v /app:/app -v $NOBACKUP:$NOBACKUP --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/tensorflow:24.05-tf2-py3-igpu
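As a quick, hedged way to confirm the GPU is visible inside the container (using the same example image as above), you can pass nvidia-smi as the container command:
podman run -it --rm --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/tensorflow:24.05-tf2-py3-igpu nvidia-smi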
Singularity/Apptainer
- Singularity is available via the module system on all Prism nodes.
- Please note that if you are pulling a Docker/Podman container to convert to a .sif file, you will need to set the following environment variables. They can point to any local disk; however, we suggest /lscratch as it has fairly large capacity.
export SINGULARITY_TMPDIR=/lscratch/$USER/
export SINGULARITY_CACHEDIR=/lscratch/$USER/
export TMPDIR=/lscratch/$USER/
- By default, Singularity and Podman do not share the GPU with the container, so extra arguments are needed. To start a container with the GPU available inside the container, run the following:
singularity exec --nv -B $HOME -B /app -B $NOBACKUP (singularity image)
Ex:
singularity exec --nv -B $HOME -B /app -B $NOBACKUP $HOME/pytorch-test.sif nvidia-smi
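A hedged example of pulling an NGC image and converting it to a .sif file after exporting the variables above (the image and tag are illustrative; check the NGC catalog for current tags):
singularity pull $NOBACKUP/pytorch-test.sif docker://nvcr.io/nvidia/pytorch:24.05-py3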
Troubleshooting
- Podman
- Container does not download properly, shows "diff: operation not supported" error at end of download.
- Our clustered storage system does not support the layered filesystem used by Podman/Docker. Typically the user profile script will fix this on login to any of the Prism GPU nodes.
- The error typically looks as below:
Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:b91d8878f844c327b4ff924d4973661a399f10256ed50ac7c640b30c5894166b"/""/"sha256:238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0": creating read-only layer with ID "238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0": lsetxattr /home/bsroth/.local/share/containers/storage/overlay/238e596eb101589755d14f2ad4a1979baa274147028bba76728b1b0a069c00e0/diff: operation not supported
- To solve this issue, check the following:
- Make sure /lscratch/$USER exists
- Remove the /home/$USER/.config/containers directory if it exists
- Log out of the node and back into the node to run the profile script again.
- Ensure that /home/$USER/.config/containers/storage.conf exists with contents:
[root@gh004 containers]# cat storage.conf
[storage]
driver = "overlay"
runroot = "/run/user/578207330"
graphroot = "/lscratch/bsroth"
Note that the UID in runroot will be different; it should match your own user ID.
- Remove or rename the /home/$USER/.local/share/containers directory, then repeat the previous steps.
- Singularity/Podman
- Docker container conversion to .sif file freezes or has errors.
- Typically when errors appear similar to the following, the environment will need to be set to use a local file system as opposed to a shared file system (/home or /explore/nobackup)
2025/10/09 10:18:11 warn xattr{usr/share/doc/cuda-runtime-9-0/NVIDIA-CUDA.jpg} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2025/10/09 10:18:11 warn xattr{/explore/nobackup/people/bsroth/.nccstmp/build-temp-3844063746/rootfs/usr/share/doc/cuda-runtime-9-0/NVIDIA-CUDA.jpg} destination filesystem does not support xattrs, further warnings will be suppressed
- To solve this, the following environment variables will need to be set, making sure that the directory you are indicating exists:
export SINGULARITY_TMPDIR=/lscratch/$USER/
export SINGULARITY_CACHEDIR=/lscratch/$USER/
export TMPDIR=/lscratch/$USER/
AI/ML workflows on Jupyter Hub for Grace Hopper
- NCCS provides customized containers for Tensorflow and Pytorch based on containers from Nvidia NGC.
- These containers are available via the launcher.
- Any changes made inside the container will be temporary as the container is reloaded each time the kernel is reset.
Custom Kernels in Grace Hopper Jupyter Hub
If you would like additional Python or other packages installed inside the Tensorflow or Pytorch containers, you can use the NCCS containers as a base to build a custom container to load into Jupyter Hub.
- Existing containers available to users can be found at /app/jupyter/kernels.
- Create an Apptainer/Singularity build file:
Nvidia containers are Debian based and use apt for package management.
Bootstrap: localimage
From: (path to base image)
# Optional for metadata and documentation
%labels
Image prism-nvidia-pytorch
Tag-name tag-value
%help
Help message
%post
(additional bash commands to install needed packages)
Example:
Bootstrap: localimage
From: /app/jupyter/kernels/pytorch/pytorch_aarch64.sif
# Optional for metadata and documentation
%labels
Image prism-nvidia-pytorch
Tag-name tag-value
%help
Help message
%post
pip install mypackage
- Save this file.
- Build the new container image:
singularity build (container file to create).sif (path to build file)
example:
singularity build mycontainer.sif mybuildfile.txt
- Once the container is built, create a directory for the custom kernel configuration file in ~/.local/share/jupyter/kernels:
mkdir ~/.local/share/jupyter/kernels/<kernelname>
Example:
mkdir ~/.local/share/jupyter/kernels/mykernel
- Create the custom kernel "kernel.json" file in the path created in the previous step.
- Start a Jupyter Hub session and verify the custom kernel appears in the kernel pull down menu and works as expected.
Template:
{
"argv": [
"apptainer",
"exec",
"--nv",
"-B",
"/app,/home,/explore/nobackup/people,/explore/nobackup/projects,{connection_file}",
"(path to custom container)",
"python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "<kernel display name>",
"language": "python",
"metadata": {
"debugger": true
}
}
Example:
{
"argv": [
"apptainer",
"exec",
"--nv",
"-B",
"/app,/home,/explore/nobackup/people,/explore/nobackup/projects,{connection_file}",
"/home/bsroth/test.sif",
"python",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "bsroth pytorch",
"language": "python",
"metadata": {
"debugger": true
}
}


