// Using Prism
Prism is equipped with several powerful servers built specifically for accelerating AI/ML/DL workloads with GPUs. This cutting edge platform is easy to access and the preinstalled software/libraries provide foundational tools that enable scientists to maximize their workflows.
Environment
System | Sockets | CPU Cores per socket | Total CPU Cores | CPU Memory | NVME Storage (TB) | GPUs and GPU Memory | Slurm Partition |
---|---|---|---|---|---|---|---|
gpu[001-022] | 2 | 20 | 40 | 768 | 3.8 | 4x NVIDIA V100 32GB | compute (default) |
gh[001-062] | 1 | 72 | 72 | 480 | 1 | 1x NVIDIA H100 96GB, 480GB with coherent memory | grace |
gg[001-002] | 2 | 72 | 144 | 480 | 1 | N/A | grace-computeonly |
gpu100 | 2 | 64 | 128 | 1024 | 14 | 8x NVIDIA A100 40GB |
dgx partition |
The systems come with the following software and libraries pre-installed. However, users will have to use Conda environments to access python machine-learning packages.
- Operating System: RHEL
- CUDA 12
- Conda/Miniforge 24.9
- Python
- Tensorflow
- Pytorch
- Facebook AI Similarity Search (FAISS)
- XGBoost
- RapidsAI
- dask-cudf, dask-xgboost, dask-cuda
- EarthML/PyViz
- Scikit-Learn
Access
Access to the Prism GPU Cluster is provided with your NCCS user account. Prism Jupyter Hub is available via NAMS request. You may connect by logging in to adapt.nccs.nasa.gov or adaptlogin.nccs.nasa.gov, then ssh to gpulogin1. Once you are connected to the login node, you will need to use SLURM in order to access the Prism GPU resources.
For more information on SLURM, see the SLURM section below.
For more information on access and login, see NCCS Account Setup.
Tools
Ganglia
To see gpu utilization on every node, go to: Prism Ganglia
All nodes have four GPUs, you need to select “gpuX_util” for GPUs 0, 1, 2 and 3. The DGX provides access to GPUs 4-7, as well.
JupyterHub
See here for information on accessing and using the current JupyterHubs on ADAPT.
Conda
Conda environments have been used to install the Python machine learning frameworks. These environments can be accessed by loading the Conda module with:
- Prism GPU Cluster: ‘module load miniforge’
To activate an environment, run ‘conda activate <ENV>’ on the environment of your choice once the module is loaded.
Users can inspect the complete list of packages and versions installed within an environment by running:
$ conda list
Users can also inspect other available environments by running:
$ conda env list
Users may also create Conda environments in their home directory. This will allow users to maintain the environment on their own.
For more information regarding Conda usage in ADAPT, see our Instructional Video and Tech Talk slides.
If you are experiencing issues with Conda, and/or if additional package installation support is needed, please contact NCCS support.
Modules
Users are recommended to load both the Conda and NVIDIA modules in Prism using 'module load nvidia' and 'module load Conda'; use 'module spider' for more options. For more information on how to load modules, see the Tips & Info for New NCCS ADAPT Users Tech Talk slides under "Modules".
SLURM
SLURM allows for more efficient resource allocation, fairer sharing, and easier management of the resources on the GPU nodes. There are three main ways to interface with SLURM on the GPU nodes:
- ‘sbatch’: Submit a batch script to SLURM. Create a job script that can be submitted to the queue and call multiple tasks from within.
- ‘srun’: Specify resources for running a single command or execute a job step.
- ‘salloc’: Run interactively on allocated resources, or run a step by step job.
All three of these mechanisms share most, but not all, of the standard SLURM configuration flags. Some of the most useful are as follows:
Flag | Description |
---|---|
-G<NUM> | Specifies the total number of GPUs to be allocated to the job. |
-t<TIME> | Allows you to set a time limit for your jobs and allocations. Acceptable formats include ‘-t <MINUTES>’, ‘-t HH:MM:SS’,’ -t D-HH:MM:SS’ |
--nodelist=<NODES> | Allows you to specify the nodes that you would like your jobs to run on. By default the pool includes all available nodes, but you could specify one (or more nodes separated by comma) to restrict the systems on which your work will run. This is not recommended though as you may end up waiting in a queue for a certain system when other resources may already be available. |
-n <NUM_TASKS> | Specifies the number of tasks to run. In sbatch this is the maximum number of tasks to be run at any given time. This allows adequate resources to be allocated upon job submission. |
-c <CPUS> | Specifies the number of processors to be allocated to each task. |
-N <NUM_NODES> | Specifies the number of nodes to run on. |
-J <JOB_NAME> | Allows you to name your job. |
--mem |
Specifies the minimum required amount of memory allocated per node. |
-p | Selections the SLURM partition to use. Note the partitions in the environment table above. If not specified the default partition will be used. |