
NCCS Primer: A User's Guide

The NASA Center for Climate Simulation (NCCS) provides compute nodes for batch and interactive analysis. Thousands of nodes are available for serial and parallel processing tasks. The facilities are made up of several groups of computers, each of which is tasked with a particular aspect of data-intensive high-performance computing.

In April 2012, some of us and some of our users met for an informal brown-bag talk, the first in what we hope will become a series addressing topics of concern or interest. It covered the tape archive system, the ramifications of different patterns of use, and suggestions to improve everyone's overall experience. The slides from this session are available as a PDF in Your data on tape. We plan to add a new topic every other week, so come back often.

This Primer and Guide is divided into sections that focus on specific details of using the resources.

  • DISCOVER is the main compute cluster for processing batch jobs that require significant compute resources. It is made up of several so-called scalable units that offer a variety of processor types, and it consists of a mix of nodes dedicated to computing, interactive data analysis, and managing the global parallel file system (GPFS).

  • DALI nodes are special login nodes with very large physical memory, and are normally used interactively for large-scale data analysis.

  • DIRAC is the system that manages the archive. You log in to a Dirac node to manage your file migration in the archive.

These systems are connected by shared file systems and by network communications. It is vital to become familiar with some of the details, as the overall performance and your experience depend on it.

Software is arbitrarily divided into the following categories: Visualization and Analytics, Compilers, and Libraries. Support for various flavors and versions of MPI, Fortran, C, and C++ from different vendors is provided. The Linux modules package is used to manage users' environment variables to ensure smooth operation.

The next section contains important announcements about issues affecting all Discover users, and is likely to change often. It is a good idea to revisit this page regularly.

Quick Start

You can log in to a Discover or a Dali node using the procedures described in the System Login page. You can now also request a Dali node with a GPU.

$ ssh dali
$ ssh dali-gpu

The first command connects you to a Dali node that may or may not have a GPU. The second command connects you only to a Dali node with a GPU.

Caveats


Archive issues

Occasionally, when there are system issues either on the archive cluster machines or on the Discover or Dali nodes, file writes to the archive fail, but ls(1) output for the archive file shows the full file size even though the file contains no data. The archive administrators usually detect this. NCCS Support then emails the file owner, explaining that despite its correct-looking size, the archive file was found to contain no data blocks and is therefore corrupt. The owner is asked to replace the file if it is needed, or to delete it if it is not.
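Because the size reported by ls(1) cannot be trusted in this failure mode, one way to catch it early is to compare the archived copy byte for byte with the original right after writing it. The following is only a minimal sketch: verify_copy is a hypothetical helper name, not an NCCS utility, and it assumes that both the original and the archived copy are still readable from the node where you run it.

```shell
#!/bin/sh
# Hypothetical helper (not an NCCS tool): confirm that an archived copy
# holds the same bytes as the original, not merely the same size in ls.
# Usage: verify_copy ORIGINAL ARCHIVED_COPY
verify_copy() {
    # cmp -s prints nothing; it exits 0 only if the files are byte-identical.
    if cmp -s "$1" "$2"; then
        echo "OK: $2"
    else
        echo "MISMATCH: $2" >&2
        return 1
    fi
}
```

A copy that has the right size but no real data makes cmp exit with a non-zero status, so the function reports a mismatch and returns 1.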

Killing the right PBS job

We recently hit a million in the PBS job numbering scheme. As a result, job IDs now have 7-digit numbers, for example: 1000023.borgpbs1. Some of the PBS utilities use a fixed-width format for reporting jobs, so the last character is cut off. This makes it look like your job ID is 1000023.borgpbs (the last '1' is missing). When you try to kill this job with qdel using the truncated ID, it will tell you that the job ID is bad.

Use the full job ID instead, or just the number without the '.borgpbs1' part. The following both work:

$ qdel 1000023.borgpbs1
$ qdel 1000023

To avoid truncation of the fully qualified job ID, please use one of the following commands:

$ qstat -1anw    # that is a 1 (one), not a lowercase 'L'
$ qstat -aw
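Since qdel accepts the bare job number, a script that has picked up a possibly truncated job ID can simply drop everything from the first dot onward before using it. The sketch below shows one way to do that with plain shell parameter expansion; job_number is a hypothetical helper name, not a PBS command.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): reduce a job ID to its numeric
# part so that a truncated server suffix cannot make it invalid.
job_number() {
    # Remove the first '.' and everything after it:
    #   1000023.borgpbs1 -> 1000023
    #   1000023.borgpbs  -> 1000023
    printf '%s\n' "${1%%.*}"
}
```

For example, `qdel "$(job_number 1000023.borgpbs)"` passes 1000023 to qdel even though the suffix was cut short.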

PBS jobs waiting indefinitely

After recent system upgrades, some PBS resource names that appear inside legacy scripts are no longer valid. Occasionally, jobs are still submitted with a reference to Woodcrest, Dempsey, or Harpertown processors. These jobs will wait in the queue indefinitely. Please do not use any of the following processor references in your PBS scripts:

#PBS -l proc=harp
#PBS -l proc=demp
#PBS -l proc=wood

In fact, we do NOT recommend using "proc=west" or "proc=neha" to request a node type either. Use "ncpus" in your PBS scripts instead: "ncpus=8" requests either Nehalem or Westmere nodes, whichever are available, and "ncpus=12" requests Westmere nodes only.
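Legacy job scripts can be checked for the retired processor names before submission. The following is a minimal sketch using grep; find_retired_proc is a hypothetical helper name, and it assumes the request appears on a line that begins with #PBS.

```shell
#!/bin/sh
# Hypothetical helper: list '#PBS' lines in a job script that still
# request the retired proc=harp, proc=demp or proc=wood resources.
# Prints each matching line prefixed with its line number; the exit
# status is 0 if at least one such line was found.
# Usage: find_retired_proc JOB_SCRIPT
find_retired_proc() {
    grep -nE '^[[:space:]]*#PBS.*proc=(harp|demp|wood)' "$1"
}
```

Any line it prints should have the proc= reference removed in favor of an ncpus request before the job is submitted.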

Future Plans

This Primer is meant to be a quick introduction to the facilities for new users. Like all web pages, it is a living document that will grow as systems, configurations, and technology change, and as we learn better ways of doing things. Very soon we plan to include sections on lessons learnt, best practices based on user experiences, and recommendations from veteran users and systems staff. An FAQ section will be accompanied by another section.

Suggestions are always welcome. Please send email to NCCS Support: support@nccs.nasa.gov
