An Introduction to SCU10 (and looking forward to SCU11)
--------------------------------------------------------

$Id: scu10.txt,v 1.8 2015/03/09 16:22:07 ewinter Exp $


Introduction
------------

The addition of SCU10 (and soon, SCU11) represents a significant
increase in the total computing capacity available on discover.
However, this increased capacity is in a more "concentrated" form -
the nodes on SCU10 have 28 cores each, as compared to the 12 cores on
the old Westmere nodes, and 16 cores on the existing Sandy Bridge
nodes. Therefore, the total number of nodes will be decreasing at the
same time the total compute capacity is increasing. This new set of
resources will require some changes on the part of the users and
developers to ensure efficient use. This document will provide a
quick introduction to what SCU10 is, and what you may need to do to
get your code to run efficiently on it.


What is SCU10?
--------------

SCU10 is the latest addition to the discover cluster. It replaced the
older SCU7, which contained 1200 12-core Westmere nodes. The SCU10
hardware has the following characteristics:

  * 1080 Intel Haswell nodes
  * 2 sockets/node, 14 cores/socket = 28 cores/node
  * 128 GB memory, ~120 GB available to user
  * SLES 11 SP3
  * Local scratch space
  * NO SWAP SPACE!
  * FDR (Fourteen Data Rate) Infiniband


How can SCU10 be used?
----------------------

Currently, the nodes in SCU10 are divided into two sets:

    720 nodes for general compute use
    360 nodes available for a dedicated project

The nodes in SCU10 are currently available as part of the 'compute'
partition. To request that your job run on SCU10 (and thus on a
Haswell node), use the --constraint option to the SLURM sbatch
command. Using --constraint, you specify the special features that
you require in your allocated nodes. For SCU10, this can be one of
two values:

1) Request a Haswell node:

   #SBATCH --constraint=hasw

2) Request a node running the SLES11 SP3 operating system:

   #SBATCH --constraint=sp3

Note that the Haswell nodes in the old 'sp3' partition have been
merged into the general-use 'compute' partition. References to
'--partition=sp3' will now fail with an error indicating an invalid
partition. If you encounter this problem, simply change:

   #SBATCH --partition=sp3

to

   #SBATCH --constraint=hasw

or

   #SBATCH --constraint=sp3

and that should fix the problem.

In the future, most of the old Sandy Bridge nodes will be migrated to
the SP3 version of SLES11. When that happens, the 'compute' partition
will contain a mix of Sandy Bridge and Haswell nodes, and a mix of
SP1 and SP3 versions of SLES 11. At that point, if you need to
specify a particular processor type for your job (possibly for
compatibility reasons), then you _must_ specify the processor type
with the --constraint option to sbatch:

...
# This command will allocate Haswell (SCU10) nodes.
sbatch --constraint=hasw myjob.sh
...

or:

...
# This command will allocate Sandy Bridge (SCU8,9) nodes.
sbatch --constraint=sand myjob.sh
...

Since these commands do not specify a partition, they will be sent to
the default 'compute' partition, and the --constraint option will
ensure that nodes with the proper processor type are allocated to the
job. Once Sandy Bridge nodes are migrated to SP3, --constraint=sp3
will allocate either Sandy Bridge or Haswell nodes, so avoid using
that option unless your software must run on SP1. Note that the goal
is to move nearly all Sandy Bridge nodes to SP3 by late summer/early
fall 2015. However, some SP1-based nodes may remain for several
months after that point.
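If you want to check which features are currently attached to nodes
in the 'compute' partition (and how many nodes carry each feature),
you can ask SLURM directly. The following is only a sketch using
standard sinfo/scontrol options; the exact output layout depends on
the SLURM version installed:

...
# Show node counts, node state, and the feature list (e.g. hasw, sand,
# sp1, sp3) for the 'compute' partition.
sinfo -p compute -o "%D %t %f"

# Show full details (including the Features= field) for a single node.
# Replace <nodename> with an actual node name.
scontrol show node <nodename>
...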
A new set of login nodes has been set up to provide easy access to
the computing environment that is available on the SCU10 nodes. You
can log in to 'discover-sp3' instead of 'discover'. These login nodes
are configured identically to the SCU10 compute nodes (28 Haswell
cores, 128 GB memory), and so code that you compile on these login
nodes should run without problems on the SCU10 Haswell nodes.


The fine print: possible user changes for SCU10
-----------------------------------------------

There is a (small but non-zero) chance that your code will run on
SCU10 without recompiling, and your job scripts may run without any
changes. If that is the case, you're lucky - just use the same code
you have always used, but now run it on SCU10. The more likely case
is that changes of varying degrees of complexity to your job scripts
and/or program code will be required.

First, consider your job script. Some of the script changes that may
be needed to ensure SCU10 use were described in the previous section.
However, the new hardware arrangement (many more cores and _much_
more memory per node, and NO SWAP SPACE) will likely lead to changes
to your job submission scripts to take best advantage of the new
capabilities. Since the memory/core is higher, and the cores/node is
higher, you should examine the distribution of tasks in your job
script for possible improvements.

For example, if your (old) job script requests 480 Sandy Bridge
cores, your job submission script might include the lines:

...
#SBATCH --constraint=sand
#SBATCH --ntasks=480
...

or the lines:

...
#SBATCH --constraint=sand
#SBATCH --nodes=30
...

Note that we did not specify the memory per task, since the default
task distribution is 1 task per core, and the Sandy Bridge nodes have
2 GB memory per core.

To run this job in the 'compute' partition, on the SCU10 Haswell
nodes, you should use:

...
#SBATCH --constraint=hasw
#SBATCH --ntasks=480
...

This set of options will get you CEIL(480/28) = 18 Haswell nodes, or
a total of 18*28 = 504 cores. In the default case of 1 task per core,
this will lead to 17 SCU10 nodes with 28 tasks each, and 1 SCU10 node
with 4 tasks, leaving 24 cores unused on the final node.

The point we want to emphasize is that you should try to avoid
specifying a node count in your jobs - specify numbers for cores,
tasks, and memory, and let SLURM do the allocations. Ultimately, job
scripts that avoid specifying cluster topology will be the most
portable and durable. The more you know about the resource
requirements of your job, the better SLURM can ensure you get what
you need, when you need it.

Next, consider your actual C/Fortran code. The SCU10 nodes differ in
both hardware and software from the Sandy Bridge nodes on discover.
The SLES 11 SP3 operating system on SCU10 is 2 releases beyond the
SLES 11 SP1 version on older nodes. As is often the case, operating
system upgrades like this can require recompiling your code just to
get it to run efficiently (or at all...). In such cases, your
existing build procedures should work with minimal changes. However,
the new Haswell processors in the SCU10 nodes provide many new
features, such as the 256-bit-wide AVX2 vector registers. When
properly used, these features can provide significant performance
improvements. Therefore, once you have your code operating correctly
on SCU10, it will probably be worthwhile to examine your compiler
options and other program settings to see if you can take advantage
of the new features. Proceed slowly and cautiously as you change
these settings, and be sure to compare results from new builds with
known good results from previous builds to ensure you do not
introduce a new problem. We will provide guidance and assistance on
optimizing code to use these features as experience with the Haswell
nodes is developed.
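In the meantime, as a hedged illustration only (the module names
below are placeholders, and the right optimization flags depend
entirely on your code), a Haswell-targeted rebuild with the Intel
compilers might look something like this:

...
# Module names are placeholders - run 'module avail' on discover-sp3
# to see the compiler and MPI versions actually installed.
module load comp/intel-15.x mpi/impi-5.x

# -xCORE-AVX2 generates Haswell (AVX2) code that will NOT run on the
# Sandy Bridge nodes; -axCORE-AVX2 builds multiple code paths in a
# single (larger) binary that runs on both.
mpiifort -O2 -xCORE-AVX2 -o myprog.x myprog.f90
...

As noted above, verify results from any such rebuild against known
good output before using it for production work.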
It bears repeating: SCU10 HAS NO SWAP SPACE. The lack of swap space
will be largely ameliorated by the much larger installed memory.
However, if your code does use all of the node memory, your program
will _fail immediately_, rather than trigger a page fault (swap).
Therefore, you should already have a good idea of the memory
requirements of your code during nominal operation. To help monitor
the memory usage of your code, we recommend the use of the "policeme"
utility, already in use on discover. Please contact NCCS User
Services for instructions on how to use policeme to monitor your
memory usage. See the description of the --mem-per-cpu option later
in this document for a discussion of how to ensure you have
sufficient memory available.

If you need to recompile on SCU10, most of the existing compiler and
MPI modules should work fine. We have tested all available
combinations of Intel compiler and MPI modules, and we are continuing
tests with non-Intel compilers and MPI modules. In general, avoid
using the three oldest Intel MPI modules: impi-3.2.2.006, 4.0.3.008,
and 4.1.0.024. These modules are known to cause problems on SCU10
with even a simple MPI program, especially when more than one node is
used. Additionally, to use the advanced features of the Haswell
cores, you will need to use the most recent compiler modules.
Finally, you may need to alter settings in some MPI environment
variables for optimum efficiency.

In addition to the existing Intel MPI modules, we are building
SP3-specific versions of many other MPI modules, such as MVAPICH and
OpenMPI. These will be made available in module form as they are
built and tested. The latest set of available MPI modules can always
be found with:

...
module avail mpi other/mpi
...

The SCU10 nodes (only) can also make use of the vendor-supplied (SGI)
MPI module, which has proven to be relatively robust:

...
module load mpi/sgi-mpt-2.1.1
...

On SCU10, you may encounter run-time problems when large numbers of
nodes are used (>300 or so, but YMMV). One potential problem is that
teardown of the MPI connections when your program completes can take
a very long time (30 minutes or more) when a large number of nodes
are used. The severity of these issues varies with the MPI module in
use.


A note about cron jobs
----------------------

The discover-cron nodes are currently SP1 nodes. Cron jobs that run
code recompiled for SCU10 may require the job to ssh to an SCU10
(SP3) node to work.
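One possible arrangement - assuming your cron job can ssh without a
password from the cron node to a discover-sp3 login node, and with
purely illustrative paths and schedule - is a crontab entry along
these lines:

...
# Illustrative crontab entry: run an SP3-built executable at 02:00
# each day by ssh'ing to a discover-sp3 login node. The paths are
# placeholders; adjust them for your own environment.
0 2 * * * ssh discover-sp3 '/path/to/sp3_build/myprog.x >> /path/to/logs/myprog.log 2>&1'
...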
A note about ensuring sequential job execution
----------------------------------------------

Due to scheduling algorithms (small jobs preferred during the day,
large jobs at night), jobs may not execute in the order in which they
are submitted. This can cause strange failures in job chains. If a
specific job sequence is required, you can use the --dependency
option to sbatch (for example, --dependency=afterok:jobid[:jobid...]).
For example, if you have a post-processing job to run after 3
independent processing jobs have completed, you can do something like
this in your job script:

...
# String to hold the job IDs.
job_ids=''

# Submit the first parallel processing job, save the job ID.
job_id=`sbatch job1.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the second parallel processing job, save the job ID.
job_id=`sbatch job2.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the third parallel processing job, save the job ID.
job_id=`sbatch job3.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Wait for the processing jobs to finish successfully, then run the
# post-processing job.
sbatch --dependency=afterok$job_ids postjob.sh
...

There are many ways to tailor the dependencies, such as varying exit
conditions. See the sbatch man page for details, and ask the NCCS for
help if you're having trouble with this feature (we use it ourselves,
and it works great).

As a temporary alternative to the --dependency option to sbatch, you
can use the special 'serial' quality-of-service (--qos=serial). The
QoS rules ensure that the jobs are executed one at a time.


Coming attraction: shared nodes
-------------------------------

Nodes on SCU10 will initially be exclusively allocated. This means
that if your job is assigned to one or more SCU10 nodes, your job
will be the _only_ job running on those nodes. This is how
allocations on discover are currently managed - on older nodes,
allocation of a node is exclusive to a single job.

However, exclusive allocation can lead to significant inefficiencies
in resource use. For example, if you only need 8 cores on a node,
then 20 cores are going unused on a Haswell node, and 8 cores on a
Sandy Bridge node. Therefore, we plan to eventually transition SCU10
(and possibly older nodes) to a shared operational mode. That is, if
you are only using 8 cores on a 28-core SCU10 node, the other 20
cores may be made available to other jobs, even if those jobs are
submitted by other users.

Node sharing will not be happening _immediately_, but it will be
happening _eventually_. Users should start examining their codes and
scripts to determine what the optimal distribution of resources needs
to be. This means you have to understand not only how many cores you
need, but how much memory each core will require. For example, if you
need 20 cores, and each core needs 4 GB of memory, that's a total of
80 GB of memory. That's more cores and memory than was available on
any older node on discover, but it still leaves 8 cores and 48 GB of
memory available on a Haswell node (40 GB allowing for the operating
system). Therefore, when you make your job request, you should
include this information as options for sbatch:

...
sbatch --ntasks=20 --mem=81920 myjob.sh
...

Note that the memory requirement must be specified in megabytes.
Alternatively, if your job uses one core per MPI task (the usual
case), you can specify the memory on a per-core basis:

...
sbatch --ntasks=20 --mem-per-cpu=4096 myjob.sh
...

When nodes are shared, either of these commands will allow the unused
on-node resources (cores or memory) to be allocated to other jobs.

You should also keep in mind that shared nodes may (will?) result in
new problems we have not seen before. Interactions between multiple
on-node jobs, especially I/O functions, may cause unexpected changes
in performance. When this happens, please contact us for assistance
in identifying and resolving the problem.

Keep the following points in mind (see the sketch after this list):

  * If you absolutely must have exclusive access to a node, use the
    --exclusive option to sbatch. We do _not_ want you to do this, as
    a general rule. It should only be used in an emergency situation,
    and it would be very helpful if you notified NCCS staff of your
    desire to use this option.

  * If you know that (for some reason) you will need a minimum number
    of cores on each allocated node, use the --mincpus=N option to
    sbatch. Otherwise, just specify the number of tasks you need, and
    let SLURM do the core allocation.
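A minimal sketch of how these two options look in a job script header
(the core count shown is arbitrary):

...
# Exclusive access - emergency use only; please let NCCS staff know.
#SBATCH --exclusive

# Or: require at least 14 cores on each node allocated to the job
# (the value 14 is only an example).
#SBATCH --mincpus=14
...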
Coming attractions: SCU11
-------------------------

  * 600 nodes, same node specs as SCU10.
  * Possibly available in March 2015.
  * Initial availability may be via a SLURM "reservation" for testing
    (see the example below).
  * All nodes should be general-use.
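If a test reservation is used, something like the following should
let you see it and submit into it. The reservation name below is made
up for this example; the scontrol and sbatch options themselves are
standard SLURM:

...
# List any reservations currently defined on the system.
scontrol show reservation

# Submit a job into a test reservation you have been granted access
# to (the name 'scu11_test' is hypothetical).
sbatch --reservation=scu11_test --constraint=hasw myjob.sh
...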