An Introduction to SCU10 (and looking forward to SCU11)
--------------------------------------------------------

$Id: scu10.txt,v 1.8 2015/03/09 16:22:07 ewinter Exp $


Introduction
------------

The addition of SCU10 (and soon, SCU11) represents a significant
increase in the total computing capacity available on discover.
However, this increased capacity is in a more "concentrated" form -
the nodes on SCU10 have 28 cores each, as compared to the 12 cores on
the old Westmere nodes, and 16 cores on the existing Sandy Bridge
nodes. Therefore, the total number of nodes will be decreasing at the
same time the total compute capacity is increasing. This new set of
resources will require some changes on the part of the users and
developers to ensure efficient use. This document will provide a
quick introduction to what SCU10 is, and what you may need to do to
get your code to run efficiently on it.


What is SCU10?
--------------

SCU10 is the latest addition to the discover cluster. It replaced the
older SCU7, which contained 1200 12-core Westmere nodes. The SCU10
hardware has the following characteristics:

  * 1080 Intel Haswell nodes
  * 2 sockets/node, 14 cores/socket = 28 cores/node
  * 128 GB memory, ~120 GB available to user
  * SLES 11 SP3
  * Local scratch space
  * NO SWAP SPACE!
  * FDR (Fourteen Data Rate) Infiniband


How can SCU10 be used?
----------------------

Currently, the nodes in SCU10 are divided into two sets:

    720 nodes for general compute use
    360 nodes available for a dedicated project

The nodes in SCU10 are currently available as part of the 'compute'
partition. To request that your job run on SCU10 (and thus on a
Haswell node), use the --constraint option to the SLURM sbatch
command. Using --constraint, you specify the special features that
you require in your allocated nodes. For SCU10, this can be one of
two values:

1) Request a Haswell node:

   #SBATCH --constraint=hasw

2) Request a node running the SLES11 SP3 operating system:

   #SBATCH --constraint=sp3

Note that the Haswell nodes in the old 'sp3' partition have been
merged into the general-use 'compute' partition. References to
'--partition=sp3' will now fail with an error indicating an invalid
partition. If you encounter this problem, simply change:

   #SBATCH --partition=sp3

to

   #SBATCH --constraint=hasw

or

   #SBATCH --constraint=sp3

and that should fix the problem.

In the future, most of the old Sandy Bridge nodes will be migrated to
the SP3 version of SLES11. When that happens, the 'compute' partition
will contain a mix of Sandy Bridge and Haswell nodes, and a mix of
SP1 and SP3 versions of SLES 11. At that point, if you need to
specify a particular processor type for your job (possibly for
compatibility reasons), then you _must_ specify the processor type
with the --constraint option to sbatch:

...
# This command will allocate Haswell (SCU10) nodes.
sbatch --constraint=hasw myjob.sh
...

or:

...
# This command will allocate Sandy Bridge (SCU8,9) nodes.
sbatch --constraint=sand myjob.sh
...

Since these commands do not specify a partition, they will be sent to
the default 'compute' partition, and the --constraint option will
ensure that nodes with the proper processor type are allocated to the
job. Once Sandy Bridge nodes are migrated to SP3, --constraint=sp3
will allocate either Sandy Bridge or Haswell nodes, so avoid using
that option unless your software must run on SP1. Note that the goal
is to move nearly all Sandy Bridge nodes to SP3 by late summer/early
fall 2015. However, some SP1-based nodes may remain for several
months after that point.
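If you want to check which features are currently attached to nodes
in the 'compute' partition (and how many nodes carry each feature),
you can ask SLURM directly. The following is only a sketch using
standard sinfo/scontrol options; the exact output layout depends on
the SLURM version installed:

...
# Show node counts, node state, and the feature list (e.g. hasw, sand,
# sp1, sp3) for the 'compute' partition.
sinfo -p compute -o "%D %t %f"

# Show full details (including the Features= field) for a single node.
# Replace <nodename> with an actual node name.
scontrol show node <nodename>
...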
A new set of login nodes has been set up to provide easy access to
the computing environment that is available on the SCU10 nodes. You
can log in to 'discover-sp3' instead of 'discover'. These login nodes
are configured identically to the SCU10 compute nodes (28 Haswell
cores, 128 GB memory), and so code that you compile on these login
nodes should run without problems on the SCU10 Haswell nodes.


The fine print: possible user changes for SCU10
-----------------------------------------------

There is a (small but non-zero) chance that your code will run on
SCU10 without recompiling, and your job scripts may run without any
changes. If that is the case, you're lucky - just use the same code
you have always used, but now run it on SCU10. The more likely case
is that changes of varying degrees of complexity to your job scripts
and/or program code will be required.

First, consider your job script. Some of the script changes that may
be needed to ensure SCU10 use were described in the previous section.
However, the new hardware arrangement (many more cores and _much_
more memory per node, and NO SWAP SPACE) will likely lead to changes
to your job submission scripts to take best advantage of the new
capabilities. Since the memory/core is higher, and the cores/node is
higher, you should examine the distribution of tasks in your job
script for possible improvements.

For example, if your (old) job script requests 480 Sandy Bridge
cores, your job submission script might include the lines:

...
#SBATCH --constraint=sand
#SBATCH --ntasks=480
...

or the lines:

...
#SBATCH --constraint=sand
#SBATCH --nodes=30
...

Note that we did not specify the memory per task, since the default
task distribution is 1 task per core, and the Sandy Bridge nodes have
2 GB memory per core.

To run this job in the 'compute' partition, on the SCU10 Haswell
nodes, you should use:

...
#SBATCH --constraint=hasw
#SBATCH --ntasks=480
...

This set of options will get you CEIL(480/28) = 18 Haswell nodes, or
a total of 18*28 = 504 cores. In the default case of 1 task per core,
this will lead to 17 SCU10 nodes with 28 tasks each, and 1 SCU10 node
with 4 tasks, leaving 24 cores unused on the final node.

The point we want to emphasize is that you should try to avoid
specifying a node count in your jobs - specify numbers for cores,
tasks, and memory, and let SLURM do the allocations. Ultimately, job
scripts that avoid specifying cluster topology will be the most
portable and durable. The more you know about the resource
requirements of your job, the better SLURM can ensure you get what
you need, when you need it.

Next, consider your actual C/Fortran code. The SCU10 nodes differ in
both hardware and software from the Sandy Bridge nodes on discover.
The SLES 11 SP3 operating system on SCU10 is 2 releases beyond the
SLES 11 SP1 version on older nodes. As is often the case, operating
system upgrades like this can require recompiling your code just to
get it to run efficiently (or at all...). In such cases, your
existing build procedures should work with minimal changes. However,
the new Haswell processors in the SCU10 nodes provide many new
features, such as the 256-bit-wide AVX2 vector registers. When
properly used, these features can provide significant performance
improvements. Therefore, once you have your code operating correctly
on SCU10, it will probably be worthwhile to examine your compiler
options and other program settings to see if you can take advantage
of the new features. Proceed slowly and cautiously as you change
these settings, and be sure to compare results from new builds with
known good results from previous builds to ensure you do not
introduce a new problem. We will provide guidance and assistance on
optimizing code to use these features as experience with the Haswell
nodes is developed.
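In the meantime, as a hedged illustration only (the module names
below are placeholders, and the right optimization flags depend
entirely on your code), a Haswell-targeted rebuild with the Intel
compilers might look something like this:

...
# Module names are placeholders - run 'module avail' on discover-sp3
# to see the compiler and MPI versions actually installed.
module load comp/intel-15.x mpi/impi-5.x

# -xCORE-AVX2 generates Haswell (AVX2) code that will NOT run on the
# Sandy Bridge nodes; -axCORE-AVX2 builds multiple code paths in a
# single (larger) binary that runs on both.
mpiifort -O2 -xCORE-AVX2 -o myprog.x myprog.f90
...

As noted above, verify results from any such rebuild against known
good output before using it for production work.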
It bears repeating: SCU10 HAS NO SWAP SPACE. The lack of swap space
will be largely ameliorated by the much larger installed memory.
However, if your code does use all of the node memory, your program
will _fail immediately_, rather than trigger a page fault (swap).
Therefore, you should already have a good idea of the memory
requirements of your code during nominal operation. To help monitor
the memory usage of your code, we recommend the use of the "policeme"
utility, already in use on discover. Please contact NCCS User
Services for instructions on how to use policeme to monitor your
memory usage. See the description of the --mem-per-cpu option later
in this document for a discussion of how to ensure you have
sufficient memory available.

If you need to recompile on SCU10, most of the existing compiler and
MPI modules should work fine. We have tested all available
combinations of Intel compiler and MPI modules, and we are continuing
tests with non-Intel compilers and MPI modules. In general, avoid
using the three oldest Intel MPI modules: impi-3.2.2.006, 4.0.3.008,
and 4.1.0.024. These modules are known to cause problems on SCU10
with even a simple MPI program, especially when more than one node is
used. Additionally, to use the advanced features of the Haswell
cores, you will need to use the most recent compiler modules.
Finally, you may need to alter settings in some MPI environment
variables for optimum efficiency.

In addition to the existing Intel MPI modules, we are building
SP3-specific versions of many other MPI modules, such as MVAPICH and
OpenMPI. These will be made available in module form as they are
built and tested. The latest set of available MPI modules can always
be found with:

...
module avail mpi other/mpi
...

The SCU10 nodes (only) can also make use of the vendor-supplied (SGI)
MPI module, which has proven to be relatively robust:

...
module load mpi/sgi-mpt-2.1.1
...

On SCU10, you may encounter run-time problems when large numbers of
nodes are used (>300 or so, but YMMV). One potential problem is that
teardown of the MPI connections when your program completes can take
a very long time (30 minutes or more) when a large number of nodes
are used. The severity of these issues varies with the MPI module in
use.


A note about cron jobs
----------------------

The discover-cron nodes are currently SP1 nodes. Cron jobs that run
code recompiled for SCU10 may require the job to ssh to an SCU10
(SP3) node to work.
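One possible arrangement - assuming your cron job can ssh without a
password from the cron node to a discover-sp3 login node, and with
purely illustrative paths and schedule - is a crontab entry along
these lines:

...
# Illustrative crontab entry: run an SP3-built executable at 02:00
# each day by ssh'ing to a discover-sp3 login node. The paths are
# placeholders; adjust them for your own environment.
0 2 * * * ssh discover-sp3 '/path/to/sp3_build/myprog.x >> /path/to/logs/myprog.log 2>&1'
...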
A note about ensuring sequential job execution
----------------------------------------------

Due to scheduling algorithms (small jobs preferred during the day,
large jobs at night), jobs may not execute in the order in which they
are submitted. This can cause strange failures in job chains. If a
specific job sequence is required, you can use the --dependency
option to sbatch (for example, --dependency=afterok:jobid[:jobid...]).
For example, if you have a post-processing job to run after 3
independent processing jobs have completed, you can do something like
this in your job script:

...
# String to hold the job IDs.
job_ids=''

# Submit the first parallel processing job, save the job ID.
job_id=`sbatch job1.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the second parallel processing job, save the job ID.
job_id=`sbatch job2.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Submit the third parallel processing job, save the job ID.
job_id=`sbatch job3.sh | cut -d ' ' -f 4`
job_ids="$job_ids:$job_id"

# Wait for the processing jobs to finish successfully, then run the
# post-processing job.
sbatch --dependency=afterok$job_ids postjob.sh
...

There are many ways to tailor the dependencies, such as varying exit
conditions. See the sbatch man page for details, and ask the NCCS for
help if you're having trouble with this feature (we use it ourselves,
and it works great).

As a temporary alternative to the --dependency option to sbatch, you
can use the special 'serial' quality-of-service (--qos=serial). The
QoS rules ensure that the jobs are executed one at a time.


Coming attraction: shared nodes
-------------------------------

Nodes on SCU10 will initially be exclusively allocated. This means
that if your job is assigned to one or more SCU10 nodes, your job
will be the _only_ job running on those nodes. This is how
allocations on discover are currently managed - on older nodes,
allocation of a node is exclusive to a single job.

However, exclusive allocation can lead to significant inefficiencies
in resource use. For example, if you only need 8 cores on a node,
then 20 cores are going unused on a Haswell node, and 8 cores on a
Sandy Bridge node. Therefore, we plan to eventually transition SCU10
(and possibly older nodes) to a shared operational mode. That is, if
you are only using 8 cores on a 28-core SCU10 node, the other 20
cores may be made available to other jobs, even if those jobs are
submitted by other users.

Node sharing will not be happening _immediately_, but it will be
happening _eventually_. Users should start examining their codes and
scripts to determine what the optimal distribution of resources needs
to be. This means you have to understand not only how many cores you
need, but how much memory each core will require. For example, if you
need 20 cores, and each core needs 4 GB of memory, that's a total of
80 GB of memory. That's more cores and memory than was available on
any older node on discover, but it still leaves 8 cores and 48 GB of
memory available on a Haswell node (40 GB allowing for the operating
system). Therefore, when you make your job request, you should
include this information as options for sbatch:

...
sbatch --ntasks=20 --mem=81920 myjob.sh
...

Note that the memory requirement must be specified in megabytes.
Alternatively, if your job uses one core per MPI task (the usual
case), you can specify the memory on a per-core basis:

...
sbatch --ntasks=20 --mem-per-cpu=4096 myjob.sh
...

When nodes are shared, either of these commands will allow the unused
on-node resources (cores or memory) to be allocated to other jobs.

You should also keep in mind that shared nodes may (will?) result in
new problems we have not seen before. Interactions between multiple
on-node jobs, especially I/O functions, may cause unexpected changes
in performance. When this happens, please contact us for assistance
in identifying and resolving the problem.

Keep the following points in mind (see the sketch after this list):

  * If you absolutely must have exclusive access to a node, use the
    --exclusive option to sbatch. We do _not_ want you to do this, as
    a general rule. It should only be used in an emergency situation,
    and it would be very helpful if you notified NCCS staff of your
    desire to use this option.

  * If you know that (for some reason) you will need a minimum number
    of cores on each allocated node, use the --mincpus=N option to
    sbatch. Otherwise, just specify the number of tasks you need, and
    let SLURM do the core allocation.
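A minimal sketch of how these two options look in a job script header
(the core count shown is arbitrary):

...
# Exclusive access - emergency use only; please let NCCS staff know.
#SBATCH --exclusive

# Or: require at least 14 cores on each node allocated to the job
# (the value 14 is only an example).
#SBATCH --mincpus=14
...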
Coming attractions: SCU11
-------------------------

  * 600 nodes, same node specs as SCU10.
  * Possibly available in March 2015.
  * Initial availability may be via a SLURM "reservation" for testing
    (see the example below).
  * All nodes should be general-use.
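If a test reservation is used, something like the following should
let you see it and submit into it. The reservation name below is made
up for this example; the scontrol and sbatch options themselves are
standard SLURM:

...
# List any reservations currently defined on the system.
scontrol show reservation

# Submit a job into a test reservation you have been granted access
# to (the name 'scu11_test' is hypothetical).
sbatch --reservation=scu11_test --constraint=hasw myjob.sh
...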