// Performance Analysis Tools

There are multiple ways to improve your program's performance:

  • Reducing execution time of the program
  • Achieving good scaling to a larger number of processes

The time command

The most important goal of performance tuning is to reduce the program's wall clock time. Use the time command to see the overall timing of your program, for example: $ time ./executable
... or ...
$ time mpirun -perhost 6 -np 102 ./mpi_executable

The output of the time command is in minutes and seconds, e.g.:

real   33m44.268s
user   191m53.336s
sys    2m3.520s

The real time is the elapsed time from the beginning to the end of the program. The user time, also known as CPU time, is the time used by the program itself and any library subroutines it calls. The sys time is the time used by system calls invoked by the program, including performing I/O, writing to disks, printing to the screen, etc. It is worthwhile to minimize the system time by speeding up the disk I/O. Doing I/O in parallel or in the background while the CPUs compute in the foreground is a good tuning consideration.

real * number_of_processors_used_on_the_host should be nearly equal to user + sys. The small difference between them is caused by the I/O time required to bring in the program's text and data, the I/O time needed to acquire real memory for the program to use, and CPU time used by the operating system.
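For example, with the sample output above, user + sys is roughly 191m53s + 2m4s ≈ 194 minutes, while real is about 33.7 minutes; the ratio 194 / 33.7 ≈ 5.8 suggests the run effectively kept about six processors busy on the host.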

The gprof command

The GNU profiler gprof is a useful tool for locating hot spots in a program. It displays the following information:

  • The percentage of CPU time taken by each function and all functions it calls (its descendants).
  • A breakdown of time used by each function and its descendants.

How to use gprof:

  1. Compile and link all code with the -pg option.
  2. Run the program as usual. When it completes you should have a binary file called gmon.out, which contains runtime statistics.
  3. View the profile statistics by typing gprof followed by the name of the executable and the gmon.out file; this produces a text report of the functions within the program (call graph and CPU time spent in each subroutine). For example (a complete session is sketched after this list): $ gprof ./myexecutable gmon.out
    ... or ...
    $ gprof ./myexecutable gmon.out > gprof.out
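
Putting the three steps together, a minimal session might look like the following sketch (the source file myprog.f90 and the use of the GNU compiler are assumptions for illustration; use whichever compiler supports -pg on your system):

# 1. Compile and link with profiling enabled
$ gfortran -pg -O2 -o myexecutable myprog.f90
# 2. Run as usual; this writes gmon.out in the working directory
$ ./myexecutable
# 3. Generate the text report (flat profile and call graph)
$ gprof ./myexecutable gmon.out > gprof.out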

MPI profiling: mpiP

mpiP, a lightweight profiling library for MPI applications, is a very useful tool for collecting statistical information on the MPI functions. mpiP generates considerably less overhead and much less data than tracing tools. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into a single output file.

Using mpiP is very simple. Here are the steps:

  1. First, load the mpiP module along with the other required modules: $ module load comp/intel-15.0.2.164
    $ module load mpi/impi-5.1.1.109
    $ module load other/mpip
  2. The application does not have to be recompiled to use mpiP; mpiP is a link-time library and will work without -g. Just include the following libraries during compilation/linking: mpif90 test.f90 -lmpiP -lm -lbfd -liberty -lunwind -lz
    The libraries (-lbfd -liberty) provide support for decoding the symbol information; they are part of the GNU binutils.
  3. Run your application as usual (a complete example session follows this list). You can verify that mpiP is working by looking at the end of the standard output file of your job. You should see something like:
    mpiP:
    mpiP: Storing mpiP output in [./global_fcst.mpip.36.8827.1.mpiP].
    mpiP:
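
Putting the three steps together, an example session might look like the following sketch (the source file test.f90 and the process count are placeholders; the modules are those listed in step 1):

$ module load comp/intel-15.0.2.164
$ module load mpi/impi-5.1.1.109
$ module load other/mpip
# Link against the mpiP libraries; no recompilation with -g is needed
$ mpif90 test.f90 -lmpiP -lm -lbfd -liberty -lunwind -lz
# Run as usual; the report is written to a *.mpiP file in the working directory
$ mpirun -np 4 ./a.out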

TAU performance system

TAU is a powerful performance evaluation tool that supports both parallel profiling and tracing. Profiling and tracing allow you not only to measure time but also to see hardware performance counters from the CPUs.

TAU can automatically instrument your source code, including routines, loops, I/O, memory, etc. To use TAU's automatic source instrumentation, you will have to set some environment variables and substitute the compiler name with a TAU shell script. For detailed information on TAU, refer to http://www.cs.uoregon.edu/Research/tau/home.php

HOW TO USE TAU TO GENERATE A FLAT AND CALLPATH PROFILE

  1. First, add the TAU bin directory to your PATH in your shell startup file. For csh/tcsh: setenv PATH ${PATH}:/usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/bin
  2. Then set TAU_MAKEFILE either in your shell startup file or in your code compilation script. For csh/tcsh: $ setenv TAU_MAKEFILE /usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
  3. Compile your code with a TAU shell script. You may edit your Makefile and change "ifort" or "mpif90" to tau_f90.sh, i.e., change from "mpif90 $(OPTS) -o ./myexec myprog.f90" to: $ tau_f90.sh -o ./myexec myprog.f90
  4. If you only need a flat profile, just run your application as usual (with mpirun and qsub). To see a callpath profile, set the two variables below before running the application.
    $ setenv TAU_CALLPATH 1
    This enables callpath profiling; the default is 0 (disabled).
    $ setenv TAU_CALLPATH_DEPTH 100
    TAU will record each event's callpath to the depth set by TAU_CALLPATH_DEPTH; the default is 2.
  5. After the job is complete, there will be multiple profile data files, one per MPI rank, named profile.x.0.0.
  6. Use ParaProf, a parallel profile visualization tool bundled with the TAU package, to display the profile data. Run the following in the same directory where the profile.x.0.0 files were generated; launching ParaProf will bring up a manager window and a window displaying the profile data (a complete example session follows this list).
    $ paraprof --pack app.ppk
    $ paraprof app.ppk
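
Putting steps 1 through 6 together, a csh/tcsh session might look like the following sketch (myprog.f90, myexec, the process count, and app.ppk are placeholders; the paths are those from steps 1 and 2):

$ setenv PATH ${PATH}:/usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/bin
$ setenv TAU_MAKEFILE /usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
# Build with the TAU compiler wrapper instead of mpif90
$ tau_f90.sh -o ./myexec myprog.f90
# Optional: enable callpath profiling before the run
$ setenv TAU_CALLPATH 1
$ setenv TAU_CALLPATH_DEPTH 100
# Run as usual; one profile.x.0.0 file is written per rank
$ mpirun -np 4 ./myexec
# Pack the per-rank profiles and open them in ParaProf
$ paraprof --pack app.ppk
$ paraprof app.ppk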

HOW TO USE TAU TO CREATE AN EVENT TRACE

  1. Follow steps 1-3 in the previous example.
  2. Enable the trace by setting TAU_TRACE and then run the application. $ setenv TAU_TRACE 1
  3. After the job is complete, there will be multiple trace data files, one per MPI rank, named tautrace.x.0.0.trc.
  4. In the same directory where the binary trace data were generated, first merge the binary traces from different CPUs to create two summary files, tau.trc and tau.edf with "tau_treemerge.pl", and then convert TAU traces to slog2 format with "tau2slog2": $ tau_treemerge.pl
    $ tau2slog2 tau.trc tau.edf -o app.slog2
  5. Launching Jumpshot, a Java-based trace file visualization tool also bundled with the TAU package, will bring up the main display window showing the entire trace (a complete example session follows this list). $ jumpshot app.slog2
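
Similarly, a trace session for csh/tcsh might look like the following sketch (the executable name, process count, and app.slog2 are placeholders; the build setup is the same as in steps 1-3 of the previous example):

# Enable tracing before the run
$ setenv TAU_TRACE 1
$ mpirun -np 4 ./myexec
# Merge the per-rank tautrace.*.trc files into tau.trc and tau.edf
$ tau_treemerge.pl
# Convert the merged trace to slog2 format and view it in Jumpshot
$ tau2slog2 tau.trc tau.edf -o app.slog2
$ jumpshot app.slog2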

Scalasca

Scalasca is a powerful and user-friendly profiling tool with a GUI. The Scalasca modules available on Discover can be listed with "module avail other/scalasca". Scalasca can be used as illustrated in the following: $ module load comp/intel-15.0.3.187
$ module load mpi/impi-4.1.0.024
$ module load other/scorep-2.0.2-intel-15.0.3.187_impi-4.1.0.024
$ module load other/scalasca-2.3.1-intel-15.0.3.187_impi-4.1.0.024
$ scalasca -instrument mpif90 Filename.f90

If you have more source files, modify your Makefile so that the compile command is prefixed with the first two terms above (scalasca -instrument); see the Makefile sketch after these commands.
# Specify the correct number of processors and the executable name
$ scalasca -analyze mpirun -np 2 ./a.out
# dir_name is the name of directory created by scalasca
$ scalasca -examine dir_name
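
For a project with several source files, one possible Makefile fragment (a sketch only; the file names, variables, and flags are placeholders) is:

# Prepend "scalasca -instrument" to the usual MPI compiler
FC     = scalasca -instrument mpif90
FFLAGS = -O2

a.out: main.f90 solver.f90
	$(FC) $(FFLAGS) -o a.out main.f90 solver.f90   # recipe line must begin with a tab

Build with make as usual, then continue with "scalasca -analyze" and "scalasca -examine" as shown above.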