// Performance Analysis Tools

There are multiple ways to improve your program's performance:

  • Reducing execution time of the program
  • Achieving good scaling to a larger number of processes

The time command

The most important goal of performance tuning is to reduce the program's wall clock time. Use the time command to see the overall timing of your program, for example: $ time ./executable
... or ...
$ time mpirun -perhost 6 -np 102 ./mpi_executable

The output of the time command is in minutes and seconds, e.g.:

real   33m44.268s
user   191m53.336s
sys    2m3.520s

The real time is the elapsed time from the beginning to the end of the program. The user time, also known as CPU time, is the time used by the program itself and any library subroutines it calls. The sys time is the time used by system calls invoked by the program, including performing I/O, writing to disks, printing to the screen, etc. It is worthwhile to minimize the system time by speeding up the disk I/O. Doing I/O in parallel or in the background while the CPUs compute in the foreground is a good tuning consideration.

real * number_of_processors_used_on_the_host should be nearly equal to user + sys. The small difference between them is caused by the I/O time required to bring in the program's text and data, the I/O time needed to acquire real memory for the program to use, and CPU time used by the operating system.
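For example, with the sample output above, user + sys is roughly 191m53s + 2m4s ≈ 194 minutes, while real is about 33.7 minutes; the ratio 194 / 33.7 ≈ 5.8 suggests the run effectively kept about six processors busy on the host.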

The gprof command

The GNU profiler gprof is a useful tool for locating hot spots in a program. It displays the following information:

  • The percentage of CPU time taken by each function and all functions it calls (its descendants).
  • A breakdown of time used by each function and its descendants.

How to use gprof:

  1. Compile and link all code with the -pg option.
  2. Run the program as usual. When it completes you should have a binary file called gmon.out, which contains runtime statistics.
  3. View the profile statistics by typing gprof followed by the name of the executable and the gmon.out file; this produces a text report of the functions within the program (call graph and CPU time spent in each subroutine). For example (a complete session is sketched after this list): $ gprof ./myexecutable gmon.out
    ... or ...
    $ gprof ./myexecutable gmon.out > gprof.out
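
Putting the three steps together, a minimal session might look like the following sketch (the source file myprog.f90 and the use of the GNU compiler are assumptions for illustration; use whichever compiler supports -pg on your system):

# 1. Compile and link with profiling enabled
$ gfortran -pg -O2 -o myexecutable myprog.f90
# 2. Run as usual; this writes gmon.out in the working directory
$ ./myexecutable
# 3. Generate the text report (flat profile and call graph)
$ gprof ./myexecutable gmon.out > gprof.out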

MPI profiling: mpiP

mpiP, a lightweight profiling library for MPI applications, is a very useful tool for collecting statistical information on the MPI functions. mpiP generates considerably less overhead and much less data than tracing tools. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into a single output file.

Using mpiP is very simple. Here are the steps:

  1. First, load the mpiP module along with the other required modules: $ module load comp/intel-15.0.2.164
    $ module load mpi/impi-5.1.1.109
    $ module load other/mpip
  2. The application does not have to be recompiled to use mpiP; mpiP is a link-time library and will work without -g. Just include the following libraries during compilation/linking: mpif90 test.f90 -lmpiP -lm -lbfd -liberty -lunwind -lz
    The libraries (-lbfd -liberty) provide support for decoding the symbol information; they are part of the GNU binutils.
  3. Run your application as usual (a complete example session follows this list). You can verify that mpiP is working by looking at the end of the standard output file of your job. You should see something like:
    mpiP:
    mpiP: Storing mpiP output in [./global_fcst.mpip.36.8827.1.mpiP].
    mpiP:
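
Putting the three steps together, an example session might look like the following sketch (the source file test.f90 and the process count are placeholders; the modules are those listed in step 1):

$ module load comp/intel-15.0.2.164
$ module load mpi/impi-5.1.1.109
$ module load other/mpip
# Link against the mpiP libraries; no recompilation with -g is needed
$ mpif90 test.f90 -lmpiP -lm -lbfd -liberty -lunwind -lz
# Run as usual; the report is written to a *.mpiP file in the working directory
$ mpirun -np 4 ./a.out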

TAU performance system

TAU is a powerful performance evaluation tool that supports both parallel profiling and tracing. Profiling and tracing allow you not only to measure time but also to see hardware performance counters from the CPUs.

TAU can automatically instrument your source code, including routines, loops, I/O, memory, etc. To use TAU's automatic source instrumentation, you will have to set some environment variables and substitute the compiler name with a TAU shell script. For detailed information on TAU, refer to http://www.cs.uoregon.edu/Research/tau/home.php

HOW TO USE TAU TO GENERATE A FLAT AND CALLPATH PROFILE

  1. First, add the TAU bin directory to your PATH in your shell startup file. For csh/tcsh: setenv PATH ${PATH}:/usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/bin
  2. Then set TAU_MAKEFILE either in your shell startup file or in your code compilation script. For csh/tcsh: $ setenv TAU_MAKEFILE /usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
  3. Compile your code with a TAU shell script. You may edit your Makefile and change "ifort" or "mpif90" to tau_f90.sh, i.e., change from "mpif90 $(OPTS) -o ./myexec myprog.f90" to: $ tau_f90.sh -o ./myexec myprog.f90
  4. If you only need a flat profile, just run your application as usual (with mpirun and qsub). To see a callpath profile, set the two variables below before running the application.
    $ setenv TAU_CALLPATH 1
    This enables callpath profiling; the default is 0 (disabled).
    $ setenv TAU_CALLPATH_DEPTH 100
    TAU will record each event's callpath to the depth set by TAU_CALLPATH_DEPTH; the default is 2.
  5. After the job is complete, there will be multiple profile data files, one per MPI rank, named profile.x.0.0.
  6. Use ParaProf, a parallel profile visualization tool bundled with the TAU package, to display the profile data. Run the following in the same directory where the profile.x.0.0 files were generated; launching ParaProf will bring up a manager window and a window displaying the profile data (a complete example session follows this list).
    $ paraprof --pack app.ppk
    $ paraprof app.ppk
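
Putting steps 1 through 6 together, a csh/tcsh session might look like the following sketch (myprog.f90, myexec, the process count, and app.ppk are placeholders; the paths are those from steps 1 and 2):

$ setenv PATH ${PATH}:/usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/bin
$ setenv TAU_MAKEFILE /usr/local/other/tau/intel-11.0.083_impi-3.2.2.006/tau-2.19.2/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
# Build with the TAU compiler wrapper instead of mpif90
$ tau_f90.sh -o ./myexec myprog.f90
# Optional: enable callpath profiling before the run
$ setenv TAU_CALLPATH 1
$ setenv TAU_CALLPATH_DEPTH 100
# Run as usual; one profile.x.0.0 file is written per rank
$ mpirun -np 4 ./myexec
# Pack the per-rank profiles and open them in ParaProf
$ paraprof --pack app.ppk
$ paraprof app.ppk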

HOW TO USE TAU TO CREATE AN EVENT TRACE

  1. Follow steps 1-3 in the previous example.
  2. Enable the trace by setting TAU_TRACE and then run the application. $ setenv TAU_TRACE 1
  3. After the job is complete, there will be multiple trace data files, one per MPI rank, named tautrace.x.0.0.trc.
  4. In the same directory where the binary trace data were generated, first merge the binary traces from different CPUs to create two summary files, tau.trc and tau.edf with "tau_treemerge.pl", and then convert TAU traces to slog2 format with "tau2slog2": $ tau_treemerge.pl
    $ tau2slog2 tau.trc tau.edf -o app.slog2
  5. Launching Jumpshot, a Java-based trace file visualization tool also bundled with the TAU package, will bring up the main display window showing the entire trace (a complete example session follows this list). $ jumpshot app.slog2
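
Similarly, a trace session for csh/tcsh might look like the following sketch (the executable name, process count, and app.slog2 are placeholders; the build setup is the same as in steps 1-3 of the previous example):

# Enable tracing before the run
$ setenv TAU_TRACE 1
$ mpirun -np 4 ./myexec
# Merge the per-rank tautrace.*.trc files into tau.trc and tau.edf
$ tau_treemerge.pl
# Convert the merged trace to slog2 format and view it in Jumpshot
$ tau2slog2 tau.trc tau.edf -o app.slog2
$ jumpshot app.slog2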

Scalasca

Scalasca is a powerful and user-friendly profiling tool with a GUI. The Scalasca modules available on Discover can be listed with "module avail other/scalasca". Scalasca can be used as illustrated in the following: $ module load comp/intel-15.0.3.187
$ module load mpi/impi-4.1.0.024
$ module load other/scorep-2.0.2-intel-15.0.3.187_impi-4.1.0.024
$ module load other/scalasca-2.3.1-intel-15.0.3.187_impi-4.1.0.024
$ scalasca -instrument mpif90 Filename.f90

If you have more source files, modify your Makefile so that the compile command is prefixed with the first two terms above (scalasca -instrument); see the Makefile sketch after these commands.
# Specify the correct number of processors and the executable name
$ scalasca -analyze mpirun -np 2 ./a.out
# dir_name is the name of directory created by scalasca
$ scalasca -examine dir_name
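
For a project with several source files, one possible Makefile fragment (a sketch only; the file names, variables, and flags are placeholders) is:

# Prepend "scalasca -instrument" to the usual MPI compiler
FC     = scalasca -instrument mpif90
FFLAGS = -O2

a.out: main.f90 solver.f90
	$(FC) $(FFLAGS) -o a.out main.f90 solver.f90   # recipe line must begin with a tab

Build with make as usual, then continue with "scalasca -analyze" and "scalasca -examine" as shown above.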