// Performance Analysis Tools
There are multiple ways to improve your program's performance:
- Reducing the execution time of the program
- Achieving good scaling to a larger number of processes
The time command
The most important goal of performance tuning is to reduce the program's wall clock time. Use the time command to see the overall timing of your program, for example:
$ time ./executable
... or ...
$ time mpirun -perhost 6 -np 102 ./mpi_executable
The output of the time command is in minutes and seconds, e.g.:
real    33m44.268s
user    191m53.336s
sys     2m3.520s
The real time is the elapsed (wall clock) time from the beginning to the end of the program. The user time, also known as CPU time, is the time used by the program itself and any library subroutines it calls. The sys time is the time used by system calls invoked by the program, including performing I/O, writing to disks, printing to the screen, etc. It is worthwhile to minimize the system time by speeding up disk I/O. Doing I/O in parallel, or in the background while the CPUs compute in the foreground, is a good tuning consideration.
The product real * number_of_processors_used_on_the_host should be nearly equal to user + sys time. The small difference between them is caused by the I/O time required to bring in the program's text and data, the I/O time needed to acquire real memory for the program to use, and the CPU time used by the operating system.
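As a rough check with the sample output above (assuming, for illustration only, that the run used 6 processor cores on the host; the output itself does not say): user + sys = 191m53s + 2m04s, which is about 194 minutes, while real * 6 = 33m44s * 6, which is about 202 minutes. The two figures are close, and the gap corresponds to the I/O wait and operating-system overhead described above.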
The gprof command
The GNU profiler gprof is a useful tool for locating hot spots in a program. It displays the following information:
- The percentage of CPU time taken by each function and all functions it calls (its descendants).
- A breakdown of time used by each function and its descendants.
How to use gprof:
- Compile and link all code with the -pg option (an example compile line is shown after the gprof commands below).
- Run the program as usual. When it completes you should have a binary file called gmon.out, which contains runtime statistics.
- View the profile statistics by typing gprof followed by the name of the executable and the gmon.out file, which will construct a text display of the functions within the program (call tree and CPU time spent in every subroutine).
$ gprof ./myexecutable gmon.out
... or ...
$ gprof ./myexecutable gmon.out > gprof.out
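For the compile-and-link step, -pg is simply added to your usual compile line. A minimal sketch, assuming a Fortran MPI code built with the mpif90 wrapper used elsewhere on this page (substitute your own compiler, optimization flags, and file names):
$ mpif90 -pg -O2 -o ./myexecutable myprog.f90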
MPI profiling: mpiP
mpiP, a lightweight profiling library for MPI applications, is a very useful tool for collecting statistical information on the MPI functions. mpiP generates considerably less overhead and much less data than tracing tools. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into a single output file.
Using mpiP is very simple:
- First, find and load an appropriate MPI module:
$ module spider mpi
If you do not have an appropriate compiler module loaded first, lmod will output a list of modules to choose from; load one of them.
$ module load mpi/impi/<version-number>
- The application does not have to be recompiled to use mpiP; it is a link-time library and works without -g. Just include the following libraries during compilation/linking:
$ mpif90 test.f90 -lmpiP -lm -lbfd -liberty -lunwind -lz
The libraries (-lbfd, -liberty) provide support for decoding the symbol information; they are part of the GNU binutils.
- Run your application as usual. You can verify that mpiP is working by looking at the end of the standard output file of your job. You should see something like:
mpiP:
mpiP: Storing mpiP output in [./global_fcst.mpip.36.8827.1.mpiP].
mpiP:
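Optionally, mpiP's run-time behavior can be adjusted through the MPIP environment variable. The example below (csh/tcsh syntax) is a sketch; the available flags can differ between mpiP versions, so check the documentation for the installation you are using:
$ setenv MPIP "-t 10.0 -k 2"
Here -t 10.0 limits the callsite report to callsites that account for at least 10% of MPI time, and -k 2 records two levels of the call stack for each callsite.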
Intel Tools
For the Intel and Intel MPI performance analysis tools, such as Advisor, Inspector, itac, and vtune, run:
$ module spider <tool-name>
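For example, to locate and use VTune from the command line (a sketch: the module name, version placeholder, and analysis type are illustrative, and the vtune driver assumes a recent oneAPI release):
$ module spider vtune
$ module load vtune/<version-number>
$ vtune -collect hotspots -- ./executable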
TAU performance system
TAU is a powerful performance evaluation tool that supports both parallel profiling and tracing. Profiling and tracing allow you not only to measure time but also to read hardware performance counters from the CPUs.
TAU can automatically instrument your source code, including routines, loops, I/O, memory, etc. To use TAU's automatic source instrumentation, you will have to set some environment variables and substitute the compiler name with a TAU shell script. For detailed information on TAU, refer to http://www.cs.uoregon.edu/Research/tau/home.php
HOW TO USE TAU TO GENERATE A FLAT AND CALLTREE PROFILE
- First, set the root path for TAU in your shell startup file. For csh/tcsh:
$ setenv PATH ${PATH}:/usr/local/other/tau/2.30.1/impi/2021.4.0/x86_64/bin
- Then, set TAU_MAKEFILE either in your shell startup file or in your code compilation script. For csh/tcsh:
$ setenv TAU_MAKEFILE /usr/local/other/tau/2.30.1/impi/2021.4.0/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
- Compile your code with a TAU shell script. You may edit your Makefile and change "ifort" or "mpif90" to tau_f90.sh, i.e., change from "mpif90 $(OPTS) -o ./myexec myprog.f90" to:
$ tau_f90.sh -o ./myexec myprog.f90
- If you only need a flat profile, just run your application as usual (with mpirun and qsub). To be able to see a callpath profile, set the two variables below before running the application.
$ setenv TAU_CALLPATH 1
This enables callpath profiling (the default is 0, disabled).
$ setenv TAU_CALLPATH_DEPTH 100
TAU will record each event callpath to the depth set by TAU_CALLPATH_DEPTH (the default is 2).
- After the job is complete, there will be multiple profile data files generated, differentiated by MPI rank and named profile.x.0.0.
- Use ParaProf, a parallel profile visualization tool bundled with the TAU package, to display the profile data. Run the following in the same directory where the profile.x.0.0 files were generated. Launching ParaProf will bring up a manager window and a window displaying the profile data.
$ paraprof --pack app.ppk
$ paraprof app.ppk
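The setenv commands above assume csh/tcsh. If your login shell is bash or sh, the equivalent settings are (same paths and values, only the shell syntax changes):
$ export PATH=${PATH}:/usr/local/other/tau/2.30.1/impi/2021.4.0/x86_64/bin
$ export TAU_MAKEFILE=/usr/local/other/tau/2.30.1/impi/2021.4.0/x86_64/lib/Makefile.tau-callpath-icpc-mpi-compensate-pdt
$ export TAU_CALLPATH=1
$ export TAU_CALLPATH_DEPTH=100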
HOW TO USE TAU TO CREATE AN EVENT TRACE
- Follow steps 1-3 in the previous example.
- Enable the trace by setting TAU_TRACE and then run the application.
$ setenv TAU_TRACE 1
- After the job is complete, there will be multiple trace data files generated, differentiated by MPI rank and named tautrace.x.0.0.trc.
- In the same directory where the binary trace data were generated, first merge the binary traces from different CPUs to create two summary files, tau.trc and tau.edf with "tau_treemerge.pl", and then convert TAU traces to slog2 format with "tau2slog2":
$ tau_treemerge.pl
$ tau2slog2 tau.trc tau.edf -o app.slog2
- Launching Jumpshot, a Java-based trace file visualization tool also bundled with the TAU package, will bring up the main display window showing the entire trace.
$ jumpshot app.slog2
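For bash/sh users, the whole tracing workflow can be summarized as the sketch below (the export line mirrors the setenv command above; the mpirun line is only illustrative, so launch the job through qsub as you normally would):
$ export TAU_TRACE=1
$ mpirun -np 4 ./myexec
$ tau_treemerge.pl
$ tau2slog2 tau.trc tau.edf -o app.slog2
$ jumpshot app.slog2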