This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a Fortran program, and covers these topics:
"Tutorial Overview" introduces the sample program and explains the different scenarios in which it will be used.
"Tutorial Setup" steps you through the necessary setup for running the experiment.
"Analyzing Performance Data" steps you through five different experiments, discussing first how to do the experiments, then how to interpret the results.
Note: Because of inherent differences between systems and also due to concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same. |
This tutorial is based on a standard benchmark program called linpackup. There are three versions of the program: the linpack directory contains files for the n32-bit ABI, the linpack64 directory contains files for the 64-bit ABI and the linpacko32 directory contains files for the 32-bit ABI. Each linpack directory contains versions of the program for a single processor (linpackup) and for multiple processors (linpackd). When you work with the tutorial, choose the version of the program that is most appropriate for your system. A good guideline is to choose whichever version corresponds to the way you expect to develop your programs.
This tutorial was written and tested using the single-processor version of the program (linpackup) in the linpack directory.
The linpack program is a standard benchmark designed to measure CPU performance in solving dense linear equations. The program focuses primarily on floating point performance.
Output from the linpackup program looks like the following:
... norm. resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 times are reported for matrices of order 300 dgefa dgesl total mflops unit ratio times for array with leading dimension of 301 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 times for array with leading dimension of 300 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 1.180E+00 2.000E-02 1.200E+00 1.515E+01 1.320E-01 2.143E+01 1.180E+00 1.000E-02 1.190E+00 1.528E+01 1.309E-01 2.125E+01 1.181E+00 1.200E-02 1.193E+00 1.524E+01 1.312E-01 2.130E+01 |
Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.
Change to the /usr/demos/SpeedShop directory.
Copy the appropriate linpack directory and its contents to a directory where you have write permission:
cp -r linpack_version your_dir |
Change to the directory you just created:
cd your_dir/linpack_version |
Compile the program by entering
make all |
This provides an executable for the experiment.
This section provides steps on how to run the following experiments on the linpackup program and generate and interpret the results:
You can follow the tutorial from start to finish, or you can follow steps for just the experiment(s) you want.
This section takes you through the steps to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the "usertime Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -v -usertime linpackup |
This starts the experiment. The -v flag tells ssrun to print a log to stderr.
Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, usertime, and the experiment ID. In this example, the filename is linpackup.usertime.m17566.
ssrun: target PID 17566 ssrun: setenv _SPEEDSHOP_MARCHING_ORDERS ut:cu ssrun: setenv _SPEEDSHOP_EXPERIMENT_TYPE usertime ssrun: setenv _SPEEDSHOP_TARGET_FILE linpackup ssrun: setenv _RLD_LIST libss.so:libssrt.so:DEFAULT Please send the results of this run to: Jack J. Dongarra Mathematics and Computer Science Division Argonne National Laboratory Argonne, Illinois 60439 Telephone: 312-972-7246 ARPAnet: DONGARRA@ANL-MCS norm. resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 times are reported for matrices of order 300 dgefa dgesl total mflops unit ratio times for array with leading dimension of 301 3.050E+00 3.000E-02 3.080E+00 5.903E+00 3.388E-01 5.500E+01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 5.464E+01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 5.464E+01 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 5.464E+01 times for array with leading dimension of 300 3.030E+00 3.000E-02 3.060E+00 5.941E+00 3.366E-01 5.464E+01 3.040E+00 3.000E-02 3.070E+00 5.922E+00 3.377E-01 5.482E+01 3.040E+00 3.000E-02 3.070E+00 5.922E+00 3.377E-01 5.482E+01 3.034E+00 3.000E-02 3.064E+00 5.933E+00 3.371E-01 5.471E+01 |
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > usertime.results |
prof interprets the type of experiment you have performed and prints results to stdout. The report below shows partial prof output:
----------------------------------------------------------------------- Profile listing generated Mon Nov 18 11:39:36 1996 with: prof linpackup.usertime.m17566 ----------------------------------------------------------------------- Total Time (secs) : 115.11 Total Samples : 3837 Stack backtrace failed: 2 Sample interval (ms) : 30 CPU : R4600 FPU : R4600 Clock : 100.0MHz Number of CPUs : 1 ----------------------------------------------------------------------- index %Samples self descendents total name [1] 99.9% 0.00 115.05 3835 __start [2] 99.9% 0.00 115.05 3835 main [3] 99.9% 0.00 115.05 3835 linp [4] 94.9% 3.12 106.17 3643 dgefa [5] 92.7% 106.71 0.00 3557 daxpy [6] 3.8% 4.35 0.00 145 matgen [7] 0.9% 0.00 1.08 36 dgesl [8] 0.2% 0.27 0.00 9 dscal [9] 0.2% 0.27 0.00 9 idamax [10] 0.2% 0.00 0.18 6 s_wsfe64 [11] 0.2% 0.00 0.18 6 s_wsfe_com [12] 0.2% 0.00 0.18 6 wsfe [13] 0.1% 0.03 0.09 4 f_init [14] 0.1% 0.00 0.06 2 f77canseek [15] 0.1% 0.03 0.03 2 _isatty [16] 0.1% 0.06 0.00 2 dmxpy [17] 0.1% 0.03 0.03 2 s_stop [18] 0.0% 0.03 0.00 1 _mips2_test_and_set [19] 0.0% 0.00 0.03 1 _ftell64 [20] 0.0% 0.00 0.03 1 memset [21] 0.0% 0.03 0.00 1 _blk_init [22] 0.0% 0.03 0.00 1 __oserror [23] 0.0% 0.03 0.00 1 c_sfe [24] 0.0% 0.00 0.03 1 do_ud [25] 0.0% 0.00 0.03 1 check_buflen [26] 0.0% 0.00 0.03 1 _malloc [27] 0.0% 0.00 0.03 1 __malloc [28] 0.0% 0.03 0.00 1 _morecore [29] 0.0% 0.00 0.03 1 pars_f [30] 0.0% 0.00 0.03 1 f_s [31] 0.0% 0.00 0.03 1 f_list [32] 0.0% 0.00 0.03 1 i_tem [33] 0.0% 0.00 0.03 1 ne_d [34] 0.0% 0.03 0.00 1 op_gen [35] 0.0% 0.00 0.03 1 do_fioxr8v [36] 0.0% 0.00 0.03 1 do_fio64_mp [37] 0.0% 0.03 0.00 1 w_ed [38] 0.0% 0.03 0.00 1 f_exit |
The report shows information for each function. The function names are show in the right-hand column of the report. The remaining columns are described below:
The index column provides an index number for reference.
The %time column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, in the third row, 99.9% of the time was spent in linp and all functions listed below it.
The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in linp, but 3.12 seconds were spent in dgefa.
The descendents columns shows how much time, in seconds, was spent in callees of the function. For example, in the third row, 115.05 seconds were spent in callees of linp.
The total column provides information on the number of cycles out of the total spent on the function.
Note: Many functions shown here have only one or two "hits." The data for those functions is not statistically significant. |
This section takes you through the steps to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine line, and function in your program. For more information on pcsamp, see the "pcsamp Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -pcsamp linpackup |
This starts the experiment.
Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, pcsamp, and the experiment ID. In this example, the filename is linpackup.pcsamp.m17576.
... norm. resid resid machep x(1) x(n) 5.35882395E+00 7.13873405E-13 2.22044605E-16 1.00000000E+00 1.00000000E+00 ... |
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > pcsamp.results |
------------------------------------------------------------------------------- Profile listing generated Sun May 19 18:38:50 1996 with: prof linpackup.pcsamp.m17576 ------------------------------------------------------------------------------- samples time CPU FPU Clock N-cpu S-interval Countsize 5421 54s R8000 R8010 75.0MHz 1 10.0ms 2(bytes) Each sample covers 4 bytes for every 10.0ms ( 0.02% of 54.2100s) ------------------------------------------------------------------------------- -p[rocedures] using pc-sampling. Sorted in descending order by the number of samples in each procedure. Unexecuted procedures are excluded. ------------------------------------------------------------------------------- samples time(%) cum time(%) procedure (dso:file) 5064 51s( 93.4) 51s( 93.4) daxpy (linpackup:linpackup.f) 240 2.4s( 4.4) 53s( 97.8) matgen (linpackup:linpackup.f) 76 0.76s( 1.4) 54s( 99.2) dgefa (linpackup:linpackup.f) 19 0.19s( 0.4) 54s( 99.6) dscal (linpackup:linpackup.f) 17 0.17s( 0.3) 54s( 99.9) idamax (linpackup:linpackup.f) 4 0.04s( 0.1) 54s(100.0) dmxpy (linpackup:linpackup.f) 1 0.01s( 0.0) 54s(100.0) _ioctl (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M4/sys/ioctl.s) 5421 54s(100.0) 54s(100.0) TOTAL |
The report has the following columns:
The samples column shows how many samples were taken when the process was executing in the function.
The time(%) column shows the amount of time, and the percentage of that time over the total time that was spent in the function.
The cum time(%) column shows how much time has been spent in the function up to and including the procedure in the list.
The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.
Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines). |
This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see the"ideal Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -dsc_hwc linpackup |
This starts the experiment. Output from linpackup and from ssrun will be printed to stdout. A data file is also generated. The name consists of the process name (linpackup), the experiment type, dsc_hwc, and the experiment ID. In this example, the filename is linpackup.dsc_hwc.m6180.
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > dsc_hwc.results |
Output similar to the following is generated:
------------------------------------------------------------------------------- Profile listing generated Sun May 19 18:20:14 1996 with: prof linpackup.dsc_hwc.m6180 ------------------------------------------------------------------------------- Counter : Sec cache D misses Counter overflow value: 131 Total numer of ovfls : 2737 CPU : R10000 FPU : R10010 Clock : 196.0MHz Number of CPUs : 1 ------------------------------------------------------------------------------- -p[rocedures] using counter overflow. Sorted in descending order by the number of overflows in each procedure. Unexecuted procedures are excluded. ------------------------------------------------------------------------------- overflows(%) cum overflows(%) procedure (dso:file) 2133( 77.9) 2133( 77.9) DAXPY (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 307( 11.2) 2440( 89.1) MATGEN (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 275( 10.0) 2715( 99.2) DGEFA (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 11( 0.4) 2726( 99.6) IDAMAX (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 3( 0.1) 2729( 99.7) DMXPY (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 3( 0.1) 2732( 99.8) DGESL (linpackup:/usr/demos/ SpeedShop/linpack64/linpackup.f) 1( 0.0) 2733( 99.9) memset (/usr/lib64/libc.so.1: /work/irix/lib/libc/libc_64_M4/strings/bzero.s) 1( 0.0) 2734( 99.9) fflush (/usr/lib64/libc.so.1: /work/irix/lib/libc/libc_64_M4/stdio/flush.c) 1( 0.0) 2735( 99.9) _mixed_dtoa (/usr/lib64/libc.so.1: /work/irix/lib/libc/libc_64_M4/math/mixed_dtoa.c) 1( 0.0) 2736(100.0) wsfe (/usr/lib64/libftn.so: /work/cmplrs/libI77/wsfe.c) 1( 0.0) 2737(100.0) f_exit (/usr/lib64/libftn.so: /work/cmplrs/libI77/close.c) 2737(100.0) TOTAL |
The report has the following columns:
The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.
The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the MATGEN function shows 307 overflows, but the cumulative number of overflows is 2440.
The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure.
This section takes you through the steps to perform an ideal experiment. For more information on collecting ideal-time data, and basic block counting, see the "ideal Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -ideal linpackup |
This starts the experiment. First the executable and libraries are instrumented using pixie. This entails making copies of the libraries and executables, giving them an extension that depends on the ABI, and inserting information into the copies. The extension is .pixie for the executable, .pix32 for all 32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.
Output from linpackup and from ssrun is printed to stdout as shown in the example below. A data file is also generated. The name consists of the process name (linpackup), the experiment type, ideal, and the experiment ID. In this example, the filename is linpackup.ideal.n17580.
Beginning libraries /usr/lib32/libssrt.so /usr/lib32/libss.so /usr/lib32/libftn.so /usr/lib32/libm.so /usr/lib32/libc.so.1 Ending libraries, beginning "linpackup" ... |
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > ideal.results |
This command redirects output to a file called ideal.results. The file should contain results that look something like the following:
Prof run at: Sun May 19 18:46:10 1996 Command line: prof linpackup.ideal.m17580 5722510379: Total number of cycles 76.30014s: Total execution time 4906763725: Total number of instructions executed 1.166: Ratio of cycles / instruction 75: Clock rate in MHz R8000: Target processor modelled --------------------------------------------------------- Procedures sorted in descending order of cycles executed. Unexecuted procedures are not listed. Procedures beginning with *DF* are dummy functions and represent init, fini and stub sections. --------------------------------------------------------- cycles(%) cum % secs instrns calls procedure(dso:file) 5404032607(94.43) 94.43 72.05 4639092022 772633 daxpy(linpackup.pixie: linpackup.f) 207582228( 3.63) 98.06 2.77 157405518 18 matgen(linpackup.pixie: linpackup.f) 67844858( 1.19) 99.25 0.90 72325769 17 dgefa(linpackup.pixie: linpackup.f) 19920277( 0.35) 99.60 0.27 17658342 5083 dscal(linpackup.pixie: linpackup.f) 18115251( 0.32) 99.91 0.24 15675343 5083 idamax(linpackup.pixie: linpackup.f) 4053920( 0.07) 99.98 0.05 3605124 1 dmxpy(linpackup.pixie: linpackup.f) 786709( 0.01) 100.00 0.01 695776 17 dgesl(linpackup.pixie: linpackup.f) 41357( 0.00) 100.00 0.00 83826 1116 __flsbuf(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M4/stdio/_flsbuf.c) 30294( 0.00) 100.00 0.00 29094 1 linp(linpackup.pixie: linpackup.f) 17330( 0.00) 100.00 0.00 39823 867 x_putc(./libftn.so.pixn32: /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c) 12294( 0.00) 100.00 0.00 25795 28 x_wEND(./libftn.so.pixn32: /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wsfe.c) 10620( 0.00) 100.00 0.00 14340 53 wrt_E(./libftn.so.pixn32: /lv7/mtibuild/nodebug/workarea/mongoose/libI77/wrtfmt.c) 9617( 0.00) 100.00 0.00 14889 71 do_fio64_mp (./libftn.so.pixn32:/lv7/mtibuild/nodebug/workarea/mongoose/libI77/ fmt.c) 4940( 0.00) 100.00 0.00 7917 380 |
The report has the following columns:
The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 5404032607 cycles, or 94.43% of cycles were spent in the daxpy procedure.
The cum% column shows the cumulative percentage of cycles. For example, 98.06% of all cycles were spent between the top two functions in the listing: daxpy and matgen.
The secs column shows the number of seconds spent in the procedure. For example, 72.05 seconds were spent in the daxpy procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention.)
The instrns column shows the number of instructions executed for a procedure. For example, there were 157405518 instructions devoted to the matgen procedure.
The calls column reports the number of calls to the procedure. For example, there were 18 calls to the matgen procedure.
The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure daxpy in the file linpackup.f in the linpackup executable.
This concludes the tutorial.