This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a C program, and covers these topics:
"Tutorial Overview" introduces the sample program and explains the different scenarios in which it will be used.
"Tutorial Setup" steps you through the necessary setup for running the experiment.
"Analyzing Performance Data" steps you through five different experiments, discussing first how to do the experiments, then how to interpret the results.
Note: Because of inherent differences between systems and because of concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same. |
This tutorial uses a sample program called generic. There are three versions of the program:
generic directory | Contains files for the n32-bit ABI |
generic64 directory | Contains files for the 64-bit ABI |
generico32 directory | Contains files for the (old) 32-bit ABI |
When you work with the tutorial, choose the version of generic most appropriate for your system. A good guideline is to choose the version that corresponds to the way you expect to develop your programs.
This tutorial was written and tested using the version of generic in the generic directory.
generic was designed as a test and demonstration application. It contains code to run scenarios that each test a different area of SpeedShop. The version of generic used in this tutorial performs scenarios that:
build a linked list of structures
use a lot of usertime
scan a directory and run the stat command on each file
perform file I/O
generate a number of floating point exceptions
link and call a routine in a DSO
Output from the program looks like the following:
0:00:00.000 ======== (24479) Begin script Fri 03 May 96 10:17:13. begin script `ll.u.d.i.f.dso' 0:00:00.032 ======== (24479) start of linklist Sun 03 May 96 10:17:13. linklist completed. 0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96 10:17:13. usertime completed. 0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96 10:17:24. dirstat of /usr/include completed, 242 files. 0:00:10.844 ======== (24479) start of iofile Fri 03 May 96 10:17:24. iofile on /unix completed, 4221656 chars. 0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96 10:17:24. fpetraps completed. 0:00:11.038 ======== (24479) start of libdso Fri 03 May 96 10:17:24. dlslave_init executed dlslave_routine executed slaveusertime completed, x = 5000000.000000. libdso: dynamic routine returned 13 end of script `u.d.i.f.dso' 0:00:11.930 ======== (24479) End script Fri 03 May 96 10:17:25. |
Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.
Change to the /usr/demos/SpeedShop directory.
Copy the appropriate generic directory and its contents to a directory where you have write permission:
cp -r generic_version your_dir |
Change to the directory you just created:
cd your_dir/generic_version |
Compile the program by entering
make all |
This provides an executable for the experiment.
Set the library path so that the program can find shared libraries in the generic directory:
setenv LD_LIBRARY_PATH your_dir/generic_version |
This section explains how to run the following experiments on the generic program and generate and interpret the results:
You can follow the tutorial from start to finish, or you can choose the experiment(s) you want to perform.
This section explains how to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the "usertime Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -usertime generic |
This starts the experiment. Output from generic and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (generic), the experiment type, usertime, and the experiment ID. In this example, the filename is generic.usertime.m24479.
0:00:00.000 ======== (24479) Begin script Fri 03 May 96 10:17:13. begin script `ll.u.d.i.f.dso' 0:00:00.032 ======== (24479) start of linklist Sun 03 May 96 10:17:13. linklist completed. 0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96 10:17:13. usertime completed. 0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96 10:17:24. dirstat of /usr/include completed, 242 files. 0:00:10.844 ======== (24479) start of iofile Fri 03 May 96 10:17:24. iofile on /unix completed, 4221656 chars. 0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96 10:17:24. fpetraps completed. 0:00:11.038 ======== (24479) start of libdso Fri 03 May 96 10:17:24. dlslave_init executed dlslave_routine executed slaveusertime completed, x = 5000000.000000. libdso: dynamic routine returned 13 end of script `u.d.i.f.dso' 0:00:11.930 ======== (24479) End script Fri 03 May 96 10:17:25. |
To generate a report on the data collected, enter at the command line:
prof your_output_file_name > usertime.results |
prof prints results to stdout. Note that the prof output below is a partial listing.
----------------------------------------------------------------------- Profile listing generated Mon Nov 18 11:43:45 1996 with: prof generic.usertime.m24479 ----------------------------------------------------------------------- Total Time (secs) : 43.98 Total Samples : 1466 Stack backtrace failed: 1 Sample interval (ms) : 30 CPU : R4600 FPU : R4600 Clock : 100.0MHz Number of CPUs : 1 ----------------------------------------------------------------------- index %Samples self descendents total name [1] 99.9% 0.00 43.95 1465 __start [2] 99.9% 0.00 43.95 1465 main [3] 99.9% 0.00 43.95 1465 Scriptstring [4] 94.5% 0.00 41.55 1385 usrtime [5] 94.5% 41.52 0.03 1385 anneal [6] 3.0% 0.00 1.32 44 libdso [7] 3.0% 0.00 1.32 44 dlslave_routine [8] 3.0% 1.32 0.00 44 slaveusrtime [9] 2.2% 0.00 0.96 32 iofile [10] 2.2% 0.00 0.96 32 fread [11] 2.1% 0.00 0.93 31 __filbuf [12] 2.1% 0.93 0.00 31 _read [13] 0.2% 0.00 0.09 3 dirstat [14] 0.1% 0.00 0.06 2 _stat [15] 0.1% 0.06 0.00 2 _xstat [16] 0.1% 0.00 0.03 1 linklist [17] 0.1% 0.00 0.03 1 _malloc [18] 0.1% 0.00 0.03 1 _smalloc [19] 0.1% 0.00 0.03 1 __malloc [20] 0.1% 0.00 0.03 1 init2da [21] 0.1% 0.03 0.00 1 __sinf [22] 0.1% 0.00 0.03 1 _readdir [23] 0.1% 0.03 0.00 1 _ngetdents [24] 0.1% 0.03 0.00 1 memcpy |
The report shows information for each function. The remaining columns are described below:
The index column provides an index number for reference.
The %Samples column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, 99.9% of the time was spent in Scriptstring() and all functions listed below it.
The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in __start(), but 0.03 of a second was spent in memcpy().
The descendents columns shows how much time, in seconds, was spent in callees of the function. For example, 43.95 seconds were spent in the callees of main().
The total column provides information on the number of samples of the function and all of its descendants.
This section explains how to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine instruction, and function in your program. For more information on pcsamp, see the "pcsamp Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun –fpcsamp generic |
This starts the experiment. The f option is used with pcsamp for this program because the program runs quickly and does not gather much data using the default pcsamp experiment. Output from generic and from ssrun is printed to stdout as shown in the example below.
A data file is also generated. The name consists of the process name (generic), the experiment type, fpcsamp, and the experiment ID. In this example, the filename is generic.fpcsamp.m14480.
0:00:00.000 ======== (14480) Begin script Sun 19 May 96 17:18:33. begin script `ll.u.d.i.f.dso' 0:00:00.074 ======== (14480) start of linklist Sun 19 May 96 17:18:33. linklist completed. 0:00:00.085 ======== (14480) start of usrtime Sun 19 May 96 17:18:33. usertime completed. 0:00:17.985 ======== (14480) start of dirstat Sun 19 May 96 17:18:51. dirstat of /usr/include completed, 230 files. 0:00:18.008 ======== (14480) start of iofile Sun 19 May 96 17:18:51. iofile on /unix completed, 4221656 chars. 0:00:20.321 ======== (14480) start of fpetraps Sun 19 May 96 17:18:54. fpetraps completed. 0:00:20.323 ======== (14480) start of libdso Sun 19 May 96 17:18:54. dlslave_init executed dlslave_routine executed slaveusertime completed, x = 5000000.000000. libdso: dynamic routine returned 13 end of script `ll.u.d.i.f.dso' 0:00:21.394 ======== (14480) End script Sun 19 May 96 17:18:55. |
To generate a report on the data collected, and to redirect the output to a file, enter the following at the command line:
prof your_output_file_name > pcsamp.results |
Output similar to the following is generated:
------------------------------------------------------------------------------- Profile listing generated Sun May 19 17:21:27 1996 with: prof generic.fpcsamp.m14480 ------------------------------------------------------------------------------- samples time CPU FPU Clock N-cpu S-interval Countsize 19077 19s R4000 R4010 150.0MHz 1 1.0ms 2(bytes) Each sample covers 4 bytes for every 1.0ms ( 0.01% of 19.0770s) ------------------------------------------------------------------------------- -p[rocedures] using pc-sampling. Sorted in descending order by the number of samples in each procedure. Unexecuted procedures are excluded. ------------------------------------------------------------------------------- samples time(%) cum time(%) procedure (dso:file) 17794 18s( 93.3) 18s( 93.3) anneal (/usr/demos/SpeedShop/ generic/generic:/usr/demos/SpeedShop/generic/generic.c) 1046 1s( 5.5) 19s( 98.8) slaveusrtime (/usr/demos/SpeedShop/generic/ dlslave.so:/usr/demos/SpeedShop/generic/dlslave.c) 163 0.16s( 0.9) 19s( 99.6) _read (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/sys/read.s) 34 0.034s( 0.2) 19s( 99.8) memcpy (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s) 20 0.02s( 0.1) 19s( 99.9) _xstat (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/sys/xstat.s) 8 0.008s( 0.0) 19s( 99.9) fread (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/stdio/fread.c) 3 0.003s( 0.0) 19s(100.0) iofile (/usr/demos/SpeedShop/generic/ generic:/usr/demos/SpeedShop/generic/generic.c) 3 0.003s( 0.0) 19s(100.0) _doprnt (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/print/doprnt.c) 1 0.001s( 0.0) 19s(100.0) __sinf (/usr/lib32/libm.so: /work/cmplrs/libm/fsin.c) 1 0.001s( 0.0) 19s(100.0) init2da (/usr/demos/SpeedShop/generic/ generic:/usr/demos/SpeedShop/generic/generic.c) 1 0.001s( 0.0) 19s(100.0) _write (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/sys/write.s) 1 0.001s( 0.0) 19s(100.0) _drand48 (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/math/drand48.c) 1 0.001s( 0.0) 19s(100.0) _morecore (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/gen/malloc.c) 1 0.001s( 0.0) 19s(100.0) fwrite (/usr/lib32/libc.so.1: /work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c) 19077 19s(100.0) 19s(100.0) TOTAL |
The report has the following columns:
The samples columns shows how many samples were taken when the process was executing in the function.
The time(%) column shows the amount of time, and the percentage of that time over the total time that was spent in the function.
The cum time(%) column shows how much time has been spent in the function up to and including the procedure in the list.
The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure anneal() in the file generic.c in the generic executable.
Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines). |
This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see "ideal Experiment" in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -dsc_hwc generic |
This starts the experiment. Output from generic and from ssrun is printed to stdout.
A data file is also generated. The name consists of the process name (generic), the experiment type, dsc_hwc, and the experiment ID. In this example, the filename is generic.dsc_hwc.m5999.
To generate a report on the data collected and redirect the output to a file, enter the following at the command line:
prof your_output_file_name > dsc_hwc.results |
The report should look similar to the following partial listing:
------------------------------------------------------------------------------- Profile listing generated Thu Jun 5 13:23:14 1997 with: prof generic.dsc_hwc.m5999 ------------------------------------------------------------------------------- Counter : Sec cache D misses Counter overflow value: 131 Total number of ovfls : 10 CPU : R10000 FPU : R10010 Clock : 196.0MHz Number of CPUs : 1 ------------------------------------------------------------------------------- -p[rocedures] using counter overflow. Sorted in descending order by the number of overflows in each procedure. Unexecuted or inlined procedures are excluded. ------------------------------------------------------------------------------- overflows(%) cum overflows(%) procedure (dso:file) 4( 40.0) 4( 40.0) memcpy (/usr/lib32/libc.so.1:bcopy.s) 2( 20.0) 6( 60.0) anneal (generic:generic.c) 2( 20.0) 8( 80.0) init2da (generic:generic.c) 1( 10.0) 9( 90.0) _smalloc (/usr/lib32/libc.so.1:malloc.c) 1( 10.0) 10(100.0) _doprnt (/usr/lib32/libc.so.1:doprnt.c) 10(100.0) TOTAL |
The columns in the report provide the following information:
The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.
The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the anneal() function shows two overflows, but the cumulative number of overflows is six. Two overflows come from anneal() and four come from memcpy().
The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure.
This section takes you through the steps to perform an ideal experiment. For more information on ideal, see the "ideal Experiment" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -ideal generic |
This starts the experiment. First the executable and libraries are instrumented using pixie. This entails making copies of the libraries and executables, giving the copies an extension that depends on the ABI, and inserting information into the copies. The extension is .pixie for the executable, .pix32 for all 32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.
Output from generic and from ssrun is printed to stdout. A data file is also generated. The name consists of the process name (generic), the experiment type, ideal, and the experiment ID. In this example, the filename is generic.ideal.m14517, and the output to stdout looks like the following:
Beginning libraries /usr/lib32/libssrt.so /usr/lib32/libss.so /usr/lib32/libm.so /usr/lib32/libc.so.1 Ending libraries, beginning "generic" ... Beginning libraries Ending libraries, beginning "dlslave.so" ... |
The output section that starts with Beginning libraries and ends with Ending libraries, beginning `generic' tells you that ssrun is instrumenting first the libraries listed in the executable, then the generic executable itself. The section that starts Beginning libraries and ends with Ending libraries, beginning `dlslave.so' is added when the DSO dlslave.so is dlopen'd.
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > ideal.results |
This command redirects output to a file called ideal.results. The file contains results that look similar to the following partial listing:
Prof run at: Sun May 19 17:49:10 1996 Command line: prof generic.ideal.m14517 2662778531: Total number of cycles 17.75186s: Total execution time 1875323907: Total number of instructions executed 1.420: Ratio of cycles / instruction 150: Clock rate in MHz R4000: Target processor modelled --------------------------------------------------------- Procedures sorted in descending order of cycles executed. Unexecuted procedures are not listed. Procedures beginning with *DF* are dummy functions and represent init, fini and stub sections. --------------------------------------------------------- cycles(%) cum % secs instrns calls procedure(dso:file) 2524610038(94.81) 94.81 16.83 1797940023 1 anneal(generic.pixie:/usr /demos/SpeedShop/generic/generic.c) 135001332( 5.07) 99.88 0.90 75000822 1 slaveusrtime (./dlslave.so.pixn32:/usr/demos/SpeedShop/generic/dlslave.c) 1593422( 0.06) 99.94 0.01 1378737 4382 memcpy(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s) 735797( 0.03) 99.97 0.00 506627 4123 fread(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/stdio/fread.c) 187200( 0.01) 99.98 0.00 124800 1600 next(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/math/drand48.c) 136116( 0.01) 99.98 0.00 82498 1 iofile(generic.pixie: /usr/demos/SpeedShop/generic/generic.c) 91200( 0.00) 99.98 0.00 62400 1600 _drand48(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/math/drand48.c) 85497( 0.00) 99.99 0.00 56518 1 init2da(generic.pixie: /usr/demos/SpeedShop/generic/generic.c) 74095( 0.00) 99.99 0.00 28063 628 __sinf(./libm.so.pixn32: /work/cmplrs/libm/fsin.c) 56192( 0.00) 99.99 0.00 9360 16 offtime(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/gen/time_comm.c) 51431( 0.00) 99.99 0.00 36405 35 _doprnt(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/print/doprnt.c) 27951( 0.00) 100.00 0.00 19670 259 __filbuf(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/stdio/_filbuf.c) 21392( 0.00) 100.00 0.00 10136 58 fwrite(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c) 12744( 0.00) 100.00 0.00 9497 231 _readdir(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/gen/readdir.c) 9960( 0.00) 100.00 0.00 7536 96 _xflsbuf(./libc.so.1.pixn32: /work/irix/lib/libc/libc_n32_M3/stdio/flush.c) 7211( 0.00) 100.00 0.00 3959 1 dirstat(generic.pixie: /usr/demos/SpeedShop/generic/generic.c) |
The columns in the report provide the following information
The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 2524610038 cycles, or 94.81% of cycles were spent in the anneal() procedure.
The cum% column shows the cumulative percentage of cycles. For example, 99.88% of all cycles were spent between the top two functions in the listing: anneal() and slaveusrtime().
The secs column shows the number of seconds spent in the procedure. For example, 16.83 seconds were spent in the anneal() procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention).
The instrns column shows the number of instructions executed for a procedure. For example, there were 1797940023 instructions devoted to the anneal() procedure.
The calls column reports the number of calls to the procedure. For example, there was just one call to the anneal() procedure.
The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure anneal() in the file generic.c in the generic executable.
This section takes you through the steps to perform an fpe trace. For more information on the fpe trace, see the "fpe Trace" section in Chapter 4, "Experiment Types."
From the command line, enter
ssrun -fpe generic |
This starts the experiment. Output from generic and from ssrun is printed to stdout. A data file with a name generated by concatenating the process name, generic, the experiment type, fpe, and the experiment ID is also generated. In this example, the filename is generic.fpe.m18823.
To generate a report on the data collected, enter the following at the command line:
prof your_output_file_name > fpe.results |
The report should look similar to the following partial listing:
----------------------------------------------------------------------- Profile listing generated Mon Nov 18 11:46:33 1996 with: prof generic.fpe.m18823 ----------------------------------------------------------------------- Total FPEs : 4 Stack backtrace failed: 0 CPU : R4600 FPU : R4600 Clock : 100.0MHz Number of CPUs : 1 ----------------------------------------------------------------------- index %FPEs self descendents total name [1] 100.0% 0 4 4 __start [2] 100.0% 0 4 4 main [3] 100.0% 0 4 4 Scriptstring [4] 100.0% 4 0 4 fpetraps |
The report shows information for each function. The function names are shown in the right-hand column of the report. The remaining columns are described below:
The index column provides an index number for reference.
The %FPEs column shows the percentage of the total number of floating point exceptions that were found in the function.
The self column shows how many floating point exceptions were found in the function. For example, four floating point exceptions were found in fpetraps.
The descendents columns shows how many floating point exceptions were found in the descendents of the function. For example, four floating point exceptions were found in the descendents of main().
The total column provides information on the number of floating point exceptions out of the total that were found.
This concludes the tutorial.