Chapter 2. Tutorial for C Users

Chapter 2. Tutorial for C Users
Prev		Next

This chapter presents a tutorial for using the SpeedShop tools to gather and analyze performance data in a C program, and covers these topics:

"Tutorial Overview" introduces the sample program and explains the different scenarios in which it will be used.
"Tutorial Setup" steps you through the necessary setup for running the experiment.
"Analyzing Performance Data" steps you through five different experiments, discussing first how to do the experiments, then how to interpret the results.

Note: Because of inherent differences between systems and because of concurrent processes that may be running on your system, your experiment will produce different results from the one in this tutorial. However, the basic form of the results should be the same.

Tutorial Overview

This tutorial uses a sample program called generic. There are three versions of the program:

`generic` directory	Contains files for the n32-bit ABI
`generic64` directory	Contains files for the 64-bit ABI
`generico32` directory	Contains files for the (old) 32-bit ABI

When you work with the tutorial, choose the version of generic most appropriate for your system. A good guideline is to choose the version that corresponds to the way you expect to develop your programs.

This tutorial was written and tested using the version of generic in the generic directory.

Contents of the generic Program

generic was designed as a test and demonstration application. It contains code to run scenarios that each test a different area of SpeedShop. The version of generic used in this tutorial performs scenarios that:

build a linked list of structures
use a lot of usertime
scan a directory and run the stat command on each file
perform file I/O
generate a number of floating point exceptions
link and call a routine in a DSO

Output from the generic Program

Output from the program looks like the following:

0:00:00.000 ======== (24479) Begin script Fri 03 May 96  10:17:13.
begin script `ll.u.d.i.f.dso'
0:00:00.032 ======== (24479) start of linklist Sun 03 May 96  10:17:13.
linklist completed.
0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96  10:17:13.
usertime completed.
0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96  10:17:24.
dirstat of /usr/include completed, 242 files.
0:00:10.844 ======== (24479) start of iofile Fri 03 May 96  10:17:24.
iofile on /unix completed, 4221656 chars.
0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96  10:17:24.
fpetraps completed.
0:00:11.038 ======== (24479) start of libdso Fri  03 May 96  10:17:24.
dlslave_init executed
dlslave_routine executed
        slaveusertime completed, x = 5000000.000000.
        libdso: dynamic routine returned 13
        end of script `u.d.i.f.dso'
0:00:11.930 ======== (24479) End script Fri  03 May 96  10:17:25.

Use of the generic Program

The tutorial shows you how to perform the following experiments using ssrun, and how to interpret experiment-specific reports generated by prof:

usertime
pcsamp
dsc_hwc
ideal
fpe

Tutorial Setup

Copy the program to a directory where you have write permission and compile it so that you can use it in the tutorial.

Change to the /usr/demos/SpeedShop directory.
Copy the appropriate generic directory and its contents to a directory where you have write permission:
cp -r generic_version your_dir
Change to the directory you just created:
cd your_dir/generic_version
Compile the program by entering
make all
This provides an executable for the experiment.
Set the library path so that the program can find shared libraries in the generic directory:
setenv LD_LIBRARY_PATH your_dir/generic_version

Analyzing Performance Data

This section explains how to run the following experiments on the generic program and generate and interpret the results:

You can follow the tutorial from start to finish, or you can choose the experiment(s) you want to perform.

usertime Experiment

This section explains how to perform a usertime experiment. The usertime experiment allows you to gather data on the amount of user time spent in each function in your program. For more information on usertime, see the "usertime Experiment" section in Chapter 4, "Experiment Types."

Performing a usertime Experiment

From the command line, enter

ssrun -usertime generic

This starts the experiment. Output from generic and from ssrun is printed to stdout, as shown in the following example. A data file is also generated. The name consists of the process name (generic), the experiment type, usertime, and the experiment ID. In this example, the filename is generic.usertime.m24479.

0:00:00.000 ======== (24479) Begin script Fri 03 May 96  10:17:13.
begin script `ll.u.d.i.f.dso'
0:00:00.032 ======== (24479) start of linklist Sun 03 May 96  10:17:13.
linklist completed.
0:00:00.002 ======== (24479) start of usrtime Fri 03 May 96  10:17:13.
usertime completed.
0:00:10.824 ======== (24479) start of dirstat Fri 03 May 96  10:17:24.
dirstat of /usr/include completed, 242 files.
0:00:10.844 ======== (24479) start of iofile Fri 03 May 96  10:17:24.
iofile on /unix completed, 4221656 chars.
0:00:11.036 ======== (24479) start of fpetraps Fri 03 May 96  10:17:24.
fpetraps completed.
0:00:11.038 ======== (24479) start of libdso Fri  03 May 96  10:17:24.
dlslave_init executed
dlslave_routine executed
        slaveusertime completed, x = 5000000.000000.
        libdso: dynamic routine returned 13
        end of script `u.d.i.f.dso'
0:00:11.930 ======== (24479) End script Fri  03 May 96  10:17:25.

Generating a Report

To generate a report on the data collected, enter at the command line:

prof your_output_file_name > usertime.results

prof prints results to stdout. Note that the prof output below is a partial listing.

-----------------------------------------------------------------------
Profile listing generated Mon Nov 18 11:43:45 1996
    with: prof generic.usertime.m24479 
-----------------------------------------------------------------------
Total Time (secs)     : 43.98
Total Samples         : 1466
Stack backtrace failed: 1
Sample interval (ms)  : 30
CPU                   : R4600
FPU                   : R4600
Clock                 : 100.0MHz
Number of CPUs        : 1
-----------------------------------------------------------------------
index  %Samples  self descendents  total          name
[1]     99.9%    0.00       43.95    1465         __start
[2]     99.9%    0.00       43.95    1465         main
[3]     99.9%    0.00       43.95    1465         Scriptstring
[4]     94.5%    0.00       41.55    1385          usrtime
[5]     94.5%   41.52        0.03    1385          anneal
[6]      3.0%    0.00        1.32      44          libdso
[7]      3.0%    0.00        1.32      44          dlslave_routine
[8]      3.0%    1.32        0.00      44          slaveusrtime
[9]      2.2%    0.00        0.96      32          iofile
[10]     2.2%    0.00        0.96      32          fread
[11]     2.1%    0.00        0.93      31          __filbuf
[12]     2.1%    0.93        0.00      31          _read
[13]     0.2%    0.00        0.09       3          dirstat
[14]     0.1%    0.00        0.06       2          _stat
[15]     0.1%    0.06        0.00       2          _xstat
[16]     0.1%    0.00        0.03       1          linklist
[17]     0.1%    0.00        0.03       1          _malloc
[18]     0.1%    0.00        0.03       1          _smalloc
[19]     0.1%    0.00        0.03       1          __malloc
[20]     0.1%    0.00        0.03       1          init2da
[21]     0.1%    0.03        0.00       1          __sinf
[22]     0.1%    0.00        0.03       1          _readdir
[23]     0.1%    0.03        0.00       1          _ngetdents
[24]     0.1%    0.03        0.00       1          memcpy

Analyzing the Report

The report shows information for each function. The remaining columns are described below:

The index column provides an index number for reference.
The %Samples column shows the cumulative percentage of inclusive time spent in each function and its descendents. For example, 99.9% of the time was spent in Scriptstring() and all functions listed below it.
The self column shows how much time, in seconds, was spent in the function itself (exclusive time). For example, less than one hundredth of a second was spent in __start(), but 0.03 of a second was spent in memcpy().
The descendents columns shows how much time, in seconds, was spent in callees of the function. For example, 43.95 seconds were spent in the callees of main().
The total column provides information on the number of samples of the function and all of its descendants.

pcsamp Experiment

This section explains how to perform a pcsamp experiment. The pcsamp experiment allows you to gather information on actual CPU time for each source code line, machine instruction, and function in your program. For more information on pcsamp, see the "pcsamp Experiment" section in Chapter 4, "Experiment Types."

From the command line, enter

ssrun –fpcsamp generic

This starts the experiment. The f option is used with pcsamp for this program because the program runs quickly and does not gather much data using the default pcsamp experiment. Output from generic and from ssrun is printed to stdout as shown in the example below.

A data file is also generated. The name consists of the process name (generic), the experiment type, fpcsamp, and the experiment ID. In this example, the filename is generic.fpcsamp.m14480.

0:00:00.000 ======== (14480)             Begin script Sun  19 May 96  17:18:33.
        begin script `ll.u.d.i.f.dso'
0:00:00.074 ======== (14480)        start of linklist Sun  19 May 96  17:18:33.
        linklist completed.
0:00:00.085 ======== (14480)         start of usrtime Sun  19 May 96  17:18:33.
        usertime completed.
0:00:17.985 ======== (14480)         start of dirstat Sun  19 May 96  17:18:51.
        dirstat of /usr/include completed, 230 files.
0:00:18.008 ======== (14480)          start of iofile Sun  19 May 96  17:18:51.
        iofile on /unix completed, 4221656 chars.
0:00:20.321 ======== (14480)        start of fpetraps Sun  19 May 96  17:18:54.
        fpetraps completed.
0:00:20.323 ======== (14480)          start of libdso Sun  19 May 96  17:18:54.
dlslave_init executed
dlslave_routine executed
        slaveusertime completed, x = 5000000.000000.
        libdso: dynamic routine returned 13
        end of script `ll.u.d.i.f.dso'
0:00:21.394 ======== (14480)               End script Sun  19 May 96  17:18:55.

Generating a Report

To generate a report on the data collected, and to redirect the output to a file, enter the following at the command line:

prof your_output_file_name > pcsamp.results

Output similar to the following is generated:

-------------------------------------------------------------------------------
Profile listing generated Sun May 19 17:21:27 1996
    with:       prof generic.fpcsamp.m14480 
-------------------------------------------------------------------------------
samples   time    CPU    FPU   Clock   N-cpu  S-interval Countsize
  19077    19s  R4000  R4010 150.0MHz   1      1.0ms     2(bytes)
 
Each sample covers 4 bytes for every  1.0ms ( 0.01% of 19.0770s)
-------------------------------------------------------------------------------
  -p[rocedures] using pc-sampling.
  Sorted in descending order by the number of samples in each procedure.
  Unexecuted procedures are excluded.
-------------------------------------------------------------------------------
samples   time(%)      cum time(%)      procedure (dso:file)
 
  17794    18s( 93.3)   18s( 93.3)         anneal (/usr/demos/SpeedShop/
          generic/generic:/usr/demos/SpeedShop/generic/generic.c)
   1046     1s(  5.5)   19s( 98.8)   slaveusrtime (/usr/demos/SpeedShop/generic/
          dlslave.so:/usr/demos/SpeedShop/generic/dlslave.c)
    163  0.16s(  0.9)   19s( 99.6)          _read (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/sys/read.s)
     34 0.034s(  0.2)   19s( 99.8)         memcpy (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
     20  0.02s(  0.1)   19s( 99.9)         _xstat (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/sys/xstat.s)
      8 0.008s(  0.0)   19s( 99.9)          fread (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
      3 0.003s(  0.0)   19s(100.0)         iofile (/usr/demos/SpeedShop/generic/
          generic:/usr/demos/SpeedShop/generic/generic.c)
      3 0.003s(  0.0)   19s(100.0)        _doprnt (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/print/doprnt.c)
      1 0.001s(  0.0)   19s(100.0)         __sinf (/usr/lib32/libm.so:
          /work/cmplrs/libm/fsin.c)
      1 0.001s(  0.0)   19s(100.0)        init2da (/usr/demos/SpeedShop/generic/
          generic:/usr/demos/SpeedShop/generic/generic.c)
      1 0.001s(  0.0)   19s(100.0)         _write (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/sys/write.s)
      1 0.001s(  0.0)   19s(100.0)       _drand48 (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
      1 0.001s(  0.0)   19s(100.0)      _morecore (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/gen/malloc.c)
      1 0.001s(  0.0)   19s(100.0)         fwrite (/usr/lib32/libc.so.1:
          /work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c)
 
  19077    19s(100.0)   19s(100.0)          TOTAL

Analyzing the Report

The report has the following columns:

The samples columns shows how many samples were taken when the process was executing in the function.
The time(%) column shows the amount of time, and the percentage of that time over the total time that was spent in the function.
The cum time(%) column shows how much time has been spent in the function up to and including the procedure in the list.
The procedure (dso:file) column lists the procedure, its DSO name and file name. For example, the first line reports statistics for the procedure anneal() in the file generic.c in the generic executable.

Hardware Counter Experiment

Note: This experiment can be performed only on systems that have built-in hardware counters (the R10000 class of machines).

This section takes you through the steps to perform a hardware counter experiment. There are a number of hardware counter experiments, but this tutorial describes the steps involved in performing the dsc_hwc experiment. This experiment allows you to capture information about secondary data cache misses. For more information on hardware counter experiments, see "ideal Experiment" in Chapter 4, "Experiment Types."

Performing a Hardware Counter Experiment

From the command line, enter

ssrun -dsc_hwc generic

This starts the experiment. Output from generic and from ssrun is printed to stdout.
A data file is also generated. The name consists of the process name (generic), the experiment type, dsc_hwc, and the experiment ID. In this example, the filename is generic.dsc_hwc.m5999.

Generating a Report

To generate a report on the data collected and redirect the output to a file, enter the following at the command line:

prof your_output_file_name > dsc_hwc.results

The report should look similar to the following partial listing:

-------------------------------------------------------------------------------
Profile listing generated Thu Jun  5 13:23:14 1997
    with:     prof generic.dsc_hwc.m5999 
-------------------------------------------------------------------------------
  
     Counter               : Sec cache D misses
     Counter overflow value: 131
     Total number of ovfls : 10
     CPU                   : R10000
     FPU                   : R10010
     Clock                 : 196.0MHz
     Number of CPUs        : 1
    
-------------------------------------------------------------------------------
  -p[rocedures] using counter overflow.
  Sorted in descending order by the number of overflows in each procedure.
  Unexecuted or inlined procedures are excluded.
-------------------------------------------------------------------------------
  
  overflows(%)  cum overflows(%)      procedure (dso:file)
  
      4( 40.0)          4( 40.0)         memcpy (/usr/lib32/libc.so.1:bcopy.s)
      2( 20.0)          6( 60.0)         anneal (generic:generic.c)
      2( 20.0)          8( 80.0)        init2da (generic:generic.c)
1( 10.0)          9( 90.0)       _smalloc (/usr/lib32/libc.so.1:malloc.c)
      1( 10.0)         10(100.0)        _doprnt (/usr/lib32/libc.so.1:doprnt.c)
  
     10(100.0)                      TOTAL

Analyzing the Report

The columns in the report provide the following information:

The overflows(%) column shows the number of overflows caused by the function, and the percentage of that number over the total number of overflows in the program.
The cum overflows(%) column shows a cumulative number and percentage of overflows. For example, the anneal() function shows two overflows, but the cumulative number of overflows is six. Two overflows come from anneal() and four come from memcpy().
The procedure (dso:file) column shows the procedure name and the DSO and filename that contain the procedure.

ideal Experiment

This section takes you through the steps to perform an ideal experiment. For more information on ideal, see the "ideal Experiment" section in Chapter 4, "Experiment Types."

Performing an ideal Experiment

From the command line, enter

ssrun -ideal generic

This starts the experiment. First the executable and libraries are instrumented using pixie. This entails making copies of the libraries and executables, giving the copies an extension that depends on the ABI, and inserting information into the copies. The extension is .pixie for the executable, .pix32 for all 32 libraries, .pixn32 for all n32 libraries, and .pix64 for all 64 libraries.

Output from generic and from ssrun is printed to stdout. A data file is also generated. The name consists of the process name (generic), the experiment type, ideal, and the experiment ID. In this example, the filename is generic.ideal.m14517, and the output to stdout looks like the following:

Beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libss.so
        /usr/lib32/libm.so
        /usr/lib32/libc.so.1
Ending libraries, beginning "generic"
...
Beginning libraries
Ending libraries, beginning "dlslave.so"
...

The output section that starts with Beginning libraries and ends with Ending libraries, beginning `generic' tells you that ssrun is instrumenting first the libraries listed in the executable, then the generic executable itself. The section that starts Beginning libraries and ends with Ending libraries, beginning `dlslave.so' is added when the DSO dlslave.so is dlopen'd.

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > ideal.results

This command redirects output to a file called ideal.results. The file contains results that look similar to the following partial listing:

Prof run at: Sun May 19 17:49:10 1996
 Command line: prof generic.ideal.m14517 
 
         2662778531: Total number of cycles
          17.75186s: Total execution time
         1875323907: Total number of instructions executed
              1.420: Ratio of cycles / instruction
                150: Clock rate in MHz
              R4000: Target processor modelled
---------------------------------------------------------
Procedures sorted in descending order of cycles executed.
Unexecuted procedures are not listed. Procedures
beginning with *DF* are dummy functions and represent
init, fini and stub sections.
---------------------------------------------------------
        cycles(%)  cum %     secs    instrns    calls procedure(dso:file)
 
2524610038(94.81)  94.81    16.83 1797940023     1 anneal(generic.pixie:/usr
          /demos/SpeedShop/generic/generic.c)
 135001332( 5.07)  99.88     0.90   75000822     1 slaveusrtime
          (./dlslave.so.pixn32:/usr/demos/SpeedShop/generic/dlslave.c)
   1593422( 0.06)  99.94     0.01    1378737  4382 memcpy(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/strings/bcopy.s)
    735797( 0.03)  99.97     0.00     506627  4123 fread(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/stdio/fread.c)
    187200( 0.01)  99.98     0.00     124800  1600 next(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
    136116( 0.01)  99.98     0.00      82498     1 iofile(generic.pixie:
          /usr/demos/SpeedShop/generic/generic.c)
     91200( 0.00)  99.98     0.00      62400  1600 _drand48(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/math/drand48.c)
     85497( 0.00)  99.99     0.00      56518     1 init2da(generic.pixie:
          /usr/demos/SpeedShop/generic/generic.c)
     74095( 0.00)  99.99     0.00      28063   628 __sinf(./libm.so.pixn32:
          /work/cmplrs/libm/fsin.c)
     56192( 0.00)  99.99     0.00       9360    16 offtime(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/gen/time_comm.c)
     51431( 0.00)  99.99     0.00      36405    35 _doprnt(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/print/doprnt.c)
     27951( 0.00) 100.00     0.00      19670   259 __filbuf(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/stdio/_filbuf.c)
     21392( 0.00) 100.00     0.00      10136    58 fwrite(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/stdio/fwrite.c)
     12744( 0.00) 100.00     0.00       9497   231 _readdir(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/gen/readdir.c)
      9960( 0.00) 100.00     0.00       7536    96 _xflsbuf(./libc.so.1.pixn32:
          /work/irix/lib/libc/libc_n32_M3/stdio/flush.c)
      7211( 0.00) 100.00     0.00       3959     1 dirstat(generic.pixie:
          /usr/demos/SpeedShop/generic/generic.c)

Analyzing the Report

The columns in the report provide the following information

The cycles(%) column reports the number and percentage of machine cycles used for the procedure. For example, 2524610038 cycles, or 94.81% of cycles were spent in the anneal() procedure.
The cum% column shows the cumulative percentage of cycles. For example, 99.88% of all cycles were spent between the top two functions in the listing: anneal() and slaveusrtime().
The secs column shows the number of seconds spent in the procedure. For example, 16.83 seconds were spent in the anneal() procedure. The time represents an idealized computation based on modelling the machine. It ignores potential floating point interlocks and memory latency time (cache misses and memory bus contention).
The instrns column shows the number of instructions executed for a procedure. For example, there were 1797940023 instructions devoted to the anneal() procedure.
The calls column reports the number of calls to the procedure. For example, there was just one call to the anneal() procedure.
The procedure (dso:file) column lists the procedure, its DSO name and filename. For example, the first line reports statistics for the procedure anneal() in the file generic.c in the generic executable.

fpe Trace

This section takes you through the steps to perform an fpe trace. For more information on the fpe trace, see the "fpe Trace" section in Chapter 4, "Experiment Types."

Performing an fpe Trace

From the command line, enter

ssrun -fpe generic

This starts the experiment. Output from generic and from ssrun is printed to stdout. A data file with a name generated by concatenating the process name, generic, the experiment type, fpe, and the experiment ID is also generated. In this example, the filename is generic.fpe.m18823.

Generating a Report

To generate a report on the data collected, enter the following at the command line:

prof your_output_file_name > fpe.results

The report should look similar to the following partial listing:

-----------------------------------------------------------------------
Profile listing generated Mon Nov 18 11:46:33 1996
    with:     prof generic.fpe.m18823 
-----------------------------------------------------------------------
Total FPEs            : 4
Stack backtrace failed: 0
CPU                   : R4600
FPU                   : R4600
Clock                 : 100.0MHz
Number of CPUs        : 1
-----------------------------------------------------------------------
index  %FPEs     self descendents  total    name
[1]    100.0%       0           4       4   __start
[2]    100.0%       0           4       4   main
[3]    100.0%       0           4       4   Scriptstring
[4]    100.0%       4           0       4   fpetraps

Analyzing the Report

The report shows information for each function. The function names are shown in the right-hand column of the report. The remaining columns are described below:

The index column provides an index number for reference.
The %FPEs column shows the percentage of the total number of floating point exceptions that were found in the function.
The self column shows how many floating point exceptions were found in the function. For example, four floating point exceptions were found in fpetraps.
The descendents columns shows how many floating point exceptions were found in the descendents of the function. For example, four floating point exceptions were found in the descendents of main().
The total column provides information on the number of floating point exceptions out of the total that were found.

This concludes the tutorial.

Prev	Table of Contents	Next
Chapter 1. Introduction to Performance Analysis		Chapter 3. Tutorial for Fortran Users