
 3-10  PERFORMANCE MEASUREMENT
 ****************************

 (The CPU time measurement techniques on VMS are due to Arne Vajhoej)

 Several test suites are used to compare the performance of computers 
 by measuring the CPU time used in running a standard set of subprograms, 
 and performing some kind of averaging.

 Remember that such tests really measure the combined performance of 
 the hardware (CPU, memory, I/O devices) and the software (Operating 
 System, compiler used, compiler switches used), so any change in the 
 configuration will change the results.

 Hardware and software change all the time, so results should be 
 accompanied by a description of the hardware and software used in the 
 test. Even a different version of the compiler may optimize differently 
 and change the results.

 The most popular test suite is probably SPECfp92 (or SPECfp95).

 We will discuss only a special case, measuring the performance of your 
 own program.


 Wall-clock time vs. CPU time
 ----------------------------
 Wall-clock time is the time elapsed between the invocation and 
 termination of your program; you can measure it with an ordinary
 clock.

 CPU time is the total time during which your program had access to
 the CPU during the run; if there is more than one CPU, the times
 used on each are added together. Operating systems usually keep
 records of CPU usage, so you can ask your OS for the CPU time. 

 Of course, CPU time on a single-CPU machine will be less than
 wall-clock time, because the CPU is shared with various system
 processes and possibly with other users.
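
 The difference between the two times can be observed from inside a
 program. The following is only a minimal sketch: it assumes a compiler
 that supports the Fortran 95 intrinsics CPU_TIME and SYSTEM_CLOCK
 (they are not part of FORTRAN 77), and the program name WALLCPU and
 the test loop are illustrative choices.

```fortran
      PROGRAM WALLCPU
C     Compare elapsed (wall-clock) time with CPU time for one loop.
C     Assumes a Fortran 95 compiler: CPU_TIME and SYSTEM_CLOCK are
C     standard intrinsics there, not part of FORTRAN 77.
      INTEGER I, COUNT1, COUNT2, RATE
      REAL    CPU1, CPU2
      REAL*8  TMP8
C     ------------------------------------------------------------------
      CALL SYSTEM_CLOCK(COUNT1, RATE)
      CALL CPU_TIME(CPU1)
C     ------------------------------------------------------------------
      TMP8 = 0.0D0
      DO I = 1, 1000000
        TMP8 = TMP8 + EXP(-DBLE(I))
      ENDDO
      WRITE(*,*) ' Sum of loop: ', TMP8
C     ------------------------------------------------------------------
      CALL CPU_TIME(CPU2)
      CALL SYSTEM_CLOCK(COUNT2)
C     RATE is zero when the system has no clock; a possible wrap-around
C     of the clock count on very long runs is ignored in this sketch
      IF (RATE .GT. 0) THEN
        WRITE(*,*) ' Wall-clock time (s): ',
     *             REAL(COUNT2 - COUNT1) / REAL(RATE)
      ENDIF
      WRITE(*,*) ' CPU time (s):        ', CPU2 - CPU1
      END
```

 On a lightly loaded single-CPU machine the two figures should be
 close; on a busy machine the wall-clock time can be much larger.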


 Measuring performance
 ---------------------
 The CPU time needed to run a program (concurrently or not) varies 
 'randomly' when other programs run, because the 'workload' they cause 
 is usually not constant. Possible mechanisms responsible for this 
 phenomenon are:

    1) Cache contamination: other processes that run between the
       CPU time-slices your program gets evict its data from the
       cache, forcing your program to use the much slower main
       memory.

    2) When more processes share the 'physical' memory, more PAGING
       is needed; paging takes CPU time that may be charged to 
       your program (VMS).

 Because of the uncontrollable variations in the workload, we must
 adopt a statistical approach, and compute the average of many 
 time measurements. 
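
 The statistical approach can be sketched as follows: the same loop
 is timed several times, and the mean and standard deviation of the
 CPU times are reported. This is an illustration only; it assumes a
 compiler with the Fortran 95 intrinsic CPU_TIME, and the program name
 AVGTIM, the number of runs NRUN and the loop length are arbitrary
 choices.

```fortran
      PROGRAM AVGTIM
C     Repeat one timed loop NRUN times and report the mean and the
C     standard deviation of the measured CPU times.
C     Assumes a Fortran 95 compiler (CPU_TIME intrinsic); NRUN and
C     the loop length are arbitrary illustrative values.
      INTEGER NRUN
      PARAMETER (NRUN = 20)
      INTEGER I, K
      REAL    T1, T2, T(NRUN), TMEAN, TSDEV
      REAL*8  TMP8, CHECK
C     ------------------------------------------------------------------
      CHECK = 0.0D0
      DO K = 1, NRUN
        CALL CPU_TIME(T1)
        TMP8 = 0.0D0
        DO I = 1, 1000000
          TMP8 = TMP8 + EXP(-DBLE(I))
        ENDDO
        CALL CPU_TIME(T2)
        T(K) = T2 - T1
C       Use the result, so the compiler has a reason to keep the loop
        CHECK = CHECK + TMP8
      ENDDO
C     ------------------------------------------------------------------
      TMEAN = 0.0
      DO K = 1, NRUN
        TMEAN = TMEAN + T(K)
      ENDDO
      TMEAN = TMEAN / REAL(NRUN)
C     Sample standard deviation (divides by NRUN - 1)
      TSDEV = 0.0
      DO K = 1, NRUN
        TSDEV = TSDEV + (T(K) - TMEAN)**2
      ENDDO
      TSDEV = SQRT(TSDEV / REAL(NRUN - 1))
C     ------------------------------------------------------------------
      WRITE(*,*) ' Checksum:           ', CHECK
      WRITE(*,*) ' Mean CPU time (s):  ', TMEAN
      WRITE(*,*) ' Std. deviation (s): ', TSDEV
      END
```

 A standard deviation that is large compared with the mean suggests
 that the machine was too busy for the measurement to be trusted, or
 that the timed section is too short for the clock resolution.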


 Recommended tests
 -----------------
 1) Measure the relative performance of REAL*4 and REAL*8 on your
    machine; you will probably get similar results. Try REAL*16 as
    well if your compiler supports it.

 2) Measure the performance degradation when forcing array bounds
    checking with a suitable compiler switch (see chapter on 
    compiler switches).

 3) Measure how performance changes when you vary the compiler
    'optimization level'.


 Measuring CPU time on VMS
 =========================
 Use one of the "clock" functions from inside your program; that
 way you avoid the overhead associated with image startup and
 image rundown. 

 These functions return the elapsed CPU time used by the process
 calling them and its subprocesses, or the CPU time elapsed since a
 "timer" was started.


  "Clock functions" on VMS
  ------------------------
 There are many "clock functions" on VMS:

   1) The triad  LIB$INIT_TIMER/LIB$STAT_TIMER/LIB$FREE_TIMER is
      supposed to be the "official clock function". 

      The triad should be called in the sequence specified above. 
      LIB$INIT_TIMER allocates some memory you should free with 
      LIB$FREE_TIMER. 

      A good program should establish a control/y handler to 
      deallocate that memory on user interrupt.

   2) LIB$GETJPI, the simplified interface to SYS$GETJPI, is the 
      recommended "clock function" (see example below). The time
      unit it uses is 10 milliseconds.

   3) SYS$GETJPI, the system service all "clock functions" use.


 The following are unsupported and undocumented:

   1) PAS$CLOCK2, a Pascal "clock function" found on every VMS.
   2) PAS$CLOCK, another Pascal "clock function".
   3) DECC$CLOCK, DECC "clock function".
   4) CLOCK, VAXC "clock function", link against VAXCRTL. 


 An example program:


      PROGRAM CPUCLK
C     ------------------------------------------------------------------
      INCLUDE '($JPIDEF)'
C     ------------------------------------------------------------------
      INTEGER       
     *              TIME1, TIME2,
     *              I
C     ------------------------------------------------------------------
      REAL*4
     *              TMP4
C     ------------------------------------------------------------------
      REAL*8
     *              TMP8
C     ------------------------------------------------------------------
      REAL*16
     *              TMP16
C     ------------------------------------------------------------------
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME1)
C     ------------------------------------------------------------------
      TMP4 = 0.0
      DO I = 1, 1000000
        TMP4 = TMP4 + EXP(-REAL(I))
      ENDDO
      WRITE(*,*) ' Sum of loop (REAL*4): ', TMP4
C     ------------------------------------------------------------------
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME2)
      WRITE(*,*) ' Time of loop (REAL*4): ', TIME2 - TIME1
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME1)
C     ------------------------------------------------------------------
C     DBLE keeps the EXP evaluation in REAL*8 rather than REAL*4
      TMP8 = 0.0D0
      DO I = 1, 1000000
        TMP8 = TMP8 + EXP(-DBLE(I))
      ENDDO
      WRITE(*,*) ' Sum of loop (REAL*8): ', TMP8
C     ------------------------------------------------------------------
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME2)
      WRITE(*,*) ' Time of loop (REAL*8): ', TIME2 - TIME1
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME1)
C     ------------------------------------------------------------------
C     QEXT (a DEC extension) keeps the EXP evaluation in REAL*16
      TMP16 = 0.0Q0
      DO I = 1, 1000000
        TMP16 = TMP16 + EXP(-QEXT(I))
      ENDDO
      WRITE(*,*) ' Sum of loop (REAL*16): ', TMP16
C     ------------------------------------------------------------------
      CALL LIB$GETJPI(JPI$_CPUTIM,,,TIME2)
      WRITE(*,*) ' Time of loop (REAL*16): ', TIME2 - TIME1
C     ------------------------------------------------------------------
      END


 This example program compares floating-point types of different 
 sizes. The program is heavily compute bound; the only result-related 
 I/O is a single WRITE statement per loop (added so that the compiler 
 cannot optimize away the whole computation).

 Other useful measurements, such as the number of page faults (an 
 excessive number may indicate bad memory access patterns), can be 
 made by using other 'item codes' instead of JPI$_CPUTIM in the call 
 to LIB$GETJPI. 
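
 As an illustration, the sketch below counts page faults for two
 traversal orders of the same array. It runs on VMS only and rests on
 assumptions: it uses the item code JPI$_PAGEFLTS from $JPIDEF with
 the same call form as the CPU time example, and the program name
 PGFLT and the array size N are arbitrary choices. Whether the two
 counts differ at all depends on the working set available to the
 process.

```fortran
      PROGRAM PGFLT
C     VMS only. Counts the page faults caused by summing one array
C     in column order and then in row order.
C     Assumptions: JPI$_PAGEFLTS is the page fault item code from
C     $JPIDEF; the array size N is an arbitrary illustrative value.
      INCLUDE '($JPIDEF)'
      INTEGER N
      PARAMETER (N = 1000)
      INTEGER I, J, FLT1, FLT2
      REAL*8  A(N,N), S
C     ------------------------------------------------------------------
C     Touch every element once, so first-use faults are not measured
      DO J = 1, N
        DO I = 1, N
          A(I,J) = 1.0D0
        ENDDO
      ENDDO
C     ------------------------------------------------------------------
C     Column order: the first index varies fastest (storage order)
      CALL LIB$GETJPI(JPI$_PAGEFLTS,,,FLT1)
      S = 0.0D0
      DO J = 1, N
        DO I = 1, N
          S = S + A(I,J)
        ENDDO
      ENDDO
      CALL LIB$GETJPI(JPI$_PAGEFLTS,,,FLT2)
      WRITE(*,*) ' Sum (column order): ', S
      WRITE(*,*) ' Page faults (column order): ', FLT2 - FLT1
C     ------------------------------------------------------------------
C     Row order: successive accesses are N elements apart in memory
      CALL LIB$GETJPI(JPI$_PAGEFLTS,,,FLT1)
      S = 0.0D0
      DO I = 1, N
        DO J = 1, N
          S = S + A(I,J)
        ENDDO
      ENDDO
      CALL LIB$GETJPI(JPI$_PAGEFLTS,,,FLT2)
      WRITE(*,*) ' Sum (row order): ', S
      WRITE(*,*) ' Page faults (row order):    ', FLT2 - FLT1
      END
```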

 To get the names of other 'item codes', look at the documentation of
 LIB$GETJPI and SYS$GETJPI, or extract text module '$JPIDEF' from
 the text library SYS$LIBRARY:FORSYSDEF.TLB by:

   LIBRARY/TEXT/EXTRACT=$JPIDEF/OUTPUT=TMP.TMP SYS$LIBRARY:FORSYSDEF.TLB

 

 Measuring performance on UNIX
 =============================

 The "time" command
 ------------------
 Syntax:

     time command

 Runs an executable file and displays information on system 
 resources used.

 Try something simple like:  time ls

 The default fields in the tcsh shell are, from left to right:

     suffix  description
     ======  ========================================================
       u     CPU time (seconds) spent in user mode (your own code)
       s     CPU time (seconds) spent in kernel mode on behalf of
             the program (system calls, I/O and other activities)
     (none)  Elapsed (wall-clock) time
       %     Total CPU time / elapsed time (may exceed 100%)
       k     Average shared + unshared memory used (Kbytes)
       io    Number of input operations + number of output operations 
       pf    Number of major page faults (pages brought from disk)
       w     Number of times the process was swapped