
 4-2  FLOATING-POINT NUMBERS - CONCRETE EXAMPLE 
 **********************************************

 IEEE/REAL*4
 -----------
 To make things more concrete, let's look at a typical floating-point
 representation for a REAL (SINGLE PRECISION): the IEEE basic
 (unextended) single-precision format (ANSI/IEEE Std 754-1985), which
 became a de facto standard on workstations. The '*4' is a non-standard
 notation meaning that 4 bytes are allocated for the representation.

 A schematic description of the representation follows. The 4 bytes
 contain 32 bits that are partitioned into 3 fields: 1 sign bit,
 8 exponent bits and 23 fraction bits (the letter 'S' in the left
 part is short for 'Sign'):


  +-+--------+-----------------------+
  |S|  exp   |       fraction        |
  +-+--------+-----------------------+              Direction of 
   ^                                ^     <---  increasing addresses
 Bit31                             Bit0        (See discussion below)


 A formula that gives the value of this float is:

   Value = (-1)**S  X  1.fffffffffffffffffffffff  X  2**(exp - 127)

 The most significant bit (MSB) is the sign bit: it is 0 for a positive
 number and 1 for a negative number.

 The next 8 bits hold the exponent, which is BIASED by 127 (see the
 formula above): the stored field takes the values 0 to 255, so the
 true exponent lies in the range [-127, 128].

 The remaining 23 bits are taken as the binary digits of a binary fraction
 whose "whole part" is 1 (see the formula above). This condition is just
 the normalization condition.

 An IEEE normalized mantissa always has a leading '1' bit, so that bit is
 redundant and need not be stored (the 'hidden bit', an old 'trick'
 attributed to I. B. Goldberg). This 'saves' one bit that can be used
 to improve the precision.
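
 The formula and the three fields can be checked with a short program.
 What follows is only a sketch under stated assumptions: REAL and INTEGER
 both occupy 32 bits, REAL is IEEE single precision, and EQUIVALENCE lets
 us read the bits of a REAL as an INTEGER (common practice, but not
 guaranteed by the standard). The names FIELDS, N, S, E, F and V are
 made up for the illustration.

```fortran
      PROGRAM FIELDS
C     Split a REAL into its sign, exponent and fraction fields.
C     ASSUMPTIONS: REAL and INTEGER both occupy 32 bits, REAL is
C     IEEE single precision, EQUIVALENCE exposes the raw bits.
      REAL              X, V
      INTEGER           IX, N, S, E, F
      EQUIVALENCE       (X, IX)
C     ------------------------------------------------------------------
      X = -6.5
      N = IX
      S = 0
      IF (N .LT. 0) THEN
C       Sign bit is set: clear it by adding 2**31 in two steps,
C       so that no intermediate result overflows
        S = 1
        N = (N + 2147483647) + 1
      ENDIF
C     Now 0 <= N < 2**31: upper 8 bits = exponent field,
C     lower 23 bits = fraction field  (8388608 = 2**23)
      E = N / 8388608
      F = MOD(N, 8388608)
C     Rebuild the value from the formula (normalized numbers only)
      V = (-1.0)**S * (1.0 + REAL(F) / 8388608.0) * 2.0**(E - 127)
      WRITE(*,*) ' sign = ', S, ' exp = ', E, ' fraction = ', F
      WRITE(*,*) ' value rebuilt from the formula = ', V
      END
```

 For X = -6.5 = -1.625 * (2 ** 2), a machine that satisfies the
 assumptions should report sign = 1, exp = 129 (that is 2 + 127) and
 fraction = 5242880 (that is 0.625 * (2 ** 23)), and the rebuilt value
 should again be -6.5.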

 The following program may help you examine the structure of a REAL on
 your machine. It is based on the plausible assumptions that a REAL and
 an INTEGER occupy the same number of bits (32), and that integers are
 represented in two's complement format.

 Of course we could use the Z edit descriptor, but it is not standard 
 FORTRAN 77, and so may not be implemented by all compilers.


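      PROGRAM RELREP
C     ------------------------------------------------------------------
C     X and IX share the same storage, so IX holds the bit pattern
C     of X viewed as an INTEGER. Passing X itself to the INTEGER
C     dummy argument of BINREP is rejected by many compilers.
C     ------------------------------------------------------------------
      REAL
     *              X
      INTEGER
     *              IX
      EQUIVALENCE
     *              (X, IX)
C     ------------------------------------------------------------------
      WRITE(*,*) ' Enter a REAL number: '
      READ(*,*) X
      CALL BINREP(IX)
C     ------------------------------------------------------------------
      END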


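      SUBROUTINE BINREP(INT)
C     ------------------------------------------------------------------
C     Print the 32 bits of INT, most significant bit first.
C     Assumes two's complement; INT is overwritten (destroyed).
C     ------------------------------------------------------------------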
      INTEGER
     *              I,
     *              INT
C     ------------------------------------------------------------------
      CHARACTER
     *              B*32
C     ------------------------------------------------------------------
      IF (INT .GE. 0) THEN
        B(1:1) = '0'
        DO I = 32, 2, -1
          IF (MOD(INT,2) .EQ. 0) THEN
            B(I:I) = '0'
          ELSE
            B(I:I) = '1'
          ENDIF
          INT = INT / 2
        ENDDO
      ELSE
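        B(1:1) = '1'
C       In two's complement the bits of a negative INT are the
C       complement of the bits of ABS(INT + 1), hence the swapped
C       '0' and '1' in the loop below
        INT = ABS(INT + 1)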
        DO I = 32, 2, -1
          IF (MOD(INT,2) .EQ. 0) THEN
            B(I:I) = '1'
          ELSE
            B(I:I) = '0'
          ENDIF
          INT = INT / 2
        ENDDO
      ENDIF
C     ------------------------------------------------------------------
      WRITE(*,*) '   ', B(1:8),' ', B(9:16),' ', B(17:24),' ', B(25:32)
      WRITE(*,*) '   ........ ........ ........ ........ '
      WRITE(*,*) '   10987654 32109876 54321098 76543210 '
      WRITE(*,*) '    3          2          1            '
      WRITE(*,*) ' '
C     ------------------------------------------------------------------
      RETURN
      END



 Special numbers
 --------------- 
 Using normalized mantissas raises a little problem: how can zero be
 represented when the mantissa is never allowed to be zero?
 The IEEE solution is to represent the number zero by a zero fraction
 and a zero exponent field. No condition is imposed on the SIGN BIT,
 so we have two 'zeros', +0 and -0!

 Remember that the exponent is biased by 127, so a zero exponent field
 would otherwise mean that the binary fraction is 'multiplied' by
 (2 ** (-127)). In other words, the minimal exponent is reserved to
 represent zero. (A zero exponent field with a non-zero fraction
 encodes the 'denormalized' numbers, which are smaller than any
 normalized number; they are ignored in the rest of this section.)

 There is also an internal representation for 'INFINITY'. It consists
 of the maximal exponent field = 255 (128 after debiasing) and all
 fraction bits = 0. The sign bit is free, so we also have two
 'infinities', one positive and one negative.

 An even stranger phenomenon is the class of bit patterns called NaNs.
 A NaN has exponent field = 255 (128 after debiasing) and fraction bits
 which are not all 0. NaN is short for 'Not A Number'.
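
 The bit patterns just described can be built by hand. The sketch below
 is an illustration only, and assumes a 32-bit IEEE REAL, a 32-bit
 two's complement INTEGER, and that EQUIVALENCE exposes the raw bits
 (common, but not guaranteed by the standard). The names SPECL, PZERO
 and MZERO are made up, and the way infinities and NaNs are printed
 depends on the compiler.

```fortran
      PROGRAM SPECL
C     Build the special values from their bit patterns.
C     ASSUMPTIONS: 32-bit IEEE REAL, 32-bit two's complement
C     INTEGER, EQUIVALENCE exposes the raw bits.
      REAL              X, PZERO, MZERO
      INTEGER           IX
      EQUIVALENCE       (X, IX)
C     ------------------------------------------------------------------
C     All bits 0 is +0, only the sign bit set is -0
      IX = 0
      PZERO = X
      IX = -2147483647 - 1
      MZERO = X
      IF (PZERO .EQ. MZERO) WRITE(*,*) ' +0 and -0 compare equal'
C     Exponent field = 255, fraction = 0:  255 * 2**23
      IX = 2139095040
      WRITE(*,*) ' +INFINITY: ', X
C     The same pattern with the sign bit set
      IX = -8388608
      WRITE(*,*) ' -INFINITY: ', X
C     Exponent field = 255, fraction not 0 (top fraction bit set)
      IX = 2139095040 + 4194304
      WRITE(*,*) ' NaN:       ', X
C     In IEEE arithmetic a NaN compares unequal even to itself
      IF (X .NE. X) WRITE(*,*) ' X .NE. X holds for a NaN'
      END
```

 The two zeros have different bit patterns but compare equal, while a
 NaN is not even equal to itself; the test X .NE. X is an old way of
 detecting a NaN.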

 The special numbers (except zero) were invented in order to implement 
 NON-STOP ARITHMETIC. Instead of aborting the program when an
 intermediate calculation gives a bad result (an overflow, or an
 undefined operation such as 0/0), the result is replaced by the
 appropriate special number and computation continues.

 IEEE arithmetic implements an extension of the real number system:
 the quantities +INFINITY, -INFINITY and the NaNs are added to the 
 real numbers, and arithmetic operations involving them are defined 
 in a plausible way. Many users find this extension confusing and 
 not very useful.
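
 Non-stop arithmetic is easy to observe. The following sketch is an
 illustration only, and assumes IEEE REAL arithmetic with floating-point
 traps disabled (a common default, but compilers and options differ).
 The names NONSTP, BIG, X, Y and Z are made up, and the printed form
 of the special values is compiler dependent.

```fortran
      PROGRAM NONSTP
C     Non-stop arithmetic: an overflow and an invalid operation.
C     ASSUMPTIONS: IEEE REAL arithmetic, floating-point traps
C     disabled (otherwise the program may abort instead).
      REAL              BIG, X, Y, Z
C     ------------------------------------------------------------------
      BIG = 3.0E38
C     Overflow: the exact product is too large for a REAL,
C     the result is replaced by +INFINITY
      X = BIG * 10.0
C     INFINITY - INFINITY has no sensible value, the result is a NaN
      Y = X - X
C     A finite number divided by INFINITY gives zero
      Z = 1.0 / X
      WRITE(*,*) ' BIG * 10.0 = ', X
      WRITE(*,*) ' X - X      = ', Y
      WRITE(*,*) ' 1.0 / X    = ', Z
      WRITE(*,*) ' The program ran to the end'
      END
```

 Under these assumptions the three results are +INFINITY, a NaN and
 zero, and the last WRITE is reached. If traps are enabled, the
 program may instead stop at the overflow.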



 The 'representation density' of IEEE/REAL*4
 -------------------------------------------
 What is the spacing between two consecutive floating-point numbers?

 Positive normalized FPNs are the product of a 'normalized' binary
 fraction (a leading 1 followed by 23 binary digits) and  (2 ** e),
 where  e  is in [-126,127].

 Remember that the exponents -127 and +128 are reserved: -127 for
 zero, +128 for the infinities and the NaNs.

 The positive 'normal' FPNs can be partitioned into 254 disjoint sets,
 one for each possible exponent, each set containing  (2 ** 23)  numbers,
 one for each possible binary fraction of length 23.

 The spacing between consecutive numbers belonging to the same set
 is the same, and equals  (2 ** (-23)) * (2 ** e) = 2 ** (e - 23).

 It is clear that the spacing increases when e (and the magnitude
 of the number) increases.

 The minimal positive normalized FPN is  (+1.0) * (2 ** (-126)),
 that is  2 ** (-126);  the spacing in that region is
 (2 ** (-126 - 23)) = 2 ** (-149).

 We see that the minimal positive normalized FPN is MUCH LARGER (by a
 factor of 2 ** 23) than the local spacing.
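
 The spacing can also be measured directly. The sketch below is an
 illustration only, and assumes a 32-bit IEEE REAL, a 32-bit INTEGER,
 and that EQUIVALENCE exposes the raw bits. It relies on the fact that
 adding 1 to the bit pattern of a positive finite REAL gives the next
 larger REAL. The names GAPS, START, Y, IY and K are made up.

```fortran
      PROGRAM GAPS
C     Measure the gap between a REAL and the next larger REAL.
C     ASSUMPTIONS: 32-bit IEEE REAL, 32-bit INTEGER, EQUIVALENCE
C     exposes the raw bits.
      REAL              X, Y, START(4)
      INTEGER           IY, K
      EQUIVALENCE       (Y, IY)
      DATA START        / 1.0, 1024.0, 1.0E10, 1.0E-30 /
C     ------------------------------------------------------------------
      DO 10 K = 1, 4
        X  = START(K)
        Y  = X
C       Step to the next bit pattern, i.e. the next larger REAL
        IY = IY + 1
        WRITE(*,*) ' X = ', X, '   gap to next REAL = ', Y - X
   10 CONTINUE
      END
```

 Under these assumptions the gap at 1.0 (e = 0) is  2 ** (-23), about
 1.19E-07, the gap at 1024.0 (e = 10) is  2 ** (-13), and the gap at
 1.0E10 (e = 33) is already  2 ** 10 = 1024.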



 The number space of IEEE/REAL*4
 -------------------------------
 If we translate the binary data from the previous sections to
 decimal, we find that the range of numbers that can be represented
 by the IEEE REAL*4 is:

        (-3.4 X 10**+38, +3.4 X 10**+38)

 Because the minimal normalized FPN is so much larger than the nearby
 spacing, it is more instructive to look at that range as the union
 of three separate segments:

        (-3.4 X 10**+38, -1.2 X 10**-38)

        (0.0)

        (+1.2 X 10**-38, +3.4 X 10**+38)

 In this floating-point representation we have a finite set of numbers
 filling the three ranges; in the two outer ranges the 'density' varies,
 and is highest near zero.
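
 The end points of the positive segment can be produced from their bit
 patterns. The sketch below is an illustration only, and assumes a
 32-bit IEEE REAL, a 32-bit INTEGER, and that EQUIVALENCE exposes the
 raw bits. The name LIMITS is made up.

```fortran
      PROGRAM LIMITS
C     Print the end points of the positive segment.
C     ASSUMPTIONS: 32-bit IEEE REAL, 32-bit INTEGER, EQUIVALENCE
C     exposes the raw bits.
      REAL              X
      INTEGER           IX
      EQUIVALENCE       (X, IX)
C     ------------------------------------------------------------------
C     Smallest normalized: exponent field = 1, fraction = 0  (2**23)
      IX = 8388608
      WRITE(*,*) ' Smallest positive normalized REAL: ', X
C     Largest finite: exponent field = 254, all 23 fraction bits = 1
      IX = 2139095039
      WRITE(*,*) ' Largest finite REAL:               ', X
      END
```

 Under these assumptions the two values are about 1.18E-38 and 3.40E+38.
 In Fortran 90 the intrinsic functions TINY and HUGE return the same
 two numbers without any bit manipulation.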


  +---------------------------------------------------------------------+
  |     SUMMARY                                                         |
  |     =======                                                         |
  |     1) IEEE/REAL*4 = 1 Sign bit, 8 exponent bits, 23 mantissa bits  |
  |     2) Special bit patterns encode +-0, +-INFINITY and the NaNs     |
  |     3) The number space is discrete, made of three parts, and has   |
  |         maximal 'density' near zero                                 |
  +---------------------------------------------------------------------+