4-1 FLOATING-POINT NUMBERS - GENERAL VIEW
******************************************
The real number system
----------------------
Scientific and engineering calculations are performed in the REAL
NUMBER SYSTEM, a highly abstract mathematical construct.
A real number is by definition a special infinite set of rational
numbers (integer fractions) - the so-called Dedekind cuts or an
equivalent formulation. The arithmetical operations are defined
between such sets and are a natural extension of the arithmetic of
rational numbers.
The real numbers have wonderful properties:
1) There is no lower or upper bound, in simple language
they go from minus infinity to plus infinity.
2) Infinite density - there is a real number between
any two real numbers.
3) A lot of algebraic axioms are satisfied, e.g. the
'field axioms'.
4) Completeness - they contain all their 'limit points'
(the limit of every converging sequence is also 'real').
5) They are ordered.
Many of these properties are not satisfied by computer arithmetic;
see the chapter on errors in floating-point computations for a short
review of the properties that stay true in floating-point arithmetic.
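For example, even the associativity of addition, one of the 'field
axioms', fails in computer arithmetic. Here is a small demonstration
in C (a sketch; the printed values assume the usual IEEE double
arithmetic with round-to-nearest):

    #include <stdio.h>

    int main(void)
    {
        /* Associativity fails: the grouping decides whether the
           small term survives the rounding. */
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        printf("%g\n", (a + b) + c);  /* 1: the big terms cancel first */
        printf("%g\n", a + (b + c));  /* 0: c is lost rounding b + c   */
        return 0;
    }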
In order to crunch a lot of numbers quickly, computers need a
fixed-size representation of real numbers, so that the hardware can
perform the arithmetical operations efficiently.
The problems arising from using a fixed size representation are the
subject of the following chapters.
Finite number systems are discrete
----------------------------------
If you use a fixed size representation, let's say N binary digits (BITS)
long, you have at most 2**N bit-patterns, and so at most 2**N
representable numbers.
Such a finite set will have to be bounded - have a largest number and a
smallest number. We already have one problem: our computations must
not exceed these bounds.
In every bounded segment, there are infinitely many real numbers, but we
have at most 2**N available bit-patterns, so many real numbers will have
to be represented by one bit-pattern.
Of course, one bit-pattern can't represent many numbers equally well;
it will represent one of them exactly, and the others will be
misrepresented.
We call the numbers that can be represented exactly FLOATING-POINT
NUMBERS (FPN); the term 'real numbers' will be reserved for the
mathematical constructs.
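We can watch this collapse on a real machine. A 32-bit IEEE float
keeps 24 significant bits, so 2**24 + 1 has no bit-pattern of its own
and is rounded onto its neighbor (a C sketch, assuming IEEE floats):

    #include <stdio.h>

    int main(void)
    {
        float a = 16777216.0f;   /* 2**24     - exactly representable */
        float b = 16777217.0f;   /* 2**24 + 1 - rounded to 2**24      */

        /* Both real numbers ended up in the same bit-pattern */
        printf("a = %.1f, b = %.1f, a == b: %d\n", a, b, a == b);
        return 0;
    }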
Roundoff errors are unavoidable
-------------------------------
Before we begin to study actual representations of real numbers,
let us develop further an idea mentioned in the previous section.
We said that in a finite number system, many real numbers will have
to be represented by one bit-pattern, and that bit-pattern will
represent exactly only one of them. In other words many real numbers
will be 'rounded off' to that one bit-pattern.
This 'rounding off' may occur whenever we enter a real number into
the computer (except in the rare case that we enter an exactly
representable number).
The same 'rounding off' may occur whenever we perform an arithmetical
operation. The result of an arithmetical operation will usually have
more binary digits than its operands, and will have to be converted
to one of the 'allowed' bit-patterns.
To make this more concrete, let's have an example using base 10
real numbers, and suppose that only two-digit mantissas are allowed
(the fractional parts may have only 2 decimal digits):
0.12E+02 + 0.34E+00 = 12.00E+00 + 0.34E+00 = 12.34E+00 ==> 0.12E+02
This example is a bit artificial and incompletely defined (in our fixed
representation, only the size of the fractional part was specified,
the exponents were left unspecified), but the idea is clear: computer
arithmetic has to replace almost every number and temporary result by
a rounded form.
Instead of computing:
X + Y
We will really compute:
round(round(X) + round(Y))
The function 'round' can't be specified in general; it depends on the
representation and the floating-point arithmetical algorithms we use.
See the chapter 'radix conversion and rounding' for more information.
A possible implementation of round() for decimal floating-point numbers
(represented in radix 10) is:
    e = INT(LOG10(X) + 1.0)    (number of decimal digits in X
                                before the decimal point)

                 INT(X * (10**(p-e)) + 0.5)
    round(X) = -----------------------------
                         10**(p-e)
The parameter p is the number of decimal digits in the representation.
Note that multiplying and dividing by 10**n are just shifts of the
decimal point, not error-generating arithmetic operations.
Such seemingly complicated formulas can be implemented efficiently
(in radix 2) in hardware or reduced to a very small micro-code program
executed by the CPU.
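The formula above can be transcribed into a few lines of C. This is a
sketch only: it assumes X > 0, uses floor() instead of integer
truncation so that numbers smaller than 0.1 are handled too, and
ignores overflow of the scaled intermediate. The name round_decimal()
is ours:

    #include <math.h>
    #include <stdio.h>

    /* Round x to p significant decimal digits (assumes x > 0) */
    double round_decimal(double x, int p)
    {
        int e = (int)floor(log10(x)) + 1;  /* digits before the point */
        double scale = pow(10.0, p - e);   /* shift of decimal point  */
        return floor(x * scale + 0.5) / scale;
    }

    int main(void)
    {
        /* The two-digit example from above: 12.0 + 0.34 */
        double r = round_decimal(round_decimal(12.0, 2) +
                                 round_decimal(0.34, 2), 2);
        printf("%g\n", r);   /* prints 12: the 0.34 was rounded away */
        return 0;
    }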
In the following sections we will see that roundoff is an endless
source of errors, some of them unexpectedly large.
By the way, the distinction between real and floating-point numbers can
be summarized symbolically in our new notation by:
FPN = round(REAL)
A little basic theory
---------------------
Every real number x can be written in the form:
x = f * (2 ** e)
Where 'e' is an integer called the EXPONENT, and 'f' is a binary
fraction called the MANTISSA. The mantissa may satisfy one of the
normalization conditions:
1 <= |f| < 2 (IEEE)
1/2 <= |f| < 1 (DEC)
The mantissa is then said to be NORMALIZED.
The IEEE normalization condition is equivalent to the requirement
that the MOST SIGNIFICANT BIT (MSB) in the mantissa = 1.
The DEC condition requires the two most significant bits to be 0,1 -
i.e. f = 0.1xxx... in binary.
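Incidentally, the C standard library exposes exactly this
decomposition: frexp() splits a double into f and e with the mantissa
satisfying the DEC-style condition 1/2 <= |f| < 1, and ldexp() puts
them back together. A small sketch:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int e;
        double f = frexp(10.0, &e);   /* f in [0.5, 1), 10 = f * 2**e */

        printf("10.0 = %g * 2**%d\n", f, e);  /* 10.0 = 0.625 * 2**4 */
        printf("check: %g\n", ldexp(f, e));   /* reassembled: 10     */
        return 0;
    }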
On IBM 360, IBM 370 and Nova (Data General) computers, the base of the
exponent was 16 (it gives a larger range at the cost of precision):
x = f * (16 ** e)
The normalization condition was that the first HEX digit of the fraction
was not equal to 0, i.e. the first 4 binary digits were not all 0.
The advantages of normalizing floating-point numbers are:
1) The representation is unique, there is exactly one way to
write a real number in such a form.
2) It's easy to compare two normalized numbers, you separately
test the sign, exponent and mantissa.
3) In a normalized form, a fixed size mantissa will use all
the 'digit cells' to store significant digits.
4) The IEEE and DEC normalization conditions make the
representation always start with a 1-bit; this bit can
be omitted, and its place used for data. The omitted
bit is called the "hidden bit" (see the sketch at the
end of this section).
The normalized representation is used in almost all floating-point
implementations; 'denormalized numbers' are used only to minimize
accuracy loss due to underflow (see next chapter).
Just as with rounding, we will have to normalize after arithmetical
operations; the result will not be normalized in general.
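Here is a C sketch of the hidden bit at work (it assumes 'float' is
an IEEE 32-bit single, true on practically all current machines): we
pull the bit-pattern apart and re-attach the implied leading 1.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        float x = 6.5f;                  /* 6.5 = 1.625 * 2**2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);  /* reinterpret the bits */

        uint32_t sign = bits >> 31;
        int      e    = (int)((bits >> 23) & 0xFF) - 127; /* unbias  */
        uint32_t frac = bits & 0x7FFFFF;                  /* 23 bits */

        /* The hidden bit: the stored fraction means 1.frac */
        double mantissa = 1.0 + frac / 8388608.0;  /* 8388608 = 2**23 */

        printf("sign=%u  e=%d  mantissa=%g\n", sign, e, mantissa);
        /* prints: sign=0  e=2  mantissa=1.625 */
        return 0;
    }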
Floating-point numbers in practice
----------------------------------
In our finite machines, we can keep only a finite number of the binary
digits of 'f' and 'e', let's say 'm' and 'n' digits respectively.
The vendor predetermines a few combinations of 'm' and 'n', usually
one or two combinations that the hardware executes efficiently, and
maybe one more that gives better precision.
The following table compares some floats used in practice, the REAL*n
notation is a common extension to FORTRAN, 'n' is the number of bytes
used in the representation. The representation radix, size (in bits)
of the various parts composing the floating-point number, and the
exponent bias are given.
The number of bits in the fraction part is counted without the
"hidden bit", if normalized mantissas are used, so the sizes here
are "physical" rather than "logical".
Table of float types (incomplete)
=================================
Float name          Radix  Sign  Exponent  Fraction   Bias   Value
----------          -----  ----  --------  --------   ----   -----
IBM 370:
* REAL*4               16    1       7        24         64  0.f * 16**(e-64)
* REAL*8               16    1       7        56         64
VAX:
* REAL*4 (F_FLOAT)      2    1       8        23        128  0.1f * 2**(e-128)
* REAL*8 (D_FLOAT)      2    1       8        55        128  0.1f * 2**(e-128)
* REAL*8 (G_FLOAT)      2    1      11        52       1024  0.1f * 2**(e-1024)
* REAL*16(H_FLOAT)      2    1      15       112      16384  0.1f * 2**(e-16384)
Cray:
  Single precision      2    1      15        48      16384
  Double precision      2    1      15        96      16384
IEEE:
* REAL*4                2    1       8        23        127  1.f * 2**(e-127)
  extended              2    1      11+       31+
* REAL*8                2    1      11        52       1023  1.f * 2**(e-1023)
  extended              2    1      15+       63+
  REAL*10               2    1      15        64      16383
Intel (IEEE):
* Short real            2    1       8        23        127  1.f * 2**(e-127)
* Long real             2    1      11        52       1023  1.f * 2**(e-1023)
  Temp real             2    1      15        64      16383  0.f * 2**(e-16384)
MIL 1750A:
  REAL*4                2   None     8        24       None  f * 2**e
  REAL*8                2   None     ?        ??       None
HP 21MX:
Varian:
Honeywell:
Remarks:
1) Formats that use a sign bit (all except MIL 1750A),
use the sign convention: 0 = +, 1 = -
MIL 1750A uses a 2's complement mantissa with a
2's complement exponent.
2) '*' in the first column means that normalized mantissas
are used. Note that on IBM 370 the first hexadecimal
digit of the fraction (4 bits) couldn't be zero.
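The row of this table that applies to your own machine can be read
out of C's <float.h>. Note that FLT_MANT_DIG counts the "logical"
mantissa bits, hidden bit included, so an IEEE single reports 24
rather than 23:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        printf("float:  radix %d, %d mantissa bits, e in [%d, %d]\n",
               FLT_RADIX, FLT_MANT_DIG, FLT_MIN_EXP, FLT_MAX_EXP);
        printf("double: radix %d, %d mantissa bits, e in [%d, %d]\n",
               FLT_RADIX, DBL_MANT_DIG, DBL_MIN_EXP, DBL_MAX_EXP);
        return 0;
    }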
An important note
-----------------
The next chapter will provide a detailed example that will make the
abstract concepts clearer. To simplify our discussion, we will give
an incomplete treatment of this highly technical subject, without
proofs.
Readers interested in a deeper treatment of these subjects are
referred to:
Goldberg, David
What Every Computer Scientist Should
Know About Floating-Point Arithmetic
ACM Computing Surveys
Vol. 23, No. 1, March 1991, pp. 5-48
+----------------------------------------------------------------------+
|                               SUMMARY                                  |
|                               =======                                  |
| 1) x = f * (2 ** e),  2 > |f| >= 1,  e is an integer                   |
| 2) There are a lot of float types                                      |
| 3) IEEE/REAL*4 = 1 Sign bit, 8 exponent bits, 23 mantissa bits         |
+----------------------------------------------------------------------+