Compilation & linking

 5-2 - Compilation and linking 
 *****************************
 Compilation and link editing are the two main stages in the process
 of translating a High Level Language (HLL) program to machine code 
 that can be executed on a CPU.

 A small glossary
 ----------------
 MACHINE INSTRUCTIONS  Simple commands that the hardware can execute directly
 GENERAL REGISTERS     (or just REGISTERS) Small, very fast memory units 
                       built into the CPU; registers are used in almost all
                       CPU operations.
 ADDRESSING MODES      The ways in which a memory address can be specified 
                       in a machine instruction.
 PROGRAM COUNTER       Special register keeping track of the memory location 
                       of the next machine instruction to be executed.
 CISC ARCHITECTURE     Characterized by: full set of register-to-memory 
                       instructions, variable length instructions, many
                       addressing modes, instructions typically decoded
                       and interpreted by a microprogram.
 RISC ARCHITECTURE     Characterized by: simple fixed-length instructions,
                       memory access only through load/store instructions,
                       few addressing modes, instructions typically decoded 
                       and executed directly by hardware.
 ENTRY POINT           Memory address where execution of a program/routine
                       starts.
 CONTROL TRANSFER      Stopping sequential execution of program code, and
                       jumping to a specified code location, performed by
                       changing the contents of the program counter.
 JUMP                  Unconditional control transfer.
 BRANCH                Conditional control transfer.
 MEMORY REFERENCE      Accessing a specified place in memory that stores
                       some variable or machine instructions.


 Machine instructions
 --------------------
 The CPU is an electronic machine, which constantly fetches 
 MACHINE INSTRUCTIONS from the main memory and executes them.

 Machine instructions are simple commands that the hardware can 
 execute directly. Typical basic instructions are: 

    1) Load a word from a certain location in main memory to a 
       certain GENERAL REGISTER
    2) Perform some simple arithmetic operation (add, multiply,
       compare, etc.) on data kept in a certain pair of registers
    3) Store the contents of a certain register to a certain
       location in main memory
    4) Conditional jump (branch) or unconditional jump to a
       pre-determined location in the program (this is really
       just a memory location, as programs are loaded into 
       memory before execution).
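
 The four kinds of instruction listed above can be sketched with a toy
 interpreter. This is only an illustration written in Python: the opcode
 names (LOAD, ADD, STORE, BRNZ, HALT), the register count and the tuple
 encoding are all assumptions made for the example and do not describe
 any real CPU.

```python
# Toy machine: a hypothetical instruction set, not any real CPU.
def run(program, memory):
    """Fetch and execute instructions until HALT; return the final memory."""
    regs = [0] * 4      # general registers R0..R3
    pc = 0              # program counter: index of the next instruction
    while True:
        op, a, b = program[pc]   # fetch the instruction the PC points to
        pc += 1                  # PC now points to the following instruction
        if op == "LOAD":         # register a <- memory[b]
            regs[a] = memory[b]
        elif op == "ADD":        # register a <- register a + register b
            regs[a] += regs[b]
        elif op == "STORE":      # memory[b] <- register a
            memory[b] = regs[a]
        elif op == "BRNZ":       # branch to instruction b if register a != 0
            if regs[a] != 0:
                pc = b
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op!r}")

# Compute 3 + 2 + 1 by counting memory[0] down to zero.
memory = [3, 0, -1]              # counter, total, the constant -1
program = [
    ("LOAD", 0, 0),   # 0: R0 <- counter
    ("LOAD", 1, 1),   # 1: R1 <- total
    ("LOAD", 2, 2),   # 2: R2 <- -1
    ("ADD", 1, 0),    # 3: total = total + counter
    ("ADD", 0, 2),    # 4: counter = counter - 1
    ("BRNZ", 0, 3),   # 5: repeat from 3 while counter != 0
    ("STORE", 1, 1),  # 6: memory[1] <- total
    ("STORE", 0, 0),  # 7: memory[0] <- counter
    ("HALT", 0, 0),   # 8: stop
]
result = run(program, memory)
print(result)  # [0, 6, -1]
```

 Note that the branch is implemented simply by overwriting the program
 counter, and that the loop body works only on registers; memory is
 touched only by the loads and stores.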

 CPUs, whether hardwired or microcoded, are very limited: every 
 instruction (or micro-instruction) must be implemented by dedicated 
 digital logic circuits, and so can't be too complex.

 Modern CPUs are usually designed according to the RISC (Reduced
 Instruction Set Computer) paradigm. RISC CPUs can execute only a 
 small set of instructions, and so are simpler and faster. Of course, 
 the RISC instructions are chosen so that any needed operation can be 
 performed by a suitable sequence of them.

 The older CISC (Complex Instruction Set Computer) paradigm takes
 the opposite view: the CPU instruction set is made very large, and 
 every instruction can be used with every general register, in any
 addressing mode (orthogonality). But even CISCs can execute only 
 relatively simple machine instructions.

 A common opinion is that modern RISC CPUs are faster because it is 
 more efficient to build complex operations from smaller building 
 blocks, and that the CISC wealth of instructions and modes only slows 
 the CPU and is of little use to HLL compilers.

 Other people maintain that RISCs are now faster than CISCs not 
 because of the inherent superiority of their paradigm, but because 
 RISCs became fashionable and were built using modern technologies 
 (the DEC NVAX chip is an example of a fast modern CISC).


 Addressing modes
 ----------------
 When we load or store a register, or jump or branch, we usually 
 need to specify a memory address. Some (or all) of the following 
 addressing modes are typically available:

    Addressing mode      Location of operand
    -----------------    ----------------------------------------------
    Immediate/Literal    In the machine instruction itself
    Direct/Absolute      Memory location whose address is specified in
                         the instruction
    Indirect             Memory location whose address is in a memory
                         location whose address is specified in the
                         machine instruction.
    Register             In a general register
    Register-indirect    Memory location whose address is in a register
                         whose name is specified in the instruction
    Register-deferred    Same as register-indirect
    Autoincrement        Register-indirect followed by an automatic increment 
                         of the register contents by the operand size 
    Autodecrement        Register-indirect preceded by an automatic decrement 
                         of the register contents by the operand size
    Displacement         Memory location whose address is the sum of some
                         register's contents and a value specified in 
                         the machine instruction 
    Relative             Just like displacement mode, but the register
                         used is the program counter (with updated value)
    Indexed              Memory location whose address is the sum of some
                         register's contents and another register's 
                         contents multiplied by a small constant that 
                         equals a data item size
    Indexed-indirect     Memory location whose address is in the memory
                         location whose address is the sum of some 
                         register's contents and another register's 
                         contents multiplied by a small constant that
                         equals a data item size
    Indexed-deferred     Same as indexed-indirect


 Remarks:

    Displacement mode is sometimes called indexed mode; in that
    case what we called indexed mode is probably unavailable.

    The program counter is similar to other registers, but 
    changing its contents means performing a jump/branch. 

    The immediate/literal addressing mode is clearly not 
    suitable for memory variables.
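
 To make the table concrete, here is a small Python sketch that looks up
 an operand under several of the modes. The function name, its parameters
 and the dictionary used as "memory" are assumptions made for the example;
 real hardware encodes the mode in the instruction bits.

```python
def operand(mode, regs, mem, value=0, reg=0, index=0, size=4):
    """Return the operand selected by `mode` (hypothetical toy model).

    value : constant stored in the machine instruction itself
    reg   : number of the register named in the instruction
    index : number of the index register (indexed mode only)
    size  : size of one data item in bytes
    """
    if mode == "immediate":
        return value                        # operand is in the instruction
    if mode == "direct":
        return mem[value]                   # instruction holds the address
    if mode == "indirect":
        return mem[mem[value]]              # address of the address
    if mode == "register":
        return regs[reg]                    # operand is in a register
    if mode == "register-indirect":
        return mem[regs[reg]]               # register holds the address
    if mode == "autoincrement":
        result = mem[regs[reg]]             # use the address first ...
        regs[reg] += size                   # ... then step the register
        return result
    if mode == "autodecrement":
        regs[reg] -= size                   # step the register first ...
        return mem[regs[reg]]               # ... then use the address
    if mode == "displacement":
        return mem[regs[reg] + value]       # register + constant
    if mode == "indexed":
        return mem[regs[reg] + regs[index] * size]  # base + index * size
    raise ValueError(f"unknown addressing mode {mode!r}")

mem = {100: 7, 104: 8, 108: 100}   # address -> contents
regs = [100, 2, 0, 0]              # R0 = 100, R1 = 2

print(operand("immediate", regs, mem, value=5))            # 5
print(operand("direct", regs, mem, value=104))             # 8
print(operand("indirect", regs, mem, value=108))           # 7
print(operand("displacement", regs, mem, reg=0, value=4))  # 8
print(operand("indexed", regs, mem, reg=0, index=1))       # 100
```

 Relative mode is the displacement case with the program counter as the
 register, so it needs no separate branch in the sketch.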


 FORTRAN programs and the underlying machine
 -------------------------------------------
 You can't really understand programming and how compilers work 
 without examining the way the HLL source code is translated into 
 machine code. 

 Most compilers offer an option to generate an assembly listing of 
 the translated program. This is often the only way to check and 
 study how the compiler handles subtle points, e.g. whether it does
 short-circuit evaluation, properly handles DO-loop termination, etc.

 A small example program:


      PROGRAM EXMPL
      INTEGER       INT
      READ (*,*) INT 
      INT = INT + 1
      WRITE (*,*) INT
      END


 And this is the translation to the classical assembly 
 language of the VAX, with a little editing to make 
 it more readable:
 

            .TITLE  EXMPL
            .IDENT  01
    
        0000    .PSECT      $CODE
    
 ! PROGRAM EXMPL
    
        0000  EXMPL::
        0000    .WORD       ^M
        0002    MOVAL       $LOCAL, R11
    
 ! READ (*,*) INT 
    
        0009    MNEGL       #4, -(SP)
        000C    CALLS       #1, FOR$READ_SL
        0013    PUSHAL      INT(R11)
        0015    CALLS       #1, FOR$IO_L_R
        001C    CALLS       #0, FOR$IO_END
    
 ! INT = INT + 1 
    
        0023    ADDL3       #1, INT(R11), R12
    
 ! WRITE (*,*) INT 
    
        0027    MNEGL       #1, -(SP)
        002A    CALLS       #1, FOR$WRITE_SL
        0031    PUSHL       INT%R12
        0033    CALLS       #1, FOR$IO_L_V
        003A    CALLS       #0, FOR$IO_END
    
 ! END 
    
        0041    MOVL        #1, R0
        0044    RET     
            .END



 Symbolic memory locations
 -------------------------
 HLL programmers don't have to work directly with all these bewildering
 addressing modes; the compiler uses them in a user-transparent way to
 create the HLL language constructs.

 HLLs (and even assembly languages) implement the following 
 'code mechanisms': 

    1) Explicit/Implicit statement labels, used as targets 
       for control transfer.

    2) Memory variables - memory locations of suitable size,
       that can be referenced by symbolic names instead of
       memory addresses.

 The most direct way to handle statement labels and memory variables 
 is to compute and keep the offsets (e.g. from the program beginning) 
 of the relevant machine instructions while translating the HLL to 
 machine code.

 The two 'mechanisms' are actually implemented using suitable addressing 
 modes, and the above-mentioned 'offset bookkeeping'.
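
 The offset bookkeeping can be sketched as two passes over the translated 
 code. The following Python fragment is an illustration only: the item 
 layout, the instruction sizes and the names LOOP and COUNT are made up 
 for the example and do not come from any real compiler.

```python
# Each item: (label or None, text, size in bytes, symbolic target or None).
# Sizes are invented; real instruction lengths depend on the machine.
unit = [
    (None,    "LOAD  R0, COUNT", 6, "COUNT"),
    ("LOOP",  "ADD   R0, R1",    3, None),
    (None,    "BRNZ  R0, LOOP",  4, "LOOP"),
    (None,    "HALT",            1, None),
    ("COUNT", "data word",       4, None),
]

def assign_offsets(items):
    """Pass 1: record the offset, from the start of the unit, of each label."""
    table, offset = {}, 0
    for label, _text, size, _target in items:
        if label is not None:
            table[label] = offset
        offset += size
    return table

def resolve(items, table):
    """Pass 2: pair each item with its own offset and its target's offset."""
    out, offset = [], 0
    for _label, text, size, target in items:
        target_offset = table[target] if target is not None else None
        out.append((offset, text, target_offset))
        offset += size
    return out

table = assign_offsets(unit)
print(table)  # {'LOOP': 6, 'COUNT': 14}
for row in resolve(unit, table):
    print(row)
```

 Two passes are used because an instruction may refer to a label that is 
 defined later in the code (here the first LOAD refers to COUNT), so the 
 table must be complete before any reference is translated.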


 The relocation problem
 ----------------------
 A program usually contains many references to memory locations,
 either as operands or as targets for jumps/branches.

 It is convenient to have the compiler generate machine code that 
 can be easily combined with other pieces of machine code to form 
 one program. It is also convenient for a program to be able to 
 run no matter where in main memory it was loaded. 

 What we are looking for is a way to write machine code that doesn't
 use explicit memory addresses, so that the code will be able to
 execute no matter where in memory it was loaded. That property is
 called RELOCATABILITY.

 To achieve that, all memory references contained in the program 
 must be either invariant to the memory location in which they will 
 be loaded, or some kind of an address translation process must 
 take place.

 Possible methods to achieve location invariance are referring to 
 memory variables using relative mode, and using displacement mode 
 with some register (called base register) pointing to the beginning 
 of the loaded program. 

 The relative mode method works because while an instruction is 
 executing, the program counter holds an address at a fixed distance 
 from it. If we make memory references relative to the program counter, 
 they will be independent of the place where the program was loaded. 

 In the displacement mode method we just have to load the base register 
 with the address of the loaded program beginning.
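
 The difference between the methods can be seen in a small Python model. 
 It is a sketch under simplifying assumptions: the "program" is a 
 three-word image whose first word is a one-word instruction holding only 
 an address field, memory is a list of words, and the function name 
 run_at is invented for the example.

```python
def run_at(image, load_address, mode, memory_size=64):
    """Load `image` at `load_address` and fetch the operand of its first word."""
    memory = [0] * memory_size
    memory[load_address:load_address + len(image)] = image
    field = memory[load_address]   # address field of the instruction
    pc = load_address + 1          # updated program counter after the fetch
    if mode == "absolute":
        return memory[field]                 # explicit memory address
    if mode == "base":
        base = load_address                  # base register <- program start
        return memory[base + field]          # displacement mode
    if mode == "relative":
        return memory[pc + field]            # displacement from the PC
    raise ValueError(f"unknown mode {mode!r}")

# The variable (value 42) sits at offset 2 from the program beginning.
absolute_image = [2, 0, 42]   # field = absolute address 2
base_image     = [2, 0, 42]   # field = offset 2 from the base register
relative_image = [1, 0, 42]   # field = offset 1 from the updated PC

for address in (0, 5, 30):
    print(address,
          run_at(absolute_image, address, "absolute"),
          run_at(base_image, address, "base"),
          run_at(relative_image, address, "relative"))
# 0 42 42 42
# 5 0 42 42
# 30 0 42 42
```

 The absolute version finds its variable only when the program happens to 
 be loaded at address 0; the other two find it at every load address.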


 The relocation problem between modules
 --------------------------------------
 Using the relative/displacement addressing modes solves most of the
 relocation problems in a single compilation unit. 

 In FORTRAN ....

 In some cases we will have to translate special memory references 
 by adding to them the memory address of the beginning of the program.
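
 That translation step can be sketched in Python as follows. The idea of 
 keeping a list of the offsets that need patching is an assumption of 
 this example (the text above does not say how such references are 
 recorded), and the names relocate and relocation_offsets are invented.

```python
def relocate(image, relocation_offsets, load_address):
    """Return a copy of `image` with the load address added to each listed word.

    relocation_offsets lists the offsets (from the program beginning) of the
    words that hold addresses computed as if the program started at 0.
    """
    patched = list(image)
    for offset in relocation_offsets:
        patched[offset] += load_address
    return patched

# Word 0 holds the address of the data word at offset 3, assuming load address 0.
image = [3, 7, 0, 99]
relocation_offsets = [0]        # only word 0 holds an address

loaded = relocate(image, relocation_offsets, load_address=100)
print(loaded)  # [103, 7, 0, 99]

# After loading at 100, the patched address 103 again designates the data word.
print(loaded[loaded[0] - 100])  # 99
```

 Words that are not listed (plain data such as 7 and 99) are left alone, 
 which is why the positions of the address-holding words must be recorded 
 somewhere alongside the machine code.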



 Symbol resolution
 -----------------



 A program starts execution when the program counter is loaded with the 
 address of its main entry point.